WO2022012164A1 - Method and apparatus for converting voice into rap music, device, and storage medium - Google Patents

Info

Publication number
WO2022012164A1
WO2022012164A1 PCT/CN2021/095236 CN2021095236W WO2022012164A1 WO 2022012164 A1 WO2022012164 A1 WO 2022012164A1 CN 2021095236 W CN2021095236 W CN 2021095236W WO 2022012164 A1 WO2022012164 A1 WO 2022012164A1
Authority
WO
WIPO (PCT)
Prior art keywords
alignment
rhythm
information
period
unit
Prior art date
Application number
PCT/CN2021/095236
Other languages
French (fr)
Chinese (zh)
Inventor
徐雯
Original Assignee
百果园技术(新加坡)有限公司
徐雯
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 徐雯
Publication of WO2022012164A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/111 Automatic composing, i.e. using predefined musical rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/341 Rhythm pattern selection, synthesis or composition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02B CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B20/00 Energy efficient lighting technologies, e.g. halogen lamps or gas discharge lamps
    • Y02B20/40 Control techniques providing energy savings, e.g. smart controller or presence detection

Definitions

  • The present application relates to the technical field of music production, for example, to a method, apparatus, device, and storage medium for converting speech into rap music.
  • Rap culture has gradually entered the public eye.
  • Rap music is characterized by the performer speaking a series of rhythmic words quickly and in time over background music.
  • Producing rap music usually involves a complicated process. People who are not audio-processing professionals need a long time to learn professional audio processing software and to perform complex manual operations in it.
  • The present application provides a method, apparatus, device, and storage medium for converting speech into rap music, so as to solve the problems of restricted speech content and poor conversion results when speech is converted into rap music.
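  • A method for converting speech into rap music is provided, including:
  • recognizing a speech segment and processing background music to obtain text attribute information of the text in the speech segment and music rhythm information of the background music;
  • determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period;
  • controlling, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and applying pitch adjustment and special-effect processing to the aligned audio to form rap audio.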
  • Also provided is an apparatus for converting speech into rap music, including:
  • an information determination module configured to recognize a speech segment and process background music, and obtain text attribute information of the text in the speech segment and music rhythm information of the background music;
  • an alignment information determination module configured to determine at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtain an alignment information table for each alignment period;
  • a conversion control module configured to control the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period, to obtain aligned audio, and to apply pitch adjustment and special-effect processing to the aligned audio to form rap audio.
  • Also provided is a computer device, including:
  • one or more processors; and
  • a storage apparatus configured to store one or more programs,
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the above method for converting speech into rap music.
  • FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 1 of the present application;
  • FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application;
  • FIG. 3 is a flowchart of determining the alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application;
  • FIG. 4 is a flowchart of determining the alignment units and alignment unit information within an alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application;
  • FIG. 5 is an expanded flowchart of determining the alignment units and alignment unit information within an alignment period, provided by Embodiment 2 of the present application;
  • FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music provided by Embodiment 3 of the present application;
  • FIG. 7 is a schematic diagram of a hardware structure of a computer device according to Embodiment 4 of the present application.
  • FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided in Embodiment 1 of the present application.
  • The method is suitable for converting a speech segment recorded by a user into rap music.
  • The method can be performed by an apparatus for converting speech into rap music, where the apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device.
  • A background music selection interface can first be provided to the user to obtain the background music selected by the user. A speech content selection interface can then be provided, through which either a speech segment recorded in real time is obtained when the user triggers the record button, or a pre-recorded speech segment is obtained when the user triggers the upload button. The method for converting speech into rap music provided by this embodiment can then be executed to convert the obtained speech segment into a rap segment with background music.
  • The method for converting speech into rap music includes the following operations:
  • S101 Recognize a speech segment and process background music, and obtain text attribute information of the text in the speech segment and music rhythm information of the background music.
  • The speech segment can be understood as the speech segment, recorded in real time or pre-recorded, that is obtained from the user before S101 is executed, and the background music can be understood as the music to be used that the user selected from the set of background music before S101 is executed.
  • Speech recognition can be performed on the speech segment to obtain the serial number of each character in the speech segment, the pronunciation duration of each character (its start and end times), and the start position of the first vowel in each character. Beat detection can also be performed on the background music to obtain the serial number and position of each rhythm point in the background music, the rhythm points included in each beat period formed by the division, and other related music rhythm information.
  • This embodiment does not limit the methods of speech recognition, text detection, and rhythm point detection, as long as required text attribute information and music rhythm information can be obtained.
  • S102 Determine at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtain an alignment information table for each alignment period.
  • When speech is converted into rap music, the most important step is to align the text in the speech segment with the rhythm points in the background music.
  • Aligning the speech segment with the background music can be regarded as dividing the speech into individual characters and aligning each character with the strongly rhythmic, regularly occurring accent points, possibly repeating some initial, final, or middle characters to strengthen the sense of rhythm. Therefore, when speech is converted into rap music in this embodiment, S102 first determines the alignment periods and the corresponding alignment information tables for aligning the speech segment with the background music.
  • The alignment period can be understood as a minimum repeating unit whose rhythm points can be aligned with all the characters in the speech segment; that is, starting from a time t, the rhythm of the background music repeats with a fixed period T that can accommodate all the characters in the speech segment.
  • The alignment information table can be understood as an information table that records, for one alignment period, the correspondence between the rhythm points used and the characters to be aligned (for example, rhythm point serial number and character serial number) and the speed-change ratio (gear ratio) applied when aligning each character with its rhythm point.
  • The implementation of S102 can be described as follows:
  • The total amount of text in the speech segment can be determined from the text attribute information. The total number of rhythm points in the background music, together with the period information of the beat periods formed by dividing these rhythm points, can be determined from the music rhythm information.
  • The beat period can be understood as the minimum rhythm repeating unit found from the rhythm points; that is, starting from some time, the rhythm of the background music repeats with a fixed period Z.
  • It is then determined whether the beat period satisfies the condition for serving as an alignment period.
  • If it does, each beat period is regarded as an alignment period. If it does not, the period length of the beat period needs to be updated to obtain a beat period that can serve as an alignment period.
  • An alignment period can then be selected at random. The start and end times of the characters and the start position of each character's first vowel, taken from the text attribute information, are combined with the rhythm point information of the rhythm points in that alignment period, extracted from the music rhythm information, to determine the rhythm point with which each character in the speech segment is to be aligned within the alignment period and the speed-change ratio required for that alignment.
  • The information table containing the rhythm point serial numbers, the serial numbers of the associated characters, and the corresponding speed-change ratios is used as the alignment information table of the alignment period.
  • This alignment information table can be regarded as the alignment information table of every complete alignment period. For an incomplete alignment period, part of the alignment information can be extracted from it to form the corresponding alignment information table. In this way, S102 yields at least one alignment period and an alignment information table for each alignment period.
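  • The relationship between the table of a complete alignment period and that of an incomplete one can be sketched as follows. This is only an illustration: the row type `AlignmentRow`, its field names, and the rule that an incomplete period keeps the rows of the rhythm points it actually contains are assumptions, not details given in the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlignmentRow:
    """One row of a (hypothetical) alignment information table."""
    rhythm_offset: int       # 0-based position of the rhythm point inside the period
    char_no: Optional[int]   # serial number of the aligned character, None if unused
    speed_ratio: float       # speed-change ratio applied when aligning


def table_for_incomplete_period(full_table: List[AlignmentRow],
                                points_in_period: int) -> List[AlignmentRow]:
    """Keep only the rows whose rhythm point exists in the shorter period."""
    if points_in_period < 0:
        raise ValueError("points_in_period must be non-negative")
    return [row for row in full_table if row.rhythm_offset < points_in_period]


full = [AlignmentRow(0, 1, 1.0), AlignmentRow(1, 2, 0.8), AlignmentRow(2, 3, 1.2)]
partial = table_for_incomplete_period(full, 2)
```

  • Indexing rows by their offset inside the period, rather than by a global rhythm point serial number, is a design choice made here so that the same table can be reused for every period.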
  • The matching characters and rhythm points can be determined directly from the alignment periods formed by dividing the rhythm points of the background music and from the alignment information table, which records the alignment relationship between the characters of the speech segment and the rhythm points. The characters in the speech segment are then aligned with the rhythm points in the background music, and the speed of the aligned audio is changed according to the corresponding speed-change ratio. The pitch of the speed-changed audio can then be adjusted according to the pitch of the background music, and special effects such as reverb can be added, to form the converted rap audio.
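  • The application does not state how the speed-change ratio is computed. As a hedged illustration only, one plausible convention is the ratio of a character's original duration to the duration of the time slot it must fill, so that a value above 1 means the character is sped up. The function names and the slot model below are assumptions.

```python
def speed_ratio(char_start: float, char_end: float,
                slot_start: float, slot_end: float) -> float:
    """Assumed convention: original duration divided by target slot duration."""
    original = char_end - char_start
    target = slot_end - slot_start
    if original <= 0 or target <= 0:
        raise ValueError("durations must be positive")
    return original / target


def stretched_duration(char_start: float, char_end: float, ratio: float) -> float:
    """Duration of the character after applying the speed-change ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (char_end - char_start) / ratio


# A 0.5 s character squeezed into a 0.25 s slot between two rhythm points.
ratio = speed_ratio(0.0, 0.5, 2.0, 2.25)
new_duration = stretched_duration(0.0, 0.5, ratio)
```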
  • The method for converting speech into rap music provided in Embodiment 1 of the present application first recognizes the speech segment and processes the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music; then determines, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtains the alignment information table of each alignment period; and finally, according to the alignment information table of the at least one alignment period, aligns the text in the speech segment with the rhythm points in the background music to obtain aligned audio, and forms the rap audio after pitch adjustment and special-effect processing of the aligned audio.
  • The above technical solution converts speech clips freely recorded by the user into rap clips matched with background music, simplifies the tedious process of manual audio editing and production, and makes rap music production possible for people who are not audio-processing professionals.
  • The above technical solution places no restriction on the speech content to be converted, which guarantees free recording of that content, simplifies the speech conversion process, avoids misalignment between the characters and the music rhythm points, and broadens the scope of application of converting speech into rap music.
  • As an optional embodiment of Embodiment 1 of the present application, before determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music and obtaining the alignment information table for each alignment period, the method further includes: if the total amount of text in the text attribute information is greater than the total number of rhythm points in the music rhythm information, ending the process of converting the speech segment into rap music and outputting a prompt to re-acquire the speech segment or the background music.
  • By default, S102 and S103 are executed on the condition that the total amount of text in the text attribute information obtained through S101 is less than or equal to the total number of rhythm points in the music rhythm information; that is, the number of characters in the obtained speech segment is less than or equal to the number of rhythm points in the background music.
  • When this condition is not met, the operation of this optional embodiment may be performed: when the total amount of text is determined to be greater than the total number of rhythm points, the subsequent steps of converting the speech segment into rap music are ended, and a prompt to re-record the speech segment is output to inform the user to record again.
  • Alternatively, a prompt to re-select the background music may be output to inform the user to select different background music.
  • This optional embodiment ensures an effective match between the speech segment to be converted and the background music, thereby improving the user experience of converting speech into rap music.
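  • The pre-check described in this optional embodiment can be sketched as follows. The function name, return shape, and prompt wording are illustrative assumptions.

```python
from typing import Optional, Tuple


def check_convertible(total_chars: int,
                      total_rhythm_points: int) -> Tuple[bool, Optional[str]]:
    """Return (ok, prompt). Conversion proceeds only if the text fits the rhythm points."""
    if total_chars > total_rhythm_points:
        # More characters than rhythm points: stop and ask the user to try again.
        return False, "Please re-record the speech segment or re-select the background music."
    return True, None


ok, prompt = check_convertible(total_chars=12, total_rhythm_points=8)
```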
  • FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application.
  • Embodiment 2 is described based on Embodiment 1.
  • In this embodiment, recognizing the speech segment and processing the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music includes: performing noise reduction and endpoint detection on the speech segment selected by the user, and obtaining the character serial number, start and end times, and first-vowel start position of each character in the speech segment, together with the total amount of text, to form the text attribute information of the speech segment; and performing rhythm point detection and beat period division on the background music selected by the user to determine the total number of rhythm points, the rhythm point serial numbers, and the period information of each beat period in the background music, which constitute the music rhythm information of the background music. The period information includes: the period number, the number of rhythm points in the beat period, and the serial number and start time of each rhythm point.
  • In this embodiment, determining at least one alignment period for aligning the speech segment with the background music and obtaining an alignment information table for each alignment period includes: determining, according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information, at least one alignment period for aligning the speech segment with the background music; selecting a complete alignment period as the rhythm segment to be aligned, and determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in that segment; and summarizing the alignment unit information into a current alignment information table of the rhythm segment to be aligned, and determining the alignment information table of each remaining alignment period according to the current alignment information table.
  • The method for converting speech into rap music includes the following operations:
  • S201 Perform noise reduction and endpoint detection on the speech segment selected by the user, and, through speech recognition of the processed speech segment, obtain the character serial number, start and end times, and first-vowel start position of each character in the speech segment, together with the total amount of text, to form the text attribute information of the speech segment.
  • A noise processing strategy from audio processing can be used to reduce noise in the recorded speech segment, an endpoint detection strategy can be used to remove silent segments from the noise-reduced speech segment, and a speech recognition strategy can then be used to recognize the processed speech segment and obtain the relevant information for each character in it.
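  • The application leaves the endpoint detection strategy open. Purely as an illustration, the sketch below removes silent frames with a short-time energy threshold, which is a simple stand-in technique and not the method of the application. The frame length and threshold are arbitrary assumed values.

```python
from typing import List, Sequence


def remove_silence(samples: Sequence[float],
                   frame_len: int = 4,
                   threshold: float = 0.01) -> List[float]:
    """Drop every frame whose mean energy is below the threshold."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept


# Silence, a short burst of speech-like samples, then silence again.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
voiced = remove_silence(signal)
```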
  • The information obtained may include the total amount of text in the entire speech segment, the serial number of each character, the start and end times of each character in the speech segment, and the start position of the first vowel in the pronunciation of each character.
  • The start and end times of a character and the start position of its first vowel can be regarded as relative time points; that is, following the playback order of the whole speech segment, the start time of the first character is taken as 0 seconds.
  • The above information may be recorded as the text attribute information corresponding to the speech segment.
  • Table 1 is a data table of text attribute information. As shown in Table 1, each column can be regarded as a text attribute item, including at least the character serial number, the start time of the character, the start time of the first vowel in the character, and the end time of the character. The number of rows in Table 1 can be regarded as the total number of characters in the speech segment.
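  • A row of Table 1 and the relative-time convention can be sketched as follows. The class name `CharAttr`, its field names, and the sample timings are hypothetical; only the four attribute items and the rule that the first character starts at 0 seconds come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharAttr:
    """One row of the text attribute table (field names are illustrative)."""
    char_no: int              # character serial number
    start: float              # start time of the character, in seconds
    first_vowel_start: float  # start time of the first vowel, in seconds
    end: float                # end time of the character, in seconds


def to_relative_times(chars: List[CharAttr]) -> List[CharAttr]:
    """Shift all times so that the first character starts at 0 seconds."""
    if not chars:
        return []
    origin = chars[0].start
    return [CharAttr(c.char_no, c.start - origin,
                     c.first_vowel_start - origin, c.end - origin)
            for c in chars]


recognized = [CharAttr(1, 1.5, 1.75, 2.0), CharAttr(2, 2.0, 2.25, 2.5)]
table1 = to_relative_times(recognized)
total_chars = len(table1)  # the number of rows is the total amount of text
```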
  • A rhythm point detection strategy from audio processing can be used to first detect the strongly rhythmic accent points (that is, rhythm points) in the background music. A beat division strategy can then be used to find the regularity with which the detected rhythm points occur, and thereby divide the music into beat periods, each being the smallest rhythm repeating unit.
  • Each detected rhythm point has its own attribute information, such as the serial number of the rhythm point, the total number of rhythm points, and the position of the rhythm point (that is, the relative time at which the rhythm point occurs).
  • Corresponding period information is also formed for each beat period.
  • The period information may include the period number, the number of rhythm points in the beat period, and the serial number and start time of each rhythm point. This embodiment can aggregate this information to form the music rhythm information.
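  • The beat division strategy is not specified in the application. As one possible sketch, the smallest repeating unit can be found by looking for the shortest repeating pattern in the intervals between consecutive rhythm points. The function names and the tolerance value are assumptions.

```python
from typing import List


def points_per_beat_period(times: List[float], tol: float = 1e-3) -> int:
    """Return the number of rhythm points in the smallest repeating unit."""
    intervals = [b - a for a, b in zip(times, times[1:])]
    n = len(intervals)
    for p in range(1, n + 1):
        # The interval pattern repeats with period p if each interval
        # equals the one p positions earlier.
        if all(abs(intervals[i] - intervals[i - p]) <= tol for i in range(p, n)):
            return p
    return 1  # reached only when fewer than two rhythm points were given


def split_into_beat_periods(times: List[float], size: int) -> List[List[float]]:
    """Cut the rhythm points into consecutive beat periods; the last may be incomplete."""
    return [times[i:i + size] for i in range(0, len(times), size)]


rhythm_times = [0.0, 0.5, 0.75, 1.5, 2.0, 2.25, 3.0, 3.5]
size = points_per_beat_period(rhythm_times)
periods = split_into_beat_periods(rhythm_times, size)
```

  • In this example the interval pattern 0.5, 0.25, 0.75 repeats, so each complete beat period holds three rhythm points and the last period, with two points, is incomplete.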
  • This embodiment also presents the music rhythm information in the form of a data table.
  • Table 2 is a data table of music rhythm information. Table 2 is a cascaded table: the first column shows the beat periods identified by period number, and the second column shows the rhythm point serial numbers, cascaded under each period number to show which rhythm points that beat period contains (for example, those in the beat period with period number 1), together with the position (start time) of each rhythm point. The number of rhythm points cascaded under a period number is the number of rhythm points in that beat period.
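  • The cascaded layout of Table 2 maps naturally onto a nested structure. The sketch below assumes a dictionary keyed by period number whose values are lists of (rhythm point serial number, start time) pairs; the function name and the sample values are illustrative.

```python
from typing import Dict, List, Tuple

RhythmTable = Dict[int, List[Tuple[int, float]]]


def build_rhythm_table(periods: List[List[float]]) -> RhythmTable:
    """Number the beat periods and rhythm points from 1, in playback order."""
    table: RhythmTable = {}
    point_no = 1
    for period_no, start_times in enumerate(periods, start=1):
        rows = []
        for start in start_times:
            rows.append((point_no, start))
            point_no += 1
        table[period_no] = rows
    return table


table2 = build_rhythm_table([[0.0, 0.5], [1.0, 1.5], [2.0]])
points_in_period_1 = len(table2[1])                    # rhythm points in one period
total_points = sum(len(rows) for rows in table2.values())
```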
  • the following S203 to S205 in this embodiment provide an implementation process of determining an alignment period and an alignment information table for aligning the speech segment with the background music by using the text attribute information and the music rhythm information.
  • S203 Determine at least one alignment period for aligning the speech segment with the background music according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information.
  • The last alignment period determined may be an incomplete period (that is, it does not contain all the text).
  • S203 is equivalent to first dividing the background music coarsely into alignment periods. The division requires the total amount of text in the speech segment and the number of rhythm points in one complete beat period of the background music. Comparing the two determines whether the beat period is used directly as the alignment period or whether the alignment period is obtained by merging beat periods.
  • FIG. 3 is a flowchart of determining an alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application. As shown in FIG. 3, determining at least one alignment period for aligning the speech segment with the background music according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information includes:
  • At least one beat period can be detected in the entire background music.
  • The beat period is considered to be a complete period.
  • The last period formed by the division may be incomplete; that is, it does not contain all the rhythm points of a complete beat period.
  • One beat period can be selected from the complete beat periods, and the number of rhythm points in its period information can be obtained. Every complete beat period has the same number of rhythm points.
  • S2032. Determine whether the number of rhythm points is greater than or equal to the total amount of text. If the number of rhythm points is greater than or equal to the total amount of text, execute S2033; if the number of rhythm points is less than the total amount of text, execute S2034.
  • The purpose of S2032 is mainly to determine whether one complete beat period obtained by the current detection can accommodate all the characters in the speech segment. If it can, S2033 is executed; if it cannot, S2034 needs to be executed.
  • S2033. Take each beat period as an alignment period.
  • When a complete beat period can accommodate all the characters in the speech segment, the beat period can be used directly as an alignment period.
  • Each complete beat period detected can be regarded as a complete alignment period, and an incomplete beat period can be regarded as an incomplete alignment period.
  • S2034. Determine whether the number of beat periods in the background music is greater than 1. If it is greater than 1, execute S2035; if it is not greater than 1, execute S2036.
  • When one complete beat period cannot accommodate all the characters, the beat periods need to be merged, and the precondition for merging is that the background music contains at least two beat periods. S2034 determines whether the number of beat periods in the background music is greater than 1. If it is, the merging condition is satisfied and S2035 can be executed; if it is not, the background music does not match the speech segment and S2036 needs to be executed.
  • When the number of beat periods is greater than 1, the beat periods may be merged in pairs in order of period number to form new beat periods, and the period information of each newly formed beat period changes accordingly.
  • For example, the number of rhythm points in a newly formed beat period is the sum of the numbers of rhythm points in the two beat periods that were merged.
  • After merging, the number of beat periods is half of the original number or, when the original number is odd, half (rounded down) plus 1. The process then returns to S2031 to determine the alignment period again from the period information of the newly formed beat periods, and so on, until a suitable alignment period is found or, if the search fails, the subsequent speech-to-rap conversion is ended.
  • In that case, the speech segment needs to be uploaded or recorded again, or the background music re-selected.
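  • The loop through S2031 to S2036 can be sketched as follows. This is a minimal illustration that represents each beat period as a list of rhythm point serial numbers and takes the first period as the complete one; the function name and the `None` return for a failed search are assumptions.

```python
from typing import List, Optional


def find_alignment_periods(periods: List[List[int]],
                           total_chars: int) -> Optional[List[List[int]]]:
    """Merge beat periods in pairs until one complete period can hold all characters."""
    while periods:
        # S2031/S2032: compare the rhythm points of a complete period with the text.
        if len(periods[0]) >= total_chars:
            return periods              # S2033: each beat period is an alignment period
        if len(periods) <= 1:
            return None                 # S2036: background music does not match the speech
        # S2035: merge neighbouring periods in pairs, in order of period number.
        periods = [sum(periods[i:i + 2], []) for i in range(0, len(periods), 2)]
    return None


beat_periods = [[1, 2], [3, 4], [5, 6], [7, 8], [9]]
aligned = find_alignment_periods(beat_periods, total_chars=3)
failed = find_alignment_periods(beat_periods, total_chars=20)
```

  • With five beat periods of two rhythm points each (the last one incomplete) and three characters, one round of merging yields three periods, the first two complete with four rhythm points and the last incomplete.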
  • S204 Select a complete alignment period as the to-be-aligned rhythm segment, and determine at least one alignment unit and corresponding alignment unit information according to the text attribute information and the rhythm point information of the to-be-aligned rhythm point in the to-be-aligned rhythm segment.
  • Any one alignment period may be used as a reference to determine how each character in the speech segment matches the rhythm points within that alignment period.
  • The match between the rhythm points in a span of time and characters in the speech segment is regarded as an alignment unit. The alignment unit information of each alignment unit includes at least the serial numbers of the rhythm points in the unit, the serial numbers of the characters associated with those rhythm points, and the speed-change ratio. Because all complete alignment periods contain the same number of rhythm points and share the same musical tempo, the alignment units and alignment unit information need only be determined for any one complete alignment period.
  • The process of determining the alignment units and alignment unit information in S204 can be described as follows. First, the alignment period selected for this determination is recorded as the rhythm segment to be aligned, and the rhythm point information in that alignment period is used directly as the rhythm point information of the rhythm points to be aligned. Next, an alignment matching value for aligning characters with rhythm points is determined from the text attribute information and the rhythm point information. Then the alignment range to which the alignment matching value belongs is looked up in a preset rhythm point-text alignment rule table, and, based on the alignment rule corresponding to that range, the alignment units in the rhythm segment to be aligned and the alignment unit information of each alignment unit are determined. The alignment ranges in the rhythm point-text alignment rule table and the corresponding alignment rules can be preset from historical experience.
  • FIG. 4 is a flow chart of the realization of determining alignment units and alignment unit information in an alignment period in a method for converting speech into rap music provided by Embodiment 2 of the present application.
  • Selecting a complete alignment period as the rhythm segment to be aligned, and determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in the rhythm segment to be aligned, may include:
  • S2041. Select a complete alignment period as the rhythm segment to be aligned; based on the rhythm point information of the plurality of rhythm points to be aligned in the rhythm segment to be aligned, form a plurality of rhythm blocks to be aligned in one-to-one correspondence with the rhythm points to be aligned; and record the number of rhythm points to be aligned as the initial number of remaining points.
  • A complete alignment period may be selected from the alignment periods determined above as the to-be-aligned rhythm segment used for determining the alignment information table. The to-be-aligned rhythm points in the to-be-aligned rhythm segment are all the rhythm points included in that complete alignment period, and the rhythm point information of those rhythm points is the rhythm point information of the rhythm points to be aligned.
  • The interval formed between two adjacent to-be-aligned rhythm points may be recorded as a to-be-aligned rhythm block, so that as many to-be-aligned rhythm blocks are formed as there are to-be-aligned rhythm points in the to-be-aligned rhythm segment.
  • The rhythm blocks to be aligned correspond respectively to the rhythm points to be aligned, a block serial number can be set for each rhythm block to be aligned, and the number of rhythm points to be aligned can be recorded as the initial number of remaining points.
  • S2042. Determine the ratio of the number of remaining points to the total amount of characters in the character attribute information, and record the ratio as an alignment matching value.
  • Before any matching has been performed, the rhythm points still to be matched are all the to-be-aligned rhythm points. Therefore, the number of remaining points is initially the number of to-be-aligned rhythm points included in the to-be-aligned rhythm segment.
  • This embodiment presets a rhythm point-text alignment rule table, the rule table is a binary association table, and the two associated objects are the length ratio range and the alignment rule respectively.
  • The length ratio ranges are defined over the ratio of the number of unmatched rhythm points in one alignment period to the total number of characters included in the entire speech segment.
  • Six different length ratio ranges are formed, namely: (0, 0.2], (0.2, 0.8], (0.8, 1], (1, 1.1], (1.1, 1.3] and (1.3, ∞).
  • The length ratio range of the rhythm point-text alignment rule table in which the alignment matching value obtained above lies can then be determined.
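  • The computation of the alignment matching value in S2042 and the lookup of its length ratio range can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names and the `RANGE_UPPER_BOUNDS` constant are hypothetical, and only the six range boundaries are taken from the text.

```python
import math

# Upper bounds of the six half-open ranges (lower, upper] listed in the text.
# The constant name is an assumption made for this sketch.
RANGE_UPPER_BOUNDS = [0.2, 0.8, 1.0, 1.1, 1.3, math.inf]


def alignment_matching_value(remaining_points: int, total_chars: int) -> float:
    """Ratio of the number of remaining rhythm points to the total character count."""
    if total_chars <= 0:
        raise ValueError("the speech segment must contain at least one character")
    return remaining_points / total_chars


def length_ratio_range(value: float) -> tuple:
    """Return the (lower, upper] range that contains the alignment matching value."""
    if value <= 0:
        raise ValueError("the alignment matching value must be positive")
    lower = 0.0
    for upper in RANGE_UPPER_BOUNDS:
        if value <= upper:
            return (lower, upper)
        lower = upper
    raise ValueError("value is not a finite number")


# 8 remaining rhythm points and 5 characters give 1.6, which lies in (1.3, inf).
print(length_ratio_range(alignment_matching_value(8, 5)))
```

  • Each returned range would then serve as the key under which the rule table stores its alignment rule.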
  • S2044. According to the alignment rule corresponding to the length ratio range, determine the rhythm blocks to be aligned that have matching text, and record them as candidate alignment units.
  • Once the length ratio range is known, the alignment rule associated with it can be obtained, and candidate alignment units are divided out of the to-be-aligned rhythm segment by that alignment rule.
  • The matching of text and rhythm points can be regarded as the matching of text and rhythm blocks to be aligned. Based on the alignment rule corresponding to the length ratio range, the matching text can be determined for each rhythm block to be aligned (the number of matched characters is not fixed, but is at least 1), and each matched rhythm block to be aligned can be used as a candidate alignment unit.
  • For the different length ratio ranges, the present embodiment sets corresponding alignment rules.
  • Table 3 provides a preset rhythm point-text alignment rule table. Character matching is performed for the remaining rhythm points (the remaining rhythm blocks to be aligned) according to the alignment rules corresponding to the multiple length ratio ranges in Table 3.
  • S2046 Determine whether the number of remaining points is 0, if the number of remaining points is 0, execute S2047; if the number of remaining points is not 0, return to execute S2042.
  • For a rhythm segment to be aligned, the number of candidate alignment units formed is the same as the number of rhythm points to be aligned that it includes. That is, one rhythm point to be aligned (one rhythm block to be aligned) corresponds to one candidate alignment unit, and the unit serial numbers of the candidate alignment units can be assigned in their alignment order, increasing from 0.
  • This embodiment illustrates the matching with an example.
  • Suppose the number of rhythm points to be aligned in a rhythm segment to be aligned is 8, so the currently determined number of remaining points is 8.
  • Suppose the speech segment obtained from the user includes 5 characters in total, for example "light yellow long skirt", whose five characters are rendered here as "light", "yellow", "color", "long" and "skirt". The process of matching "light yellow long skirt" with the 8 remaining rhythm points to determine each candidate alignment unit is described as follows.
  • The alignment matching value is 8/5 = 1.6, which lies in the length ratio range (1.3, ∞). The associated alignment rule is: "Starting from the first remaining rhythm point, select 10% of the total text, counted from the first character, for repeated matching; then match rhythm points numbering 100% of the total text with the characters in text order; then, for the following rhythm points numbering 20% of the total text, select 20% of the total text, counted from the last character, for repeated matching". Under this rule, 10% of the total text is first selected starting from the first character of "light yellow long skirt", that is, 0.5 characters to repeat. When the number of characters to be repeated is less than 1, it is rounded down, so the current number of characters to be repeated is 0. Matching can therefore start directly from the first remaining rhythm point: rhythm points numbering 100% of the total text are selected and matched with the 5 characters in order.
  • The rhythm blocks to be aligned formed by rhythm points 0-4 thus correspond respectively to the 5 characters "light", "yellow", "color", "long" and "skirt".
  • Then, starting from the last character of "light yellow long skirt", 20% of the total text is selected, that is, the last character "skirt", so the rhythm block to be aligned formed by rhythm point 5 corresponds to the character "skirt". This completes the matching of text and rhythm points according to the alignment rule associated with the length ratio range (1.3, ∞).
  • The unit serial numbers of the currently determined candidate alignment units are 0-5, and the characters corresponding to the six candidate alignment units are: "light", "yellow", "color", "long", "skirt" and "skirt".
  • Two rhythm points now remain unmatched, so the alignment matching value becomes 2/5 = 0.4, which lies in the length ratio range (0.2, 0.8]. The associated alignment rule is: "When L is less than or equal to 0.5, randomly select L * total text characters to be repeated, adjust the positions of the already matched rhythm point-text pairs, and place each repeat directly after the selected character; when L is greater than 0.5, randomly select 50% of the total text to be repeated, adjust the positions of the already matched rhythm point-text pairs, place each repeat directly after the selected character, and fill the remaining (L - 0.5) * total text rhythm points with silent segments, where L is the alignment matching value."
  • The alignment matching value 0.4 is less than 0.5, so 40% of the total text (that is, 2 characters) is selected at random, with the character serial numbers drawn from 0-4.
  • Suppose the selected character serial numbers are 1 and 3, corresponding to "yellow" and "long". The already matched sequence for "light yellow long skirt" must then be adjusted so that each repeated character is placed directly after the selected character.
  • The two remaining rhythm blocks to be aligned are thereby matched with "yellow" and "long" respectively, forming new candidate alignment units for these two characters.
  • After this adjustment of the already matched sequence, the characters corresponding to the 8 candidate alignment units are: "light", "yellow", "yellow", "color", "long", "long", "skirt" and "skirt".
  • The number of remaining unmatched rhythm blocks to be aligned is now 0, that is, the number of remaining points is 0, which satisfies the condition for ending the matching of candidate alignment units, and the above operation can end.
  • Through the above operations, 8 candidate alignment units with unit serial numbers 0-7 are formed. In this way, the alignment and matching of the text in the speech segment to the rhythm segment to be aligned is completed.
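  • The two passes of the worked example can be reproduced with the short sketch below. It is an illustration under stated assumptions rather than the rule table of the application: the function names are hypothetical, English glosses stand in for the five characters, percentages are rounded down in both places (the text states rounding down only for counts below 1), and the randomly picked serial numbers 1 and 3 are passed in explicitly so the result is deterministic.

```python
def first_pass(chars, n_points):
    """Rule for a ratio in (1.3, inf): leading repeats, full text, trailing repeats."""
    total = len(chars)
    lead = total * 10 // 100   # 10% of the text, rounded down
    tail = total * 20 // 100   # 20% of the text, rounded down (assumption)
    sequence = chars[:lead] + chars + chars[total - tail:]
    return sequence[:n_points]  # never use more rhythm points than exist


def repeat_after(sequence, picked_positions):
    """Rule for L <= 0.5: place a repeat directly after each picked position."""
    result = []
    for position, char in enumerate(sequence):
        result.append(char)
        if position in picked_positions:
            result.append(char)
    return result


text = ["light", "yellow", "color", "long", "skirt"]  # glosses of the 5 characters
after_first = first_pass(text, n_points=8)             # fills 6 of the 8 rhythm points
after_second = repeat_after(after_first, {1, 3})       # fills the remaining 2
print(after_second)
```

  • In this example no leading repeat exists, so positions in the matched sequence coincide with the character serial numbers 1 and 3 used in the text.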
  • S2047 Determine at least one alignment unit and obtain a corresponding gear ratio according to the unit duration of each candidate alignment unit and the matching character attribute information of the characters matched by each candidate alignment unit.
  • The number of candidate alignment units determined from the to-be-aligned rhythm segment is the same as the number of to-be-aligned rhythm blocks included in the to-be-aligned rhythm segment. A to-be-aligned rhythm block is the interval from its rhythm point to the adjacent next rhythm point or, mainly for the last rhythm point, to the rhythm end point.
  • That is, the duration of a to-be-aligned rhythm block is the interval between two rhythm points (or between a rhythm point and the rhythm end point).
  • The duration of the to-be-aligned rhythm block may be used as the unit duration of the corresponding candidate alignment unit.
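  • As a small illustration of the unit durations described above, the sketch below derives one duration per rhythm point from the rhythm point times and the rhythm end point. The function name and the sample times are assumptions made for this example.

```python
def unit_durations(rhythm_points, rhythm_end):
    """Duration of each to-be-aligned rhythm block, one block per rhythm point.

    rhythm_points: ascending rhythm point times in seconds.
    rhythm_end: end time of the rhythm segment, used for the last block.
    """
    edges = list(rhythm_points) + [rhythm_end]
    return [later - earlier for earlier, later in zip(edges, edges[1:])]


# Three rhythm points and an end point at 2.0 s give three blocks.
print(unit_durations([0.0, 0.5, 1.0], 2.0))
```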
  • In principle, alignment can be achieved by directly mixing in the pronunciation of the matched text while the audio signal of the candidate alignment unit is played.
  • In practice, however, some characters have a short pronunciation time while the unit duration of the matched candidate alignment unit is long, or some characters have a long pronunciation time while the unit duration of the matched candidate alignment unit is short. To align the text with the candidate alignment unit, the pronunciation rate of the text must be adjusted, for example by stretching the pronunciation time of the text (slowing the pronunciation) or compressing it (speeding up the pronunciation) so that it equals the unit duration.
  • The ratio by which the text needs to be stretched or compressed is recorded as the gear ratio (speed change ratio). The gear ratio required when the matched text is aligned with the corresponding candidate alignment unit can be determined from the unit duration of the candidate alignment unit and the matching text attribute information of the matched text (such as the start and end times of the matched characters and the starting position of the first vowel in each character).
  • The extent to which the pronunciation of the text can be stretched or compressed is limited.
  • If the stretching or compression is excessive, the resulting audio risks distortion. Therefore, in this embodiment, an appropriate range must be set for the compression or stretching of the pronunciation of the text, that is, the gear ratio corresponding to the text must stay within a normal range.
  • This range of ratios can be regarded as the suitable condition for stretching or compression.
  • The gear ratio calculated above can be compared with the suitable condition to determine whether the corresponding candidate alignment unit is suitable as an alignment unit. If it is suitable, the candidate alignment unit is directly determined as an alignment unit, and its gear ratio is taken as the gear ratio of the alignment unit. If it is not suitable, the candidate alignment unit needs to be padded with silence or merged with one or more further candidate alignment units; this processing yields an alignment unit that satisfies the suitable condition, and the gear ratio that satisfies the condition is used as the gear ratio of the alignment unit.
  • The candidate alignment units determined above, equal in number to the rhythm points to be aligned, eventually form at least one alignment unit, and each alignment unit may include at least one rhythm point and at least one matched character.
  • The gear ratio of an alignment unit can be regarded as the ratio by which the included text must be stretched or compressed when it is aligned with the included rhythm points.
  • S2048 Determine the unit serial number of each alignment unit, the initial rhythm point serial number in the included rhythm points, the character serial number of the matched characters, and the gear ratio as the corresponding alignment unit information.
  • Once the alignment units are determined, the unit serial number of each alignment unit and the rhythm point serial number of each rhythm point included in the alignment unit are obtained accordingly.
  • The remaining alignment unit information of each alignment unit can also be obtained.
  • At least one alignment unit included in the rhythm segment to be aligned and the corresponding alignment unit information can be determined through the above S204, and the alignment unit information determined above can be arranged in the order of the unit serial numbers of the alignment units, thereby forming a current alignment information table. Afterwards, the alignment information table of each remaining alignment period determined in the above S203 may be determined according to the current alignment information table.
  • If a remaining alignment period is a complete alignment period, the current alignment information table can be copied directly as its alignment information table. If it is an incomplete alignment period, rows of alignment unit information equal in number to the rhythm points included in that alignment period can be taken from the current alignment information table to form its alignment information table.
  • Table 4 Alignment information table formed based on the information of the alignment unit in an alignment cycle
  • Table 4 is an alignment information table formed based on the information of the alignment units in an alignment period.
  • Each column in the alignment information table corresponds to one item of attribute information of the alignment unit, and may include: the unit serial number, the rhythm point serial number of the starting rhythm point in the alignment unit, the character serial numbers of the matched text, and the gear ratio required for alignment. The number of rows in the alignment information table represents the number of alignment units in the alignment period.
  • Determining the alignment information table of each remaining alignment period according to the current alignment information table includes: for each remaining alignment period, if the alignment period is a complete period, using the current alignment information table as the alignment information table of the alignment period; if the alignment period is an incomplete period and each row in the current alignment information table corresponds to one alignment unit, determining the number of rhythm points included in the alignment period, and selecting that number of rows of alignment unit information from the current alignment information table in reverse order to form the alignment information table of the alignment period.
  • The above describes how the alignment information tables of the remaining alignment periods in the background music are determined. For an incomplete alignment period that includes, for example, 2 rhythm points, two rows of alignment unit information can be selected directly from the current alignment information table, from bottom to top, to form the corresponding alignment information table.
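  • The derivation of a table for a remaining alignment period can be sketched as follows. The function name, the row layout and the sample values are hypothetical. The text says the rows are selected from bottom to top but does not say in which order they are stored afterwards, so this sketch keeps them in their original table order as an assumption.

```python
def table_for_period(current_table, is_complete, points_in_period):
    """Derive the alignment information table of a remaining alignment period.

    current_table: rows of alignment unit information, ordered by unit serial number.
    is_complete: True if the period contains a full set of rhythm points.
    points_in_period: number of rhythm points in the (possibly incomplete) period.
    """
    if is_complete:
        return list(current_table)  # copy the current table unchanged
    count = min(points_in_period, len(current_table))
    # Take `count` rows counted from the bottom of the table.
    return current_table[len(current_table) - count:]


# Hypothetical rows: (unit serial, starting rhythm point serial, character serials, gear ratio)
current = [(0, 0, [0], 1.2), (1, 1, [1], 0.9), (2, 2, [2, 3], 1.0), (3, 4, [4], 1.5)]
print(table_for_period(current, is_complete=False, points_in_period=2))
```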
  • The alignment information table formed for each alignment period includes at least one alignment unit and the corresponding alignment unit information, and each item of alignment unit information includes the serial number of the rhythm point actually used for aligning text and rhythm points, the serial numbers of the matched characters, the gear ratio required for alignment, and so on.
  • The method for converting speech into rap music provided by this embodiment details the determination of the text attribute information and the music rhythm information, and also details the determination of the alignment periods for aligning the speech segment with the background music and of the related alignment information tables.
  • With this method, the matching alignment and variable-speed alignment strategy between characters and rhythm points can be determined from the obtained rhythm point positions, the start and end times of each character, and the start time of each vowel, so that rap music formed by aligning characters with rhythm points can be obtained in a short time through the alignment strategy.
  • As an optional embodiment, when the total amount of rhythm points included in the background music, the rhythm point serial numbers, and the period information of each beat period are determined in the above S202, the following processing is also performed.
  • The method further includes: acquiring a plurality of detected initial rhythm points, and determining the interval duration formed between each two adjacent initial rhythm points; and combining the average character duration with the interval durations to determine the to-be-deleted rhythm points among the plurality of initial rhythm points, and deleting the to-be-deleted rhythm points to obtain the effective rhythm points in the background music.
  • This optional embodiment gives an operation for processing the rhythm points detected from the background music (referred to as initial rhythm points). Through this operation, densely spaced rhythm points, where the interval between two adjacent rhythm points is less than half of the average character duration, can be removed.
  • The average character duration is the ratio of the time occupied by all characters to the total number of characters. Generally speaking, if the interval between two adjacent rhythm points is less than half of the average character duration, it is not conducive to aligning characters with rhythm points. Therefore, either one of the two adjacent rhythm points is deleted, so that the undeleted rhythm point and the rhythm point before or after the deleted rhythm point form a new interval duration. The newly formed interval duration is then checked again in the same manner, so that invalid rhythm points are removed by cyclic updating and valid rhythm points are retained.
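  • The cyclic removal of densely spaced rhythm points can be sketched as a single left-to-right pass. This is a minimal illustration with hypothetical names and sample values. The text allows either point of a too-close pair to be deleted; the sketch always deletes the later one, which is an assumption, and the next interval is then measured from the last retained point, as the text describes.

```python
def prune_rhythm_points(initial_points, char_spans):
    """Drop rhythm points closer than half the average character duration.

    initial_points: ascending detected rhythm point times in seconds.
    char_spans: (start, end) time of every character in the speech segment.
    """
    total_time = sum(end - start for start, end in char_spans)
    threshold = (total_time / len(char_spans)) / 2
    kept = []
    for point in initial_points:
        # The interval is re-measured against the last retained point, so a run
        # of dense points is thinned out step by step.
        if kept and point - kept[-1] < threshold:
            continue
        kept.append(point)
    return kept


# Average character duration 0.5 s, so intervals below 0.25 s are too dense.
spans = [(0.0, 0.5), (0.5, 1.0)]
print(prune_rhythm_points([0.0, 0.125, 0.5, 0.625, 0.7, 1.0], spans))
```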
  • FIG. 5 is an expanded flowchart for determining the alignment units and the alignment unit information in the alignment period, provided by the second embodiment of the present application. As shown in FIG. 5, determining at least one alignment unit and obtaining the corresponding gear ratio according to the unit duration of each candidate alignment unit and the matching text attribute information of the text matched by each candidate alignment unit can be described as follows:
  • This optional embodiment is the execution process of the foregoing S2047. Through the above operation of S2046, a certain number of candidate alignment units can be obtained in the to-be-aligned rhythm segment. The following operations in this optional embodiment can determine the alignment units in the candidate alignment units and the gear ratio corresponding to the alignment units.
  • Each candidate alignment unit in the rhythm segment to be aligned has a corresponding unit serial number, and a candidate alignment unit that has not yet been selected can be chosen, in the order of the unit serial numbers, as the current processing unit. Here, "not yet selected" means not yet selected as the current processing unit.
  • Initially, the first candidate alignment unit is selected as the current processing unit.
  • The alignment of the text with a candidate alignment unit mainly means aligning the actual pronunciation duration of the text with the unit duration of the candidate alignment unit.
  • The two can be aligned by stretching or compressing the pronunciation duration of the text, and the amount of stretching or compression can be determined by a gear ratio.
  • The gear ratio is the ratio of the pronunciation duration required after alignment (the unit duration) to the actual pronunciation duration of the text.
  • The actual pronunciation duration of a character starts from the starting position of its first vowel, and the end of its actual pronunciation can be regarded as the starting position of the first vowel of the adjacent next character. When the text is considered together with the candidate alignment units, the time occupied by the actual pronunciation of all characters matched in a candidate alignment unit runs from the first vowel position of the first matched character in that candidate alignment unit to the first vowel position of the first matched character in the adjacent next candidate alignment unit.
  • The actual pronunciation duration of all characters matched by the current processing unit can be determined from the start and end times, and the first-vowel start positions, of the characters matched respectively by the current processing unit and the adjacent next candidate alignment unit. The current gear ratio of the current processing unit is then obtained from the known unit duration and the determined actual pronunciation duration.
  • Determining the current gear ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and the adjacent next candidate alignment unit, includes:
  • The matching character attribute information of all characters matched in the current processing unit can be obtained.
  • The matching character attribute information can be the start and end times of each character and the starting position of the first vowel of the character. Based on this information, the pronunciation occupation duration of all characters matched in the current processing unit can be determined.
  • Suppose the current processing unit matches one character whose start and end times are t1 and t2 respectively and whose first vowel starts at t3, where t1 < t3 < t2.
  • The pronunciation occupation duration of that character in the current processing unit is then t2 - t3.
  • Suppose instead that the current processing unit matches two characters. The start and end times of the first character are t1 and t2 respectively, and its first vowel starts at t3.
  • The start and end times of the second character are t2 and t4 respectively, and its first vowel starts at t5, where t1 < t3 < t2 < t5 < t4.
  • The pronunciation occupation duration of the two characters in the current processing unit is then t4 - t3, rather than (t2 - t3) + (t4 - t5), which would leave out the interval from t2 to t5.
  • In general, the pronunciation occupation duration of all characters matched in the current processing unit is the sum of the start-to-end durations of all those characters minus the interval from the start of the first character to its first vowel.
  • The pronunciation occupation duration here is not yet the actual pronunciation duration of all characters matched in the current processing unit.
  • The actual pronunciation duration also includes the vowel interval duration of the first character matched by the next candidate alignment unit adjacent to the current processing unit.
  • The vowel interval duration can be obtained through the following S22.
  • The purpose of determining the pronunciation occupation duration in this way is to align the first-vowel position of each character in the speech segment with the rhythm point of the matched candidate alignment unit. Text and rhythm points aligned in this manner play back better than when the start of each character is aligned directly with the rhythm point.
  • Suppose the start and end times of the first character among the characters matched by the next candidate alignment unit adjacent to the current processing unit are t4 and t6, and its first vowel starts at t7, where t1 < t3 < t2 < t5 < t4 < t7 < t6. The vowel interval duration of that first character is then t7 - t4.
  • The actual pronunciation duration of all characters matched in the current processing unit is the sum of the pronunciation occupation duration of all characters matched in the current processing unit and the vowel interval duration of the first character matched by the adjacent next candidate alignment unit.
  • In the example above, the actual pronunciation duration of all characters matched in the current processing unit is (t4 - t3) + (t7 - t4).
  • With t denoting the unit duration of the current processing unit, the current gear ratio of the current processing unit can be expressed as: t / [(t4 - t3) + (t7 - t4)].
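  • The gear ratio formula above can be checked with a small sketch. The `CharTiming` class, the function names and the sample times are hypothetical; the sketch assumes, as in the example, that the characters in a unit follow each other without gaps and that t is the unit duration.

```python
from dataclasses import dataclass


@dataclass
class CharTiming:
    start: float  # start time of the character (s)
    end: float    # end time of the character (s)
    vowel: float  # start of the first vowel, with start <= vowel <= end


def occupation_duration(chars):
    """From the first vowel of the first character to the end of the last one."""
    return chars[-1].end - chars[0].vowel


def gear_ratio(unit_duration, chars, next_first_char):
    """unit_duration / (occupation duration + vowel interval of the next unit's first character)."""
    vowel_interval = next_first_char.vowel - next_first_char.start
    actual_duration = occupation_duration(chars) + vowel_interval
    return unit_duration / actual_duration


# t1=0.0, t3=0.25, t2=0.5, t5=0.75, t4=1.0, t7=1.25, t6=1.5 and unit duration t=2.0
first = CharTiming(start=0.0, end=0.5, vowel=0.25)
second = CharTiming(start=0.5, end=1.0, vowel=0.75)
next_first = CharTiming(start=1.0, end=1.5, vowel=1.25)
print(gear_ratio(2.0, [first, second], next_first))
```

  • Here the actual pronunciation duration is (1.0 - 0.25) + (1.25 - 1.0) = 1.0 s, so a unit duration of 2.0 s means the text would be stretched to twice its length.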
  • After the current gear ratio of the current processing unit is determined through S2, it can be compared with the set first gear ratio value and second gear ratio value to determine whether stretching or compressing the matched text by the current gear ratio meets the normal stretching/compression conditions.
  • This embodiment assumes that a gear ratio between the first gear ratio value and the second gear ratio value satisfies the stretching/compression condition, a gear ratio smaller than the first gear ratio value does not meet the compression condition, and a gear ratio greater than the second gear ratio value does not meet the stretching condition.
  • If the current gear ratio lies between the first gear ratio value and the second gear ratio value, the current gear ratio of the current processing unit satisfies the normal stretching/compression conditions, and the current processing unit can be directly regarded as an alignment unit.
  • If the current gear ratio is greater than the second gear ratio value, the current gear ratio of the current processing unit does not meet the normal stretching condition, which means that the unit duration of the current processing unit is too long relative to the actual pronunciation duration of all matched characters.
  • A mute duration then needs to be added to the current processing unit to increase the actual pronunciation duration of the text.
  • The added mute duration is the start-to-end duration of one character. The actual pronunciation duration is re-determined as the sum of the mute duration and the previously determined actual pronunciation duration.
  • This sum becomes the denominator of the current gear ratio. After that, the process returns to S3 to compare the gear ratio again.
  • If the current gear ratio is smaller than the first gear ratio value, the current gear ratio of the current processing unit does not meet the normal compression condition, which means that the unit duration of the current processing unit is too short relative to the actual pronunciation duration of all matched characters.
  • A candidate alignment unit then needs to be merged into the current processing unit to form a new current processing unit, so as to increase the unit duration of the current processing unit.
  • The candidate alignment unit to be merged is the next candidate alignment unit adjacent to the current processing unit.
  • The unit duration of the newly formed current processing unit is the sum of the unit duration of the original unit and the unit duration of the next candidate alignment unit. The process can then return to S2 to recalculate the actual pronunciation duration of all characters matched in the newly formed current processing unit.
  • When the next candidate alignment unit is merged into the current processing unit, that next candidate alignment unit is also regarded as having been selected.
  • The next candidate alignment unit can therefore be skipped in later selection.
  • That is, the next candidate alignment unit is no longer individually selected as the current processing unit.
  • S7. Determine whether all candidate alignment units have been selected to participate in processing; if all candidate alignment units have been selected, execute S8; if any candidate alignment unit has not been selected, return to S1.
  • After an alignment unit is determined through the above steps, there may still be unselected candidate alignment units in the to-be-aligned rhythm segment, which is checked in S7. If all the candidate alignment units have been selected to participate in the above processing, S8 can be executed. If any candidate alignment unit has not been selected, the process returns to S1 to select an unselected candidate alignment unit and repeat the above operations.
  • Each of the above-determined alignment units and corresponding gear ratios may be aggregated to obtain at least one alignment unit and a gear ratio included in the rhythm section to be aligned.
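  • The loop over candidate alignment units, with silence padding and merging, can be sketched as below. This is a simplified model and not the flow of FIG. 5 itself: each candidate is reduced to a hypothetical pair of unit duration and actual pronunciation duration, a merge simply adds both quantities of the next candidate, the last candidate is accepted as it is when no next candidate exists to merge with, and the threshold and silence values are invented for the example.

```python
def build_alignment_units(candidates, first_value, second_value, silence):
    """Group candidate units until every gear ratio lies in [first_value, second_value].

    candidates: list of (unit_duration, actual_pronunciation_duration) pairs.
    silence: mute duration added per padding step (one character duration in the text).
    Returns a list of (candidate serial numbers in the unit, gear ratio).
    """
    if silence <= 0:
        raise ValueError("silence must be positive, otherwise padding never ends")
    units = []
    index = 0
    while index < len(candidates):
        unit_duration, actual = candidates[index]
        members = [index]
        while True:
            ratio = unit_duration / actual
            if ratio > second_value:
                actual += silence              # too much stretching: pad with silence
            elif ratio < first_value and members[-1] + 1 < len(candidates):
                nxt = members[-1] + 1          # too much compression: merge next unit
                unit_duration += candidates[nxt][0]
                actual += candidates[nxt][1]
                members.append(nxt)
            else:
                break
        units.append((members, unit_duration / actual))
        index = members[-1] + 1                # merged candidates are not selected again
    return units


sample = [(1.0, 0.25), (0.25, 1.0), (0.75, 0.5)]
result = build_alignment_units(sample, first_value=0.5, second_value=2.0, silence=0.25)
print(result)
```

  • In the sample, candidate 0 is padded once with silence (ratio 4.0 becomes 2.0), while candidate 1 is merged with candidate 2 because its ratio of 0.25 is below the first value.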
  • This optional embodiment provides an implementation process for determining the alignment units in the rhythm segment to be aligned and the corresponding gear ratios. Executing this optional embodiment ensures that the rhythm points in the rhythm segment to be aligned are effectively aligned with the text in the speech segment and avoids misalignment between the speech text and the music rhythm points, thereby supporting the conversion of speech to rap music in this embodiment.
  • FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music according to Embodiment 3 of the present application.
  • The apparatus is suitable for converting voice recorded by a user into rap music.
  • The apparatus can be implemented by software or hardware, and can generally be integrated on computer equipment.
  • The apparatus includes: an information determination module 31, an alignment information determination module 32 and a conversion control module 33.
  • The information determination module 31 is configured to recognize the speech segment and process the background music, and to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music. The alignment information determination module 32 is configured to determine, based on the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and to obtain an alignment information table for each alignment period. The conversion control module 33 is configured to control, according to the alignment information table of the at least one alignment period, the alignment of the text in the speech segment with the rhythm points in the background music to obtain aligned audio, and to form the rap audio after performing pitch adjustment and special effects processing on the aligned audio.
  • the device for converting voice into rap music provided by the third embodiment of the present application effectively realizes the conversion of voice content clips randomly recorded by users into voice clips matched with background music, simplifies the tedious process of manually performing audio editing, and provides Non-professional audio processing personnel provide the possibility of rap music production; at the same time, the above technical solution does not need to limit the voice content to be converted, which ensures the free recording of the voice content to be converted, and also simplifies the implementation process of voice conversion, avoiding the need for voice and text to be converted.
  • the situation that the music rhythm points are misplaced has improved the application scope of voice-converted rap music.
  • FIG. 7 is a schematic diagram of the hardware structure of a computer device according to Embodiment 4 of the present application, where the computer device includes: a processor and a storage device. At least one instruction is stored in the storage device, and the instruction is executed by the processor, so that the computer device executes the method for converting speech into rap music according to the above method embodiments.
  • The computer equipment may include: a processor 40, a storage device 41, a display screen 42, an input device 43, an output device 44 and a communication device 45.
  • The number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in FIG. 7.
  • The number of storage devices 41 in the computer device may be one or more, and one storage device 41 is taken as an example in FIG. 7.
  • The processor 40, storage device 41, display screen 42, input device 43, output device 44 and communication device 45 of the computer equipment may be connected by a bus or in other ways. In FIG. 7, connection by a bus is taken as an example.
  • When the processor 40 executes one or more programs stored in the storage device 41, the following operations are implemented: recognizing the speech segment and processing the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music; determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and controlling, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special effects processing on the aligned audio.
  • Embodiments of the present application further provide a computer-readable storage medium, where a program in the storage medium, when executed by a processor of a computer device, enables the computer device to execute the method for converting speech into rap music described in the foregoing embodiments.
  • The method for converting speech into rap music described in the above embodiments includes: recognizing a speech segment and processing background music to obtain text attribute information of the text in the speech segment and music rhythm information of the background music; determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and controlling, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special effects processing on the aligned audio.

Abstract

A method and apparatus for converting voice into rap music, a computer device, and a computer readable storage medium. The method comprises: identifying a voice segment and processing background music to obtain the character attribute information of characters in the voice segment and the music rhythm information of the background music (S101); according to the character attribute information and the music rhythm information, determining at least one alignment period used for aligning the voice segment with the background music, and obtaining the alignment information table of each alignment period (S102); and according to the alignment information table of the at least one alignment period, controlling the characters in the voice segment to be aligned with rhythm points in the background music so as to obtain aligned audio, and forming rap audio after performing tone change adjustment and special effect processing on the aligned audio (S103).

Description

Method, Apparatus, Device and Storage Medium for Converting Speech into Rap Music
This application claims priority to Chinese patent application No. 202010688502.3, filed with the China Patent Office on July 16, 2020, the entire contents of which are incorporated into this application by reference.
Technical Field
The present application relates to the technical field of music production, for example to a method, apparatus, device and storage medium for converting speech into rap music.
Background
With the popularization of karaoke software, research on voice-tuning algorithms and voice-to-music algorithms has attracted growing attention, and interest in automatic voice tuning and speech-to-singing conversion keeps rising. Rap culture has gradually entered the public eye. Rap music is characterized by the creator quickly and rhythmically speaking a series of rhyming words over background music, and producing it often involves a complicated process. For most people who are not audio processing professionals, learning to use professional audio processing software and performing complex manual operations in it both take a long time.
In response to these problems, some voice conversion software suitable for non-audio-processing personnel has appeared. However, different voice conversion software has different defects when converting speech into rap. For example, the speech-to-rap scheme in one such software requires the user to read specific lyrics aloud; because the lyrics match the background music exactly, the alignment positions of words and rhythm points are fixed. This scheme cannot handle lyrics of unknown content or length well, which narrows the user's creative space and in turn limits the software's application prospects. As another example, the speech-to-rap scheme in another voice conversion software uses relatively complex algorithms for audio segmentation and audio alignment, which increases the difficulty of conversion, and it also suffers from misalignment between the speech text and the music rhythm points; this conversion approach is not conducive to effectively processing music uploaded by the user.
Summary
The present application provides a method, apparatus, device and storage medium for converting speech into rap music, so as to solve the problems of restricted speech content and poor conversion quality in the process of converting speech into rap music.
A method for converting speech into rap music is provided, including:
recognizing a speech segment and processing background music to obtain text attribute information of the text in the speech segment and music rhythm information of the background music;
determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period;
controlling, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special effects processing on the aligned audio.
An apparatus for converting speech into rap music is also provided, including:
an information determination module, configured to recognize a speech segment and process background music to obtain text attribute information of the text in the speech segment and music rhythm information of the background music;
an alignment information determination module, configured to determine, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and to obtain an alignment information table for each alignment period;
a conversion control module, configured to control, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and to form rap audio after performing pitch adjustment and special effects processing on the aligned audio.
A computer device is also provided, including:
one or more processors;
a storage device, configured to store one or more programs;
where the one or more programs are executed by the one or more processors, so that the one or more processors implement the above method for converting speech into rap music.
A computer-readable storage medium is also provided, on which a computer program is stored; when the program is executed by a processor, the above method for converting speech into rap music is implemented.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 1 of the present application;
FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application;
FIG. 3 is an implementation flowchart of determining the alignment period in a method for converting speech into rap music provided by Embodiment 2 of the present application;
FIG. 4 is an implementation flowchart of determining the alignment units in an alignment period and the alignment unit information in a method for converting speech into rap music provided by Embodiment 2 of the present application;
FIG. 5 is an expanded flowchart of determining the alignment units in an alignment period and the alignment unit information provided by Embodiment 2 of the present application;
FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music provided by Embodiment 3 of the present application;
FIG. 7 is a schematic diagram of the hardware structure of a computer device provided by Embodiment 4 of the present application.
Detailed Description
The embodiments of the present application are described below with reference to the accompanying drawings.
In the description of the present application, the terms "first", "second", "third" and so on are used only to distinguish similar objects; they do not necessarily describe a specific order or sequence, nor should they be understood as indicating or implying relative importance. The meanings of these terms in the present application can be understood according to the context.
Embodiment 1
FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 1 of the present application. The method is suitable for converting a speech segment recorded by a user into rap music. The method may be executed by an apparatus for converting speech into rap music, where the apparatus may be implemented in software and/or hardware and can generally be integrated on a computer device.
In this application mode, a background music selection interface can first be provided to the user to obtain the background music the user selects. A speech content selection interface can then be provided to obtain either a speech segment the user records in real time by triggering a record button, or a pre-recorded speech segment the user uploads by triggering an upload button. The method for converting speech into rap music provided by this embodiment can then be executed to convert the obtained speech segment into a rap clip that matches the background music.
As shown in FIG. 1, the method for converting speech into rap music provided by Embodiment 1 of the present application includes the following operations:
S101. Recognize a speech segment and process background music to obtain text attribute information of the text in the speech segment and music rhythm information of the background music.
In this embodiment, the speech segment can be understood as a speech clip recorded by the user in real time or in advance and obtained before S101 is executed; the background music can be understood as the music to be used, selected by the user from a background music collection and received before S101 is executed.
In this embodiment, speech recognition can be performed on the speech segment to obtain related text attribute information such as the serial number of each character in the speech segment, the pronunciation duration of each character (its start and end times) and the start position of the first vowel in each character. Beat detection can also be performed on the background music to obtain related music rhythm information such as the serial numbers of the rhythm points in the background music, the positions of the rhythm points, and the number of rhythm points in each beat period formed by division.
This embodiment does not limit the manner of speech recognition, character detection or rhythm point detection, as long as the required text attribute information and music rhythm information can be obtained.
S102. Determine, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtain an alignment information table for each alignment period.
In converting the user's speech into a music clip, apart from the speech recognition and beat detection in S101, the most important step is aligning the text in the speech segment with the rhythm points in the background music. Aligning a speech segment with background music can be regarded as splitting the speech into individual characters and then aligning each character with the strongly rhythmic, regular accent points, possibly accompanied by repetition of some first, last or middle characters to strengthen the sense of rhythm. Therefore, to convert speech into rap music, this embodiment first needs to determine, through S102, the alignment periods used to align the speech segment with the background music and the corresponding alignment information tables.
The alignment period can be understood as the smallest repeating unit that contains rhythm points able to align with all the characters in the speech segment; that is, starting from a time t, the rhythm of the background music repeats with a fixed period T that can align all the characters in the speech segment. The alignment information table can be understood as an information declaration table containing the information needed to align rhythm points with characters within one alignment period, such as the correspondence between rhythm points and the characters to be aligned (for example rhythm point serial numbers and character serial numbers) and the speed change ratios.
S102 can be implemented as follows:
First, the total number of characters in the speech segment can be determined from the text attribute information, and the total number of rhythm points in the background music, together with the period information of the beat periods formed by dividing those rhythm points, can be determined from the music rhythm information. The beat period can be understood as the smallest rhythm repeating unit found from the rhythm points; that is, starting from some time, the rhythm of the background music repeats with a fixed period Z.
Next, from the total number of characters and the number of rhythm points in one beat period, it can be determined whether the beat period satisfies the condition for serving as an alignment period. If it does, each beat period is used directly as an alignment period; if it does not, the period length of the beat period needs to be updated to obtain a beat period that can serve as an alignment period.
Then, because the rhythm repeats across the alignment periods, one alignment period can be selected at random. Combining the character start and end times and first-vowel start positions from the text attribute information with the rhythm point information of the rhythm points in one alignment period extracted from the music rhythm information, the rhythm point to which each character in the speech segment is to be aligned within that alignment period is determined, along with the speed change ratio required when aligning to that rhythm point. This yields an information table containing the rhythm point serial numbers, the serial numbers of the associated characters and the corresponding speed change ratios, which serves as the alignment information table of that alignment period.
Finally, this alignment information table can be used as the alignment information table of every complete alignment period; for an incomplete alignment period, part of the alignment information can be extracted from it to form the corresponding alignment information table. In this way, S102 yields at least one alignment period and the alignment information table corresponding to each alignment period.
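The passage above does not state the exact condition a beat period must meet, nor how its length is updated. As a minimal sketch only, the Python below assumes that a beat period qualifies when it holds at least as many rhythm points as there are characters, and that the update merges whole consecutive beat periods. The function names and the table keys `point`, `char` and `ratio` are hypothetical, chosen for illustration.

```python
import math
from typing import Dict, List


def periods_per_alignment(total_chars: int, points_per_beat_period: int) -> int:
    """Number of beat periods merged into one alignment period.

    Assumption (not from the source): an alignment period needs at least
    one rhythm point per character, and is built from whole beat periods.
    """
    if points_per_beat_period <= 0:
        raise ValueError("a beat period must contain at least one rhythm point")
    return max(1, math.ceil(total_chars / points_per_beat_period))


def partial_table(full_table: List[Dict], points_available: int) -> List[Dict]:
    """Alignment table for an incomplete alignment period.

    Keeps only the rows whose rhythm point (numbered from 1 inside the
    alignment period) still exists in the truncated period.
    """
    return [row for row in full_table if row["point"] <= points_available]


# Illustrative table for one complete alignment period (values are made up).
full = [
    {"point": 1, "char": 1, "ratio": 1.0},
    {"point": 2, "char": 2, "ratio": 1.2},
    {"point": 4, "char": 3, "ratio": 0.8},
]
merged = periods_per_alignment(total_chars=6, points_per_beat_period=4)
tail = partial_table(full, points_available=2)
```

Under these assumptions, six characters over beat periods of four rhythm points each would need two beat periods per alignment period, and a trailing period with only two rhythm points would reuse the first two rows of the full table.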
S103. Control, according to the alignment information table of the at least one alignment period, the text in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and form rap audio after performing pitch adjustment and special effects processing on the aligned audio.
In this embodiment, the matching characters and rhythm points can be determined directly from the alignment periods formed by dividing the rhythm points of the background music and from the alignment information tables, which record the alignment relationship between the characters of the speech segment and the rhythm points. The characters in the speech segment are then aligned with the rhythm points in the background music, and the aligned audio is speed-changed according to the corresponding speed change ratios. After that, the speed-changed audio can be pitch-adjusted according to the pitch of the background music and given special effects such as reverb to form the converted rap audio.
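The text does not define how the speed change ratio is computed or exactly where a character is placed. Purely as an illustration, the sketch below assumes that the first-vowel onset of a character is placed on its rhythm point and that the part of the character from the vowel onward is fitted into the gap before the next rhythm point. The function `place_character` and this fitting rule are assumptions, not the method of the application; pitch adjustment and reverb are not shown.

```python
from typing import Tuple


def place_character(start: float, vowel_start: float, end: float,
                    point_time: float, next_point_time: float) -> Tuple[float, float]:
    """Return (speed_ratio, output_start) for one character.

    speed_ratio > 1 means the character is played faster (shortened);
    output_start is where the sped-up character begins in the output so
    that its first-vowel onset lands exactly on point_time.
    """
    slot = next_point_time - point_time   # time available after the rhythm point
    body = end - vowel_start              # part of the character from the vowel on
    if slot <= 0 or body <= 0:
        raise ValueError("times must be strictly increasing")
    ratio = body / slot                   # assumed rule: fit the body into the slot
    lead_in = (vowel_start - start) / ratio  # pre-vowel part after the speed change
    return ratio, point_time - lead_in


# A 0.75 s character whose vowel starts at 0.25 s, aligned to a rhythm
# point at 1.0 s with the next rhythm point at 1.25 s.
ratio, out_start = place_character(0.0, 0.25, 0.75, 1.0, 1.25)
```

With these illustrative numbers the vowel-onward part (0.5 s) must fit a 0.25 s slot, so the character is played at twice its speed and starts slightly before the rhythm point.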
In the method for converting speech into rap music provided by Embodiment 1 of the present application, the speech segment is first recognized and the background music processed to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music. At least one alignment period for matching the speech segment with the background music is then determined from the text attribute information and the music rhythm information, and an alignment information table is obtained for each alignment period. Finally, according to the alignment information table of the at least one alignment period, the text in the speech segment is aligned with the rhythm points in the background music to obtain aligned audio, and rap audio is formed after pitch adjustment and special effects processing of the aligned audio. This technical solution effectively converts speech content recorded freely by a user into a rap clip matched with background music, simplifies the tedious process of manual audio editing, and makes rap music production possible for people who are not professional audio processors. At the same time, it does not need to restrict the speech content to be converted, which keeps recording free-form, simplifies the implementation of speech conversion, avoids misalignment between the speech text and the music rhythm points, and broadens the application scope of speech-to-rap conversion.
As an optional embodiment of Embodiment 1 of the present application, before determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music and obtaining an alignment information table for each alignment period, the method further includes: if the total number of characters in the text attribute information is greater than the total number of rhythm points in the music rhythm information, ending the processing of converting the speech segment into rap music, and outputting a prompt to obtain a new speech segment or new background music.
In the implementation of the method for converting speech into rap music provided by this embodiment, the default execution condition of S102 and S103 is that the total number of characters in the text attribute information obtained through S101 is less than or equal to the total number of rhythm points in the music rhythm information; that is, the number of characters in the obtained speech segment does not exceed the number of rhythm points in the background music. When this condition is not met, the conditions for continuing the conversion are considered absent, and the operation of this optional embodiment can be performed: when the total number of characters is determined to be greater than the total number of rhythm points, the subsequent steps of converting the speech segment into rap music are ended and a prompt to re-record the speech segment is output to the user. Other optional operations also exist; for example, this optional embodiment may instead output a prompt asking the user to select different background music.
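The precondition described above amounts to a single comparison. The following Python is a minimal sketch of it; the function name and the wording of the prompt are hypothetical.

```python
from typing import Tuple


def check_convertible(total_chars: int, total_rhythm_points: int) -> Tuple[bool, str]:
    """Return (ok, prompt).

    Conversion continues only when the speech segment has no more
    characters than the background music has rhythm points.
    """
    if total_chars > total_rhythm_points:
        # Hypothetical prompt text; the source only says a prompt is output.
        return False, "Please re-record the speech segment or choose other background music."
    return True, ""


ok, prompt = check_convertible(total_chars=12, total_rhythm_points=8)
```

A caller would skip S102 and S103 and show `prompt` to the user whenever `ok` is false.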
The operation of this optional embodiment ensures that the speech segment to be converted can be effectively matched with the background music, thereby improving the user experience of converting speech into rap music.
Embodiment 2
FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application. Embodiment 2 is described on the basis of Embodiment 1. In this embodiment, recognizing the speech segment and processing the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music includes: performing noise reduction and endpoint detection on the speech segment selected by the user, and obtaining, through speech recognition of the processed speech segment, the character serial number, start and end times and first-vowel start position of each character in the speech segment, as well as the total number of characters, which together constitute the text attribute information of the speech segment; and performing rhythm point detection and beat period division on the background music selected by the user to determine the total number of rhythm points in the background music, the rhythm point serial numbers and the period information of each beat period, which together constitute the music rhythm information of the background music. The period information includes: the period number, the number of rhythm points in the beat period, and the serial number and start time of each rhythm point.
In this embodiment, determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music and obtaining an alignment information table for each alignment period includes: determining at least one alignment period for aligning the speech segment with the background music according to the total number of characters in the text attribute information and the period information of each beat period in the music rhythm information; selecting one complete alignment period as the rhythm segment to be aligned, and determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in that rhythm segment; and aggregating the at least one piece of alignment unit information to form the current alignment information table of the rhythm segment to be aligned, and determining the alignment information table of each remaining alignment period according to the current alignment information table.
As shown in FIG. 2, the method for converting speech into rap music provided by Embodiment 2 includes the following operations:
S201. Perform noise reduction and endpoint detection on the speech segment selected by the user, and obtain, through speech recognition of the processed speech segment, the character serial number, start and end times and first-vowel start position of each character in the speech segment, as well as the total number of characters, which constitute the text attribute information of the speech segment.
In this embodiment, a noise processing strategy from audio processing can be applied to denoise the recorded speech segment, and an endpoint detection strategy can be applied to remove silent sections from the denoised speech segment. A speech recognition strategy can then be applied to the processed speech segment to obtain the relevant information of each character that makes up the speech segment.
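The source leaves the endpoint detection strategy open. One common and very simple choice is a frame-energy threshold; the sketch below uses that choice as an assumption and trims only leading and trailing silence. The frame length and threshold values are illustrative, and a production system would more likely use a dedicated voice activity detector.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], frame_len: int = 160,
                 threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing frames that look silent.

    A frame counts as silent when its mean absolute amplitude is below
    `threshold`. Samples are assumed to be floats in [-1.0, 1.0].
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    voiced = [i for i, frame in enumerate(frames)
              if sum(abs(x) for x in frame) / len(frame) >= threshold]
    if not voiced:
        return []
    first, last = voiced[0], voiced[-1]
    return list(samples[first * frame_len:(last + 1) * frame_len])


# Two silent frames, one voiced frame, one silent frame.
signal = [0.0] * 320 + [0.5] * 160 + [0.0] * 160
trimmed = trim_silence(signal)
```

Here only the single voiced frame survives, so `trimmed` holds 160 samples.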
The information obtained above may include the total number of characters in the whole speech segment, the serial number of each character, the start and end times of that character in the speech segment, and the start position of the first vowel in the pronunciation of that character. The start and end times of a character and the start position of its first vowel can all be regarded as relative time points; that is, following the playback order of the whole speech, the start time of the first character can be taken as 0 seconds. In this embodiment, this information is recorded as the text attribute information of the speech segment.
As an example, Table 1 is a data table of text attribute information. Each column in Table 1 can be regarded as one text attribute item and includes at least the character serial number, the start time of the character, the start time of the first vowel in the character and the end time of the character; the number of rows in Table 1 can be regarded as the total number of characters in the speech segment.
Table 1 Text attribute information of the text in the speech segment
Figure PCTCN2021095236-appb-000001
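The row layout described for Table 1 can be sketched as a small record type. This is only an illustrative sketch: the names `CharAttr` and `to_relative` and the sample times are assumptions, not part of the application. It shows one way to shift recognizer timestamps so that the first character starts at 0 seconds.

```python
from dataclasses import dataclass


@dataclass
class CharAttr:
    """One row of the text attribute information (hypothetical field names)."""
    index: int          # character serial number
    start: float        # start time of the character, in seconds
    vowel_start: float  # start time of the first vowel in the character
    end: float          # end time of the character


def to_relative(chars):
    """Shift all times so that the first character starts at 0 s."""
    if not chars:
        return []
    t0 = chars[0].start
    return [
        CharAttr(c.index, c.start - t0, c.vowel_start - t0, c.end - t0)
        for c in chars
    ]


# Illustrative timestamps only; real values would come from speech recognition.
raw = [CharAttr(0, 1.5, 1.75, 2.0), CharAttr(1, 2.0, 2.25, 2.5)]
attrs = to_relative(raw)
total_chars = len(attrs)  # the number of rows is the total number of characters
```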
S202. Perform rhythm point detection and beat period division on the background music selected by the user, and determine the total number of rhythm points contained in the background music, the rhythm point serial numbers, and the period information of each beat period; these constitute the music rhythm information of the background music.
In this embodiment, a rhythm point detection strategy from audio processing may first be used to detect the strongly accented points (that is, the rhythm points) in the background music, and a beat division strategy may then be used to find the regularity with which the detected rhythm points occur, thereby dividing the music into beat periods, each being the smallest repeating rhythmic unit. For a piece of background music, the detected rhythm points carry certain attribute information of their own, such as the serial number of each rhythm point, the total number of rhythm points, and the position of each rhythm point (that is, the relative time at which it occurs). After beat detection, period information is also formed for each beat period. Exemplarily, the period information may include: the period number, the number of rhythm points in the beat period, and the serial number and start time of each of those rhythm points. In this embodiment, this information is aggregated into the music rhythm information.
Exemplarily, this embodiment presents the music rhythm information in the form of a data table. As shown in Table 2, the table is a cascaded table: the first column lists the beat periods identified by period number, and the second column gives the rhythm point serial numbers, showing in cascaded form which rhythm points belong to, for example, the beat period numbered 1. Under each period number are cascaded the serial numbers and positions (that is, start times) of the rhythm points in that beat period, and the number of rhythm point rows cascaded under a period number is the number of rhythm points in that beat period.
Table 2 Music rhythm information corresponding to the background music
Figure PCTCN2021095236-appb-000002
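The cascaded layout described for Table 2 can be sketched as beat periods that each hold their own rhythm points. The class and function names are hypothetical, and grouping by a fixed number of points per period is an assumption standing in for the unspecified beat division strategy; it is chosen only because every complete period is stated to contain the same number of rhythm points.

```python
from dataclasses import dataclass, field


@dataclass
class RhythmPoint:
    index: int    # rhythm point serial number
    start: float  # relative time at which the rhythm point occurs, in seconds


@dataclass
class BeatPeriod:
    number: int                                 # period number
    points: list = field(default_factory=list)  # rhythm points cascaded under it

    @property
    def point_count(self):
        # The number of cascaded rows is the number of rhythm points.
        return len(self.points)


def group_into_periods(points, points_per_period):
    """Assumed stand-in for beat division: fixed-size groups, last may be short."""
    periods = []
    for i in range(0, len(points), points_per_period):
        chunk = points[i:i + points_per_period]
        periods.append(BeatPeriod(number=len(periods) + 1, points=chunk))
    return periods


# Illustrative data: 10 rhythm points, 4 per complete period.
points = [RhythmPoint(i, 0.5 * i) for i in range(10)]
periods = group_into_periods(points, 4)
counts = [p.point_count for p in periods]  # the last period is incomplete
```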
S203 to S205 below describe how the alignment periods and the alignment information table used to align the speech segment with the background music are determined from the text attribute information and the music rhythm information.
S203. Determine, according to the total number of characters in the text attribute information and the period information of each beat period in the music rhythm information, at least one alignment period for aligning the speech segment with the background music.
In this embodiment, it is first determined how many alignment periods, each able to align all the characters in the speech segment, the whole piece of background music contains. When there is more than one alignment period, the last one determined may be an incomplete period (that is, one that does not contain all the characters). S203 thus amounts to a rough first division of the background music into alignment periods. The division relies on the total number of characters in the speech segment and on the number of rhythm points in one complete beat period of the background music: comparing the two determines whether the beat period is used directly as the alignment period or whether alignment periods are obtained by merging beat periods.
FIG. 3 is a flowchart of determining the alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application. As shown in FIG. 3, determining, according to the total number of characters in the text attribute information and the period information of each beat period in the music rhythm information, at least one alignment period for aligning the speech segment with the background music includes:
S2031. Select one complete beat period and obtain the number of rhythm points from the period information corresponding to that beat period.
At least one beat period can be detected in the whole piece of background music. When exactly one beat period is detected, it is regarded as a complete period. When more than one is detected, the last period produced by the division may be incomplete, that is, it does not contain all the rhythm points of a complete beat period. In this embodiment, one beat period is chosen from among the complete beat periods, and the number of rhythm points in its period information is obtained. All complete beat periods contain the same number of rhythm points.
S2032. Determine whether the number of rhythm points is greater than or equal to the total number of characters. If so, perform S2033; if the number of rhythm points is less than the total number of characters, perform S2034.
The purpose of the determination in S2032 is to establish whether one complete beat period, as currently detected, can accommodate all the characters in the speech segment. If it can, S2033 is performed; if it cannot, S2034 is performed.
S2033. Regard each beat period as one alignment period.
When the number of rhythm points is greater than or equal to the total number of characters, the beat period can be used directly as an alignment period. Once one complete beat period is determined to be usable as an alignment period, every other complete beat period detected is likewise regarded as a complete alignment period, and an incomplete beat period is regarded as an incomplete alignment period.
S2034. Determine whether the number of beat periods in the background music is greater than 1. If so, perform S2035; if not, perform S2036.
When the number of rhythm points is less than the total number of characters, one complete beat period cannot accommodate all the characters in the speech segment, and beat periods need to be merged. Merging requires that the background music contain at least two beat periods. S2034 therefore checks whether the number of beat periods in the background music is greater than 1. If it is, the merging condition is met and S2035 is performed; if it is not, the background music does not match the speech segment and S2036 is performed.
S2035. Merge the beat periods in pairs in period-number order to form at least one new beat period, and return to S2031.
In this embodiment, when there is more than one beat period, the beat periods may be merged in pairs in period-number order to form new beat periods, and the period information of each newly formed beat period changes accordingly. Taking Table 2 as an example, if the beat periods numbered 1 and 2 are merged, the number of rhythm points in the new beat period is the sum of the numbers of rhythm points in the two original beat periods. After the pairwise merging, the number of beat periods is half the original number, or half (rounded down) plus one when the original number is odd. The process then returns to S2031 to determine the alignment period from the period information of the newly formed beat periods, and this loop repeats until a suitable alignment period is found or, if the search fails, the subsequent speech-to-rap conversion is terminated.
S2036. End the processing of converting the speech segment into rap music, and output a prompt to obtain a new speech segment or new background music.
In this embodiment, if there is only one beat period and the number of rhythm points in it is less than the total number of characters, the speech segment is considered not to match the selected background music. In that case, through the operation of an optional embodiment of Embodiment 1, the speech segment needs to be uploaded or recorded again, or the background music re-selected.
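The loop through S2031 to S2036 can be sketched as follows. This is a minimal sketch, not the application's implementation: the function name is hypothetical, each beat period is reduced to its rhythm point count, and the first period in the list is assumed to be a complete one (only the last may be incomplete).

```python
def find_alignment_periods(period_counts, total_chars):
    """Return rhythm point counts of the alignment periods, or None on failure.

    period_counts: rhythm point count of each beat period, in period-number order.
    total_chars:   total number of characters in the speech segment.
    """
    counts = list(period_counts)
    if not counts:
        return None
    while True:
        full = counts[0]            # S2031: take a complete beat period
        if full >= total_chars:     # S2032 -> S2033: period holds all characters
            return counts
        if len(counts) <= 1:        # S2034 -> S2036: nothing left to merge
            return None
        # S2035: merge neighbours in pairs; an unpaired last period is kept as is.
        counts = [sum(counts[i:i + 2]) for i in range(0, len(counts), 2)]


merged = find_alignment_periods([4, 4, 4, 4, 2], total_chars=6)
direct = find_alignment_periods([8, 8], total_chars=5)
failed = find_alignment_periods([4], total_chars=6)
```

In the first call, five beat periods become three after one round of merging, which matches the "half (rounded down) plus one" count for an odd number of periods. A `None` result corresponds to S2036, where the user would be prompted for a new speech segment or new background music.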
S204. Select one complete alignment period as the rhythm segment to be aligned, and determine at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in that rhythm segment.
Once the alignment periods have been divided as above, one alignment period can be taken as the reference for determining how each character of the speech segment matches the rhythm points within that period. In this embodiment, the match between the rhythm points contained in a span of time and characters of the speech segment is regarded as one alignment unit. The alignment unit information of each unit includes the serial numbers of the rhythm points in the unit, the serial numbers of the characters matched to those rhythm points, and the speed change ratio required to align those rhythm points with the matched characters.
Each alignment unit therefore has alignment unit information that includes at least the rhythm point serial numbers, the character serial numbers, and the speed change ratio. Moreover, because the alignment periods all contain the same number of rhythm points and have the same musical rhythm, the alignment units and their information need to be determined for only one complete alignment period, which may be any of them.
The determination of the alignment units and alignment unit information in S204 can be described as follows. First, the alignment period selected for this determination is recorded as the rhythm segment to be aligned, and the rhythm point information of that alignment period serves directly as the rhythm point information of the rhythm points to be aligned. Next, an alignment matching value for aligning characters with rhythm points is determined from the text attribute information and the rhythm point information. Then, the alignment range to which the alignment matching value belongs is looked up in a preset rhythm point-text alignment rule table. Finally, the alignment units in the rhythm segment to be aligned, and the alignment unit information of each, are determined according to the alignment rule corresponding to that range. Both the alignment ranges and the corresponding alignment rules in the rhythm point-text alignment rule table can be preset on the basis of historical experience.
FIG. 4 is a flowchart of determining the alignment units in an alignment period and the alignment unit information in the method for converting speech into rap music provided by Embodiment 2 of the present application. As shown in FIG. 4, determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in the rhythm segment to be aligned may include:
S2041. Select one complete alignment period as the rhythm segment to be aligned; based on the rhythm point information of the multiple rhythm points to be aligned in that segment, form multiple rhythm blocks to be aligned in one-to-one correspondence with those rhythm points; and record the number of rhythm points to be aligned as the initial number of remaining points.
In this embodiment, one complete alignment period may be selected from the alignment periods determined above as the rhythm segment to be aligned, for which the alignment information table is to be determined. The rhythm points to be aligned are all the rhythm points in that complete alignment period, and their rhythm point information is the rhythm point information of that period.
In this embodiment, the interval between two adjacent rhythm points to be aligned is recorded as one rhythm block to be aligned. Accordingly, as many rhythm blocks are first formed as there are rhythm points to be aligned in the rhythm segment, so that the rhythm blocks correspond one-to-one with the rhythm points. A block serial number may be assigned to each rhythm block to be aligned, and the number of rhythm points to be aligned is recorded as the initial number of remaining points.
S2042. Determine the ratio of the number of remaining points to the total number of characters in the text attribute information, and record the ratio as the alignment matching value.
To match each rhythm point to be aligned in the rhythm segment with characters of the speech segment, this embodiment first determines the ratio of the number of rhythm points to be aligned that have not yet been matched with a character to the total number of characters, and records this ratio as the alignment matching value.
When no rhythm point in the rhythm segment to be aligned has yet been matched, all of the rhythm points to be aligned need matching. The number of remaining points is therefore initialized to the number of rhythm points to be aligned in the rhythm segment.
S2043. Look up the preset rhythm point-text alignment rule table and determine the length ratio range to which the alignment matching value belongs.
This embodiment presets a rhythm point-text alignment rule table, which is a two-column association table relating length ratio ranges to alignment rules. The length ratio ranges are defined on the ratio of the number of unmatched rhythm points in one alignment period to the total number of characters in the entire speech segment. Based on historical experience, this embodiment forms six length ratio ranges: (0, 0.2], (0.2, 0.8], (0.8, 1], (1, 1.1], (1.1, 1.3] and (1.3, ∞).
In this embodiment, the length ratio range of the rhythm point-text alignment rule table into which the alignment matching value obtained above falls can thus be determined.
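The range lookup in S2042 and S2043 can be sketched with a binary search over the upper bounds of the six ranges. The function and constant names are hypothetical; only the range boundaries come from the text. Because every range is closed on the right, `bisect_left` returns the index of the first upper bound that is greater than or equal to the value, which is exactly the containing range.

```python
from bisect import bisect_left

# Upper bounds of the first five right-closed ranges; the sixth is unbounded.
UPPER_BOUNDS = [0.2, 0.8, 1.0, 1.1, 1.3]
RANGE_LABELS = ["(0,0.2]", "(0.2,0.8]", "(0.8,1]", "(1,1.1]", "(1.1,1.3]", "(1.3,∞)"]


def length_ratio_range(remaining_points, total_chars):
    """Return (alignment matching value, label of the range it falls in)."""
    if remaining_points <= 0 or total_chars <= 0:
        raise ValueError("both counts must be positive")
    value = remaining_points / total_chars  # S2042: alignment matching value
    return value, RANGE_LABELS[bisect_left(UPPER_BOUNDS, value)]  # S2043


first_pass = length_ratio_range(8, 5)   # 8 rhythm points, 5 characters
second_pass = length_ratio_range(2, 5)  # 2 rhythm points still unmatched
```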
S2044. According to the alignment rule corresponding to the length ratio range, determine the rhythm blocks to be aligned that have matched characters, and record them as candidate alignment units.
Once the length ratio range to which the alignment matching value belongs has been determined, the alignment rule associated with that range is obtained, and the rhythm segment to be aligned is divided into candidate alignment units according to that rule.
In this embodiment, matching a character with a rhythm point can be regarded as matching the character with a rhythm block to be aligned. Based on the alignment rule corresponding to the length ratio range, the characters matching each rhythm block to be aligned can be determined (the number of characters is not fixed, but is at least one), and each matched rhythm block then serves as one candidate alignment unit.
This embodiment sets an alignment rule for each length ratio range. Exemplarily, Table 3 gives the preset rhythm point-text alignment rule table. Characters are matched to the remaining rhythm points (the remaining rhythm blocks to be aligned) according to the alignment rules that Table 3 associates with the length ratio ranges.
Table 3 Rhythm point-text alignment rule table
Figure PCTCN2021095236-appb-000003
Figure PCTCN2021095236-appb-000004
S2045. Count the remaining rhythm blocks to be aligned, and take the count as the new number of remaining points.
After one round of alignment matching in S2044, unmatched rhythm blocks to be aligned may still exist. The rhythm blocks remaining in the rhythm segment to be aligned are counted, and the count is taken as the new number of remaining points.
S2046. Determine whether the number of remaining points is 0. If it is 0, perform S2047; if it is not 0, return to S2042.
In this embodiment, if the number of remaining points is 0, no rhythm blocks to be aligned remain in the rhythm segment, that is, all of them have been matched, and S2047 can be performed. If the number of remaining points is not 0, unmatched rhythm blocks still exist in the rhythm segment, and the process returns to S2042 to determine the alignment matching value again.
Following S2046, once all rhythm blocks to be aligned in a rhythm segment have been matched, the number of candidate alignment units formed equals the number of rhythm points to be aligned in that segment. In other words, each rhythm point to be aligned (rhythm block to be aligned) has one corresponding candidate alignment unit, and the candidate alignment units may be given unit serial numbers that increase from 0 in alignment order.
To make the determination of candidate alignment units easier to follow, this embodiment gives an example. Suppose a rhythm segment to be aligned contains 8 rhythm points to be aligned, so that the current number of remaining points is 8, and the speech segment obtained by the user contains 5 characters, for example "淡黄色长裙" ("light yellow long dress"). Matching "淡黄色长裙" to the 8 remaining rhythm points to determine each candidate alignment unit then proceeds as follows:
1) The alignment matching value is 8/5 = 1.6, which falls in the length ratio range (1.3, ∞). The corresponding alignment rule is obtained by looking up Table 3.
2) Match characters to rhythm points according to the alignment rule associated with the length ratio range (1.3, ∞).
The alignment rule is: "Starting from the first character, select characters amounting to 10% of the total number of characters and match them beginning at the first remaining rhythm point; then match a number of remaining rhythm points equal to 100% of the total number of characters to the characters in character order; then, for the following remaining rhythm points numbering 20% of the total number of characters, select characters amounting to 20% of the total number of characters, starting from the last character, for repeated matching." Under this rule, 10% of the total number of characters, that is, 0.5 characters, would first be selected for repetition starting from the first character of "淡黄色长裙". When the number of characters to be repeated is less than 1, it is rounded down, so the number of characters to repeat here is 0. Next, starting directly from the first remaining rhythm point, a number of rhythm points equal to 100% of the total number of characters is selected and matched to the 5 characters in order, so that the rhythm blocks 0-4 formed by rhythm points 0-4 correspond to the characters "淡", "黄", "色", "长" and "裙" respectively. Then, starting from the last character of "淡黄色长裙", characters amounting to 20% of the total (one character, the last character "裙") are selected, so that the rhythm block formed by rhythm point 5 corresponds to "裙". This completes the matching of characters to rhythm points under the rule for the range (1.3, ∞). The candidate alignment units determined so far have unit serial numbers 0-5, and the characters corresponding to these 6 candidate alignment units are "淡", "黄", "色", "长", "裙", "裙".
3) After the above operation, 2 of the 8 rhythm blocks to be aligned remain unmatched, so the number of remaining points is greater than 0 and the alignment matching value is determined again. The new alignment matching value is 2/5 = 0.4, which falls in the length ratio range (0.2, 0.8]. The corresponding alignment rule is obtained by looking up Table 3.
4) Match characters to rhythm points according to the alignment rule associated with the length ratio range (0.2, 0.8].
The alignment rule is: "When L is less than or equal to 0.5, randomly select L × (total number of characters) characters to be repeated, adjust the positions of the already matched rhythm point-character pairs, and add each repetition after the selected character. When L is greater than 0.5, randomly select characters amounting to 50% of the total number of characters to be repeated, adjust the positions of the already matched rhythm point-character pairs, add each repetition after the selected character, and add silent segments for the remaining (L - 0.5) × (total number of characters) rhythm points. Here L is the alignment matching value."
Here the alignment matching value 0.4 is less than 0.5, so 40% of the total number of characters (that is, 2 characters) are selected at random. Suppose the character serial numbers selected at random from 0-4 are 1 and 3, corresponding to "黄" and "长". The sequence "淡黄色长裙裙" matched above is then adjusted so that each repeated character is placed immediately after the selected character. Under this rule, the two remaining rhythm blocks to be aligned are matched with "黄" and "长", forming two new candidate alignment units. After the adjustment, the characters corresponding to the 8 candidate alignment units are "淡", "黄", "黄", "色", "长", "长", "裙", "裙".
5) After the above operation, no unmatched rhythm blocks to be aligned remain, that is, the number of remaining points is 0. The condition for ending candidate alignment unit matching is met, and the operation ends.
After step 5), 8 candidate alignment units with unit serial numbers 0-7 have been formed, which completes the alignment matching of the characters in the speech segment to the rhythm segment to be aligned.
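The worked example above can be replayed in code. This sketch covers only the two rules quoted in the text, since the rest of Table 3 is available only as an image. The function names are hypothetical, and several details are assumptions: all fractional counts are rounded down (the text states this only for counts below 1), the trailing repeat takes the last characters in their original order, and the "random" picks are passed in explicitly so that the result is reproducible.

```python
def match_rule_above_1_3(total_chars):
    """Rule for (1.3, ∞): 10% leading repeat, all characters, 20% trailing repeat.

    Returns the character serial number matched to each rhythm block, in order.
    """
    head = total_chars * 10 // 100  # 10% of the characters, rounded down
    tail = total_chars * 20 // 100  # 20% of the characters, rounded down
    return (
        list(range(head))
        + list(range(total_chars))
        + list(range(total_chars - tail, total_chars))
    )


def match_rule_repeat(order, picks):
    """Rule for (0.2, 0.8] with L <= 0.5: repeat each picked character once,
    placing the repetition right after the first occurrence of that character."""
    out, done = [], set()
    for serial in order:
        out.append(serial)
        if serial in picks and serial not in done:
            out.append(serial)
            done.add(serial)
    return out


chars = list("淡黄色长裙")
rhythm_points = 8

order = match_rule_above_1_3(len(chars))  # first pass, L = 8/5 = 1.6
remaining = rhythm_points - len(order)    # rhythm blocks still unmatched

# Second pass, L = 2/5 = 0.4 <= 0.5. L * total equals `remaining`, so that many
# characters are repeated; serial numbers 1 and 3 stand in for the random choice.
picks = {1, 3}
order = match_rule_repeat(order, picks)
matched = [chars[i] for i in order]       # one character per candidate unit 0-7
```

Passing the picks in explicitly is a design choice for testability; a caller wanting the behaviour described in the text could draw them with `random.sample(range(len(chars)), remaining)`.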
S2047. Determine at least one alignment unit and obtain the corresponding speed change ratio according to the unit duration of each candidate alignment unit and the matched-character attribute information of the characters matched to each candidate alignment unit.
As described above, the number of candidate alignment units determined in the rhythm segment to be aligned equals the number of rhythm blocks to be aligned in that segment. A rhythm block to be aligned is the interval from its rhythm point to the next adjacent rhythm point or, for the last rhythm point, to the rhythm end point; that is, the duration of a rhythm block is the time between two rhythm points (or between the last rhythm point and the rhythm end point). Because each candidate alignment unit corresponds to one rhythm block to be aligned, the duration of the rhythm block is used as the unit duration of the corresponding candidate alignment unit.
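The unit duration described above can be sketched as the difference between consecutive boundaries, where the boundaries are the rhythm point times followed by the rhythm end point. The function name and the sample times are assumptions used only for illustration.

```python
def unit_durations(point_times, end_time):
    """Duration of each rhythm block: the gap to the next rhythm point, or to the
    rhythm end point for the last block. Times are in seconds."""
    boundaries = list(point_times) + [end_time]
    return [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]


# Illustrative rhythm point times within one rhythm segment to be aligned.
durations = unit_durations([0.0, 0.5, 1.0, 2.0], end_time=2.5)
```

The result has one duration per rhythm point, in line with the one-to-one correspondence between rhythm points, rhythm blocks and candidate alignment units.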
Once the characters matched to a candidate alignment unit have been determined, the remaining task is to align the pronunciation of the matched characters with the unit duration of the candidate alignment unit. In general, this alignment can be done directly by mixing in the pronunciation of the matched characters while the audio signal of the candidate alignment unit is played. However, some characters have a short pronunciation time while the matched candidate alignment unit has a long unit duration, and others have a long pronunciation time while the matched unit has a short unit duration. To align the characters with the unit in these cases, the pronunciation rate of the characters must be adjusted, for example by stretching the pronunciation time (slowing the pronunciation) or compressing it (speeding up the pronunciation) so that it equals the unit duration.
In this embodiment, the ratio by which the characters must be stretched or compressed is called the speed-change ratio. The speed-change ratio required to align the matched characters with a candidate alignment unit can be determined from the unit duration of the candidate alignment unit and the matched-character attribute information of the characters matched to it (for example, the start and end times of the matched characters and the start position of the first vowel in each character).
When characters are aligned with a unit by stretching or compressing their pronunciation, the extent of stretching or compression is limited. If the pronunciation were stretched or compressed without limit for the sake of alignment alone, the resulting audio would risk distortion. This embodiment therefore sets a suitable range for compression and stretching, that is, the speed-change ratio of the characters must fall within a normal range of values. This range serves as the suitability condition for stretching or compression.
The calculated speed-change ratio can therefore be compared with the suitability condition to decide whether the corresponding candidate alignment unit is suitable as an alignment unit. If it is suitable, the candidate alignment unit is determined directly as an alignment unit, and its speed-change ratio becomes the speed-change ratio of that alignment unit. If it is not suitable, the candidate alignment unit must be padded with silence, or two or more candidate alignment units must be merged, to obtain an alignment unit that satisfies the suitability condition. The speed-change ratio that passes the suitability check is then used as the speed-change ratio of that alignment unit.
The candidate alignment units determined above, whose number equals the number of rhythm points to be aligned, ultimately form at least one alignment unit. Each alignment unit includes at least one rhythm point and at least one matched character. The speed-change ratio of each alignment unit is the ratio by which its characters must be stretched or compressed when they are aligned with its rhythm points.
S2048: Determine the unit serial number of each alignment unit, the serial number of the starting rhythm point among the rhythm points it includes, the character serial numbers of the matched characters, and the speed-change ratio as the corresponding alignment unit information.
While the alignment units and their speed-change ratios are being determined, the unit serial number of each alignment unit, the rhythm point serial number of each rhythm point included in the unit, and the character serial number of each character matched in the unit are also obtained. This information can be collected for each alignment unit to form the corresponding alignment unit information.
S205: Collect the alignment unit information corresponding to the at least one alignment unit to form the current alignment information table of the rhythm segment to be aligned, and determine the alignment information table of each remaining alignment period according to the current alignment information table.
Through S204, this embodiment determines the at least one alignment unit included in the rhythm segment to be aligned, together with the corresponding alignment unit information. The alignment unit information can be arranged in the order of the unit serial numbers to form a current alignment information table. The alignment information table of each of the other alignment periods determined in S203 can then be derived from the current alignment information table.
For each remaining alignment period: if it is a complete alignment period, the current alignment information table can be copied directly as its alignment information table; if it is an incomplete alignment period, as many rows of alignment unit information as there are rhythm points in that period can be taken from the current alignment information table to form its alignment information table.
Table 4: Alignment information table formed from the alignment unit information in one alignment period
Figure PCTCN2021095236-appb-000005
By way of example, Table 4 is an alignment information table formed from the alignment unit information in one alignment period. As shown in Table 4, each column of the table holds one attribute of the alignment unit: the unit serial number, the rhythm point serial number of the starting rhythm point in the unit, the character serial numbers of the matched characters, and the speed-change ratio required for alignment. The number of rows in the table equals the number of alignment units in the alignment period.
Determining the alignment information table of each remaining alignment period according to the current alignment information table includes the following. For each remaining alignment period, if the alignment period is a complete period, the current alignment information table is used as the alignment information table of that period. If the alignment period is an incomplete period and each row of the current alignment information table corresponds to one alignment unit, the number of rhythm points included in the period is determined, and that number of rows of alignment unit information is selected from the current alignment information table in reverse order to form the alignment information table of the period.
The above describes how the alignment information tables of the other alignment periods in the background music are determined. For an incomplete alignment period that includes, say, 2 rhythm points, two rows of alignment unit information can be selected directly from the current alignment information table, from bottom to top, to form the corresponding alignment information table.
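The derivation of a table for a remaining alignment period might look like the sketch below. The `AlignmentRow` type, the function name, and the sample values are hypothetical, since Table 4 is only available as an image. The sketch also assumes that "reverse order" means taking rows from the bottom of the table while keeping them in their original order; the text does not settle this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentRow:
    """One row of an alignment information table (field names are assumed)."""
    unit_no: int             # unit serial number
    start_rhythm_point: int  # serial number of the starting rhythm point
    char_nos: tuple          # serial numbers of the matched characters
    speed_ratio: float       # speed-change ratio


def table_for_period(current_table, is_complete, n_rhythm_points):
    """Derive the table of a remaining period from the current table."""
    if is_complete:
        return list(current_table)  # complete period: copy the table
    if n_rhythm_points <= 0:
        return []
    # Incomplete period: take the bottom n rows (assumption: order preserved).
    return list(current_table[-n_rhythm_points:])


# Made-up sample table with four alignment units.
current = [
    AlignmentRow(0, 0, (0,), 1.0),
    AlignmentRow(1, 1, (1,), 1.2),
    AlignmentRow(2, 2, (2, 3), 0.8),
    AlignmentRow(3, 3, (4,), 1.1),
]
partial = table_for_period(current, is_complete=False, n_rhythm_points=2)
print([row.unit_no for row in partial])
```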
S206: According to the alignment information table of the at least one alignment period, control the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and form rap audio after applying pitch adjustment and special-effect processing to the aligned audio.
In this embodiment, the alignment information table of each alignment period includes at least one alignment unit and its alignment unit information. Each item of alignment unit information includes the rhythm point serial number, the serial numbers of the matched characters, and the speed-change ratio actually used to align characters with rhythm points. After the alignment information table of each alignment period has been obtained through the above steps, the corresponding rhythm points and matched characters can be aligned at the corresponding speed-change ratio according to the alignment unit information in each table, which aligns the characters in the speech segment with the rhythm points in the background music.
When the characters in the speech segment are aligned with the matched rhythm points, the matching within each alignment period proceeds as follows. First, the audio data in the speech segment that actually corresponds to each alignment unit is obtained according to the occupied pronunciation duration of the characters in that unit (the interval from the first-vowel start point of the unit to the first-vowel start point of the next unit). Next, the speed of that audio data is adjusted according to the speed-change ratio of the alignment unit. Finally, pitch adjustment, special-effect processing, and similar operations can be applied to the speed-adjusted audio data to form the converted rap music.
The method for converting speech into rap music provided in Embodiment 2 of the present application specifies how the character attribute information and the music rhythm information are determined, and how the alignment periods used to align the speech segment with the background music, and the associated alignment information tables, are determined. With this method, after a user selects background music and records speech of arbitrary content, the rhythm point positions, the start and end times of each character, and the vowel start times are used to determine an alignment strategy for matching characters to rhythm points and changing speed, so that rap music with characters aligned to rhythm points can be obtained in a short time. The technical solution simplifies the tedious process of manual audio editing and makes rap music production possible for people without professional audio processing skills. It also places no restriction on the speech content to be converted, so the speech can be recorded freely; it simplifies the implementation of the conversion, avoids misalignment between spoken characters and music rhythm points, and broadens the range of applications of speech-to-rap conversion.
As an optional embodiment of Embodiment 2 of the present application, before the step in S202 of determining the total number of rhythm points in the background music, the rhythm point serial numbers, and the period information of each beat period to form the music rhythm information of the background music, the method further includes: acquiring a plurality of detected initial rhythm points and determining the interval duration between each two adjacent initial rhythm points; and, according to the average character duration of the characters in the speech segment combined with the interval durations, determining the rhythm points to be deleted among the initial rhythm points and deleting them, to obtain the valid rhythm points in the background music.
This optional embodiment describes how the rhythm points detected in the background music are processed. The processing removes, from the detected rhythm points (called initial rhythm points in this optional embodiment), densely spaced rhythm points whose interval to an adjacent rhythm point is less than half of the average character duration.
The average character duration is the ratio of the total duration occupied by all characters to the total number of characters. In general, an interval between two adjacent rhythm points that is less than half of the average character duration hinders the alignment of characters with rhythm points. One of the two adjacent rhythm points therefore needs to be deleted, so that the remaining rhythm point forms a new interval with the rhythm point before or after the deleted one. The new interval can then be checked again in the same way. Repeating this update removes the invalid rhythm points and keeps the valid ones.
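The filtering described above can be sketched as a single pass over the sorted rhythm points. The text allows either of the two close points to be deleted; this sketch assumes the later one is dropped, and the function names and sample values are made up for illustration.

```python
def average_char_duration(total_duration, n_chars):
    """Total duration occupied by all characters divided by their count."""
    return total_duration / n_chars


def filter_rhythm_points(points, avg_char_duration):
    """Keep only rhythm points at least half an average character apart.

    Assumption: of two points that are too close, the later one is dropped.
    Each candidate is compared with the last kept point, so the interval
    formed after a deletion is re-checked automatically.
    """
    min_gap = avg_char_duration / 2.0
    kept = []
    for p in sorted(points):
        if kept and p - kept[-1] < min_gap:
            continue  # too close to the previous valid point: delete it
        kept.append(p)
    return kept


# Made-up data: 5 characters spoken in 1.5 s give an average of 0.3 s.
avg = average_char_duration(1.5, 5)
valid = filter_rhythm_points([0.0, 0.1, 0.5, 0.55, 1.0, 1.4], avg)
print(valid)
```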
Another optional embodiment of Embodiment 2 of the present application describes the execution of S2047. FIG. 5 is an expanded flowchart, provided in Embodiment 2 of the present application, of determining the alignment units and alignment unit information in an alignment period. As shown in FIG. 5, the step of determining at least one alignment unit and obtaining the corresponding speed-change ratio, according to the unit duration of each candidate alignment unit and the matched-character attribute information of the characters matched to each candidate alignment unit, is described as follows.
This optional embodiment is the execution process of S2047. The operation of S2046 yields a certain number of candidate alignment units in the rhythm segment to be aligned. The following operations determine the alignment units among the candidate alignment units and the speed-change ratio of each alignment unit.
S1: Select, in the order of the unit serial numbers, a candidate alignment unit that has not yet been selected as the current processing unit.
In this embodiment, each candidate alignment unit in the rhythm segment to be aligned has a unit serial number. A candidate alignment unit that has not been selected before can be chosen, in the order of the unit serial numbers, as the current processing unit. "Not selected" means that the unit has not previously served as the current processing unit.
By way of example, the first candidate alignment unit is selected as the current processing unit.
S2: Determine the current speed-change ratio of the current processing unit according to the unit duration of the current processing unit, combined with the start and end times and the first-vowel start positions of the characters matched to the current processing unit and to the adjacent next candidate alignment unit.
As described above, aligning characters with a candidate alignment unit means aligning the actual pronunciation duration of the characters with the unit duration of the candidate alignment unit. This is achieved by stretching or compressing the pronunciation duration of the characters, and the amount of stretching or compression is given by a speed-change ratio. The speed-change ratio is the ratio of the pronunciation duration after alignment (the unit duration) to the actual pronunciation duration of the characters.
For a single character, the actual pronunciation starts at the start position of its first vowel, and the end of the actual pronunciation can be taken as the start position of the first vowel of the adjacent next character. Applying this to a candidate alignment unit, the duration actually occupied by the pronunciation of all its matched characters runs from the first-vowel position of the first matched character in the unit to the first-vowel position of the first matched character in the adjacent next candidate alignment unit. The actual pronunciation duration of all characters matched to the current processing unit can therefore be determined from the start and end times and first-vowel start positions of the characters matched to the current processing unit and to the adjacent next candidate alignment unit. The current speed-change ratio of the current processing unit is then obtained from the known unit duration and the determined actual pronunciation duration.
In this embodiment, determining the current speed-change ratio of the current processing unit according to the unit duration of the current processing unit, combined with the start and end times and the first-vowel start positions of the characters matched to the current processing unit and to the adjacent next candidate alignment unit, includes:
S21: Determine the occupied pronunciation duration of all matched characters within the current processing unit according to the start and end times and the first-vowel start position of each character matched to the current processing unit.
This step obtains the matched-character attribute information of all characters matched to the current processing unit. The matched-character attribute information can be the start and end times of each character and the start position of its first vowel. From this information, the occupied pronunciation duration of all matched characters within the current processing unit can be determined.
By way of example, suppose the current processing unit contains only one character, with start and end times t1 and t2 and a first-vowel start position t3, where t1<t3<t2. The occupied pronunciation duration of this character within the current processing unit is then t2-t3.
Suppose instead that the current processing unit contains two characters: the first has start and end times t1 and t2 and a first-vowel start position t3, and the second has start and end times t2 and t4 and a first-vowel start position t5, where t1<t3<t2<t5<t4. The occupied pronunciation duration shared by the two characters within the current processing unit is then t4-t3, or alternatively (t2-t3)+(t4-t5). In the form t4-t3, the occupied pronunciation duration of all characters matched to the current processing unit equals the sum of the start-to-end durations of all the characters minus the interval from the start of the first character to its first vowel.
The occupied pronunciation duration here is not the actual pronunciation duration of all characters matched to the current processing unit. The actual pronunciation duration also includes the vowel interval duration of the first character matched to the candidate alignment unit adjacent to and following the current processing unit, which is obtained in S22 below. The occupied pronunciation duration is determined in this way so that the first-vowel position of each character in the speech segment is aligned to the rhythm point of its matched candidate alignment unit. Characters and rhythm points aligned in this way play back better than when the start of the character is aligned directly to the rhythm point.
S22: Determine the vowel interval duration of the first character matched to the candidate alignment unit adjacent to and following the current processing unit, according to the start and end times of that character and the start position of its first vowel.
Taking the case where the current processing unit contains two characters, suppose the first character matched to the adjacent next candidate alignment unit has start and end times t4 and t6 and a first-vowel start position t7, where t1<t3<t2<t5<t4<t7<t6. The vowel interval duration of that first character is then t7-t4.
From S21 and S22, the actual pronunciation duration of all characters matched to the current processing unit is the sum of the occupied pronunciation duration of all characters matched within the current processing unit and the vowel interval duration of the first character matched to the adjacent next candidate alignment unit. In the example, the actual pronunciation duration of all characters matched to the current processing unit is (t4-t3)+(t7-t4).
S23: Use the ratio of the unit duration of the current processing unit to the determined actual pronunciation duration as the current speed-change ratio of the current processing unit, where the actual pronunciation duration is the sum of the occupied pronunciation duration and the vowel interval duration.
Suppose the unit duration of the current processing unit is t. The current speed-change ratio of the current processing unit can then be expressed as t/[(t4-t3)+(t7-t4)].
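As a worked illustration of S21 to S23, the sketch below computes the speed-change ratio from the time stamps used in the example. The function and parameter names are hypothetical, and the numeric values are made up so that the arithmetic is easy to follow.

```python
def speed_change_ratio(unit_duration, first_vowel_start, last_char_end,
                       next_char_start, next_vowel_start):
    """Ratio of the unit duration to the actual pronunciation duration.

    first_vowel_start: first-vowel start of the unit's first character (t3)
    last_char_end:     end time of the unit's last character (t4)
    next_char_start:   start time of the next unit's first character (t4)
    next_vowel_start:  first-vowel start of that character (t7)
    """
    occupied = last_char_end - first_vowel_start      # S21: t4 - t3
    vowel_gap = next_vowel_start - next_char_start    # S22: t7 - t4
    return unit_duration / (occupied + vowel_gap)     # S23


# Made-up values: t3 = 0.25 s, t4 = 0.75 s, t7 = 1.0 s, unit duration t = 1.5 s.
ratio = speed_change_ratio(1.5, 0.25, 0.75, 0.75, 1.0)
print(ratio)
```

With these values the actual pronunciation duration is 0.5 + 0.25 = 0.75 s, so the characters would need to be stretched to twice their spoken length to fill the unit.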
S3: Compare the current speed-change ratio with a set first speed-change ratio value and a set second speed-change ratio value, where the second value is greater than the first value.
After the current speed-change ratio of the current processing unit has been determined in S2, it can be compared with the set first and second speed-change ratio values to decide whether stretching or compressing the matched characters by the current speed-change ratio satisfies the normal stretching/compression condition.
In this embodiment, a speed-change ratio between the first and second values satisfies the stretching/compression condition, a ratio smaller than the first value does not satisfy the compression condition, and a ratio greater than the second value does not satisfy the stretching condition.
S4: If the current speed-change ratio is greater than or equal to the first speed-change ratio value and less than or equal to the second speed-change ratio value, determine the current processing unit as an alignment unit, record the current speed-change ratio as the speed-change ratio of that alignment unit, and then execute S7.
When the current speed-change ratio is greater than or equal to the first value and less than or equal to the second value, the current speed-change ratio of the current processing unit is considered to satisfy the normal stretching/compression condition. The current processing unit is then taken directly as an alignment unit, the current speed-change ratio is taken as the speed-change ratio of that alignment unit, and execution jumps to S7.
S5: If the current speed-change ratio is greater than the second speed-change ratio value, determine a silence duration for padding the current processing unit, determine a new current speed-change ratio according to the silence duration, and then execute S3.
When the current speed-change ratio is greater than the second value, the current speed-change ratio of the current processing unit is considered not to satisfy the normal stretching condition. In this case the unit duration of the current processing unit is too long relative to the actual pronunciation duration of all the matched characters, and a silence duration needs to be padded into the current processing unit to increase the actual pronunciation duration of the characters.
In one embodiment, the padded silence duration is set to the start-to-end duration of one character. The silence duration is combined with the already determined actual pronunciation duration, and a new current speed-change ratio is determined with the unit duration as the numerator and the sum of the silence duration and the actual pronunciation duration as the denominator. Execution then returns to S3 to compare the speed-change ratio again.
S6: If the current speed-change ratio is less than the first speed-change ratio value, merge the current processing unit with the adjacent next candidate alignment unit to form a new current processing unit, and return to S2.
When the current speed-change ratio is less than the first value, the current speed-change ratio of the current processing unit is considered not to satisfy the normal compression condition. In this case the unit duration of the current processing unit is too short relative to the actual pronunciation duration of all the matched characters, and another candidate alignment unit needs to be merged into the current processing unit to form a new current processing unit with a longer unit duration.
In one embodiment, the candidate alignment unit to be merged is the one adjacent to and following the current processing unit. The unit duration of the newly formed current processing unit is the sum of the original unit duration and the unit duration of the next candidate alignment unit. Execution then returns to S2 to recalculate the actual pronunciation duration of all characters matched to the newly formed current processing unit.
When the next candidate alignment unit is merged into the current processing unit, that next candidate alignment unit is also regarded as selected. When S1 is executed later, it is skipped and is not selected separately as the current processing unit.
S7. Determine whether all candidate alignment units have been selected for processing. If all of them have been selected, execute S8; if any candidate alignment unit has not yet been selected, return to S1.
After an alignment unit has been determined through the above steps, the rhythm segment to be aligned may still contain candidate alignment units that have not been selected, so a check is made in S7. If all candidate alignment units have been selected and have taken part in the above processing, S8 is executed; otherwise, the process returns to S1 to select a candidate alignment unit that has not yet been selected and repeats the above operations.
S8. Collect each determined alignment unit together with its corresponding speed-change ratio.
Each alignment unit determined above and its corresponding speed-change ratio are collected to obtain the at least one alignment unit, and the associated speed-change ratios, contained in the rhythm segment to be aligned.
This optional embodiment describes how the alignment units in the rhythm segment to be aligned, and their speed-change ratios, are determined. Carrying it out ensures that the rhythm points in the rhythm segment to be aligned are effectively aligned with the characters in the speech segment and avoids misalignment between spoken characters and musical rhythm points, which provides effective theoretical support for the speech-to-rap conversion of this embodiment.
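The overall loop of this optional embodiment can be sketched as follows. This is a rough illustration under stated assumptions, not the applicant's implementation: the `Char` and `Candidate` types are hypothetical, the formula in `actual_duration` is one plausible reading of the "pronunciation occupation duration plus vowel interval duration" wording, the silence rule (pad just enough to bring the ratio down to the second value) is assumed, and a unit whose ratio is still too small when no next unit remains is simply accepted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Char:
    start: float        # start time of the character in the speech segment (s)
    end: float          # end time of the character (s)
    vowel_start: float  # start position of the character's first vowel (s)


@dataclass
class Candidate:
    duration: float     # unit duration of the candidate alignment unit (s)
    chars: List[Char]   # characters matched to the unit (assumed non-empty)


def actual_duration(chars: List[Char], next_first: Optional[Char]) -> float:
    # Assumed formula: speech from the unit's first vowel onset to the end of
    # its last character, plus the lead-in before the next unit's first vowel.
    occupied = chars[-1].end - chars[0].vowel_start
    gap = (next_first.vowel_start - next_first.start) if next_first else 0.0
    return occupied + gap


def determine_units(cands: List[Candidate], r1: float, r2: float
                    ) -> List[Tuple[List[int], float, float]]:
    """Return (merged candidate indices, speed-change ratio, silence) per unit."""
    units = []
    i, n = 0, len(cands)
    while i < n:                                  # S1 / S7: next unselected unit
        duration, chars, j = cands[i].duration, list(cands[i].chars), i + 1
        while True:
            nxt = cands[j].chars[0] if j < n else None
            spoken = actual_duration(chars, nxt)  # assumed to be positive
            ratio = duration / spoken             # S2: unit duration / speech
            if ratio < r1 and j < n:              # S6: merge, then recompute
                duration += cands[j].duration
                chars += cands[j].chars
                j += 1
                continue
            silence = 0.0
            if ratio > r2:                        # ratio above the second value
                silence = duration / r2 - spoken  # assumed padding rule
                ratio = r2
            units.append((list(range(i, j)), ratio, silence))
            break
        i = j                                     # merged units are skipped
    return units                                  # S8: collected result
```

Here `r1` and `r2` stand for the first and second speed-change ratio values. A ratio below `r1` means the rhythm slot is too short for the speech, which triggers a merge; a ratio above `r2` means the slot is too long, which is handled by padding with silence.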
Embodiment 3
FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music according to Embodiment 3 of the present application. The apparatus is suitable for converting speech recorded by a user into rap music, can be implemented in software or hardware, and can generally be integrated into a computer device. As shown in FIG. 6, the apparatus includes an information determination module 31, an alignment information determination module 32, and a conversion control module 33.
The information determination module 31 is configured to recognize a speech segment and process background music to obtain text attribute information of the characters in the speech segment and music rhythm information of the background music. The alignment information determination module 32 is configured to determine, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and to obtain an alignment information table for each alignment period. The conversion control module 33 is configured to control, according to the alignment information table of the at least one alignment period, the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and to form rap audio after performing pitch adjustment and special-effect processing on the aligned audio.
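How the three modules hand data to one another can be sketched in Python. The `RapConverter` class, its field names, and the use of `bytes` for audio are assumptions made for illustration; the application describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class RapConverter:
    # Module 31: speech + music -> (text attribute info, music rhythm info)
    determine_info: Callable[[bytes, bytes], Tuple[Any, Any]]
    # Module 32: both info objects -> one alignment table per alignment period
    determine_alignment: Callable[[Any, Any], List[Any]]
    # Module 33: align, then pitch-adjust and add effects -> rap audio
    convert: Callable[[bytes, bytes, List[Any]], bytes]

    def run(self, speech: bytes, music: bytes) -> bytes:
        text_info, rhythm_info = self.determine_info(speech, music)
        tables = self.determine_alignment(text_info, rhythm_info)
        return self.convert(speech, music, tables)
```

Passing the modules in as callables keeps the sketch independent of any particular speech recognizer, beat detector, or audio effects library.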
The apparatus for converting speech into rap music provided by Embodiment 3 of the present application effectively converts speech clips freely recorded by a user into rap clips that match the background music, simplifies the tedious process of manual audio editing, and makes rap music production possible for people without professional audio processing skills. In addition, the technical solution places no restriction on the speech content to be converted, so the content can be recorded freely; it also simplifies the implementation of speech conversion, avoids misalignment between spoken characters and musical rhythm points, and broadens the range of applications of speech-to-rap conversion.
Embodiment 4
FIG. 7 is a schematic diagram of the hardware structure of a computer device according to Embodiment 4 of the present application. The computer device includes a processor and a storage device. The storage device stores at least one instruction, and the instruction is executed by the processor so that the computer device performs the method for converting speech into rap music described in the above method embodiments.
Referring to FIG. 7, the computer device may include a processor 40, a storage device 41, a display screen 42, an input device 43, an output device 44, and a communication device 45. The computer device may have one or more processors 40; FIG. 7 shows one processor 40 as an example. The computer device may have one or more storage devices 41; FIG. 7 shows one storage device 41 as an example. The processor 40, storage device 41, display screen 42, input device 43, output device 44, and communication device 45 of the computer device may be connected by a bus or in other ways; FIG. 7 shows a bus connection as an example.
In an embodiment, when the processor 40 executes one or more programs stored in the storage device 41, the following operations are implemented: recognizing a speech segment and processing background music to obtain text attribute information of the characters in the speech segment and music rhythm information of the background music; determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and controlling, according to the alignment information table of the at least one alignment period, the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special-effect processing on the aligned audio.
An embodiment of the present application further provides a computer-readable storage medium. When a program in the storage medium is executed by a processor of a computer device, the computer device is able to perform the method for converting speech into rap music described in the above embodiments. By way of example, that method includes: recognizing a speech segment and processing background music to obtain text attribute information of the characters in the speech segment and music rhythm information of the background music; determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and controlling, according to the alignment information table of the at least one alignment period, the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special-effect processing on the aligned audio.
Because the apparatus, computer device, and storage medium embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant details, refer to the corresponding parts of the method embodiments.

Claims (13)

  1. A method for converting speech into rap music, comprising:
    recognizing a speech segment and processing background music to obtain text attribute information of the characters in the speech segment and music rhythm information of the background music;
    determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period;
    controlling, according to the alignment information table of the at least one alignment period, the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special-effect processing on the aligned audio.
  2. The method according to claim 1, wherein recognizing the speech segment and processing the background music to obtain the text attribute information of the characters in the speech segment and the music rhythm information of the background music comprises:
    performing noise reduction and endpoint detection on the speech segment selected by a user, and obtaining, through speech recognition of the processed speech segment, the character serial number of each of a plurality of characters in the processed speech segment, the start and end times of each character, the start position of the first vowel of each character, and the total number of the plurality of characters, which together constitute the text attribute information of the speech segment;
    performing rhythm point detection and beat period division on the background music selected by the user, and determining the total number of the plurality of rhythm points contained in the background music, the rhythm point serial number of each rhythm point, and the period information of each of a plurality of beat periods contained in the background music, which together constitute the music rhythm information of the background music;
    wherein the period information comprises: the period number of each beat period, the number of rhythm points included in each beat period, and the rhythm point serial number and rhythm point start time of each rhythm point included in each beat period.
  3. The method according to claim 2, further comprising, before determining the total number of the plurality of rhythm points contained in the background music, the rhythm point serial number of each rhythm point, and the period information of each of the plurality of beat periods contained in the background music to constitute the music rhythm information of the background music:
    obtaining a plurality of initial rhythm points detected from the background music, and determining the interval duration between every two adjacent initial rhythm points;
    determining, according to the average character duration of the plurality of characters in the speech segment and the interval durations, the rhythm points to be deleted among the plurality of initial rhythm points, and deleting the rhythm points to be deleted to obtain the valid rhythm points in the background music.
  4. The method according to claim 2, wherein determining, according to the text attribute information and the music rhythm information, the at least one alignment period for aligning the speech segment with the background music, and obtaining the alignment information table for each alignment period, comprises:
    determining, according to the total number of characters in the text attribute information and the period information of each beat period in the music rhythm information, the at least one alignment period for aligning the speech segment with the background music;
    selecting one complete alignment period as a rhythm segment to be aligned, and determining, according to the text attribute information and the rhythm point information of the rhythm points to be aligned in the rhythm segment to be aligned, at least one alignment unit and the alignment unit information corresponding to each alignment unit;
    collecting the alignment unit information corresponding to the at least one alignment unit to form the alignment information table of the rhythm segment to be aligned, and determining, according to that alignment information table, the alignment information tables of the alignment periods other than the one complete alignment period among the at least one alignment period.
  5. The method according to claim 4, wherein determining, according to the total number of characters in the text attribute information and the period information of each beat period in the music rhythm information, the at least one alignment period for aligning the speech segment with the background music comprises:
    determining whether the number of rhythm points in the period information corresponding to one complete beat period of the background music is greater than or equal to the total number of characters;
    in response to the number of rhythm points in the period information corresponding to the one complete beat period being greater than or equal to the total number of characters, treating each beat period as one alignment period;
    in response to the number of rhythm points in the period information corresponding to the one complete beat period being less than the total number of characters, and where the background music includes more than one beat period, merging the beat periods in pairs in order of period number to form at least one new beat period, and returning to determine whether the number of rhythm points in the period information corresponding to one complete new beat period in the music rhythm information is greater than or equal to the total number of characters.
  6. The method according to claim 4, wherein determining, according to the alignment information table, the alignment information tables of the alignment periods other than the one complete alignment period among the at least one alignment period comprises:
    for each alignment period other than the one complete alignment period among the at least one alignment period, where that alignment period is a complete period, using the alignment information table of the rhythm segment to be aligned as the alignment information table of that alignment period;
    where that alignment period is an incomplete period and each row of the alignment information table of the rhythm segment to be aligned corresponds to one alignment unit, determining the number of rhythm points included in that alignment period, selecting, in reverse order from the alignment information table of the rhythm segment to be aligned, as many rows of alignment unit information as that number of rhythm points, and using the selected rows of alignment unit information as the alignment information table of that alignment period.
  7. The method according to claim 4, wherein determining, according to the text attribute information and the rhythm point information of the rhythm points to be aligned in the rhythm segment to be aligned, the at least one alignment unit and the alignment unit information corresponding to each alignment unit comprises:
    forming, based on the rhythm point information of a plurality of rhythm points to be aligned in the rhythm segment to be aligned, a plurality of rhythm blocks to be aligned in one-to-one correspondence with the plurality of rhythm points to be aligned, and recording the number of the plurality of rhythm points to be aligned as the initial number of remaining points;
    determining the ratio of the number of remaining points to the total number of characters in the text attribute information, and recording the ratio as an alignment matching value;
    looking up a preset rhythm point-to-text alignment rule table, and determining the length ratio range in which the alignment matching value falls in the rule table;
    determining, according to the alignment rule corresponding to the length ratio range, the rhythm blocks to be aligned that have matched characters, and recording those rhythm blocks as candidate alignment units;
    counting the rhythm blocks to be aligned that have not been recorded as candidate alignment units, taking that count as the new number of remaining points, and returning to the operation of determining the ratio of the number of remaining points to the total number of characters in the text attribute information and recording the ratio as the alignment matching value, until the final number of remaining points is 0;
    determining, according to the unit duration of each candidate alignment unit and the matched-text attribute information of the characters matched to each candidate alignment unit, the at least one alignment unit, and obtaining the speed-change ratio corresponding to each alignment unit;
    determining the unit serial number of each alignment unit, the serial number of the starting rhythm point among the rhythm points it includes, the character serial numbers of the matched characters, and the speed-change ratio as the alignment unit information corresponding to that alignment unit.
  8. The method according to claim 7, wherein determining, according to the unit duration of each candidate alignment unit and the matched-text attribute information of the characters matched to each candidate alignment unit, the at least one alignment unit and obtaining the speed-change ratio corresponding to each alignment unit comprises:
    selecting, in order of unit serial number, one unselected candidate alignment unit as a current processing unit;
    determining a current speed-change ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively to the current processing unit and to the adjacent next candidate alignment unit;
    comparing the current speed-change ratio with a set first speed-change ratio value and a set second speed-change ratio value, wherein the second speed-change ratio value is greater than the first speed-change ratio value;
    where the current speed-change ratio is greater than or equal to the first speed-change ratio value and less than or equal to the second speed-change ratio value, determining the current processing unit as an alignment unit and recording the current speed-change ratio as the speed-change ratio of the determined alignment unit; determining whether all candidate alignment units have been selected for processing; in response to all candidate alignment units having been selected for processing, collecting each determined alignment unit and the speed-change ratio corresponding to each alignment unit; and in response to a candidate alignment unit not yet selected for processing existing among all the candidate alignment units, returning to the operation of selecting, in order of unit serial number, one unselected candidate alignment unit as the current processing unit;
    where the current speed-change ratio is greater than the second speed-change ratio value, determining a silence duration for padding the current processing unit, determining a new current speed-change ratio according to the silence duration, and returning to the operation of comparing the current speed-change ratio with the set first speed-change ratio value and the set second speed-change ratio value;
    where the current speed-change ratio is less than the first speed-change ratio value, merging the current processing unit with the adjacent next candidate alignment unit to form a new current processing unit, and returning to the operation of determining the current speed-change ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively to the current processing unit and to the adjacent next candidate alignment unit.
  9. The method according to claim 8, wherein determining the current speed-change ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively to the current processing unit and to the adjacent next candidate alignment unit, comprises:
    determining, according to the start and end times and the first-vowel start position of each of all the characters matched to the current processing unit, the pronunciation occupation duration of all the matched characters in the current processing unit;
    determining, according to the start and end times and the first-vowel start position of the first character matched to the next candidate alignment unit adjacent to the current processing unit, the vowel interval duration of that first character;
    taking the ratio of the unit duration of the current processing unit to the determined actual pronunciation duration as the current speed-change ratio of the current processing unit, wherein the actual pronunciation duration is the sum of the pronunciation occupation duration and the vowel interval duration.
  10. The method according to any one of claims 1-9, further comprising, before determining, according to the text attribute information and the music rhythm information, the at least one alignment period for aligning the speech segment with the background music and obtaining the alignment information table for each alignment period:
    where the total number of characters in the text attribute information is greater than the total number of rhythm points in the music rhythm information, ending the processing of converting the speech segment into rap music, and outputting a prompt to obtain a new speech segment or new background music.
  11. An apparatus for converting speech into rap music, comprising:
    an information determination module, configured to recognize a speech segment and process background music to obtain text attribute information of the characters in the speech segment and music rhythm information of the background music;
    an alignment information determination module, configured to determine, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and to obtain an alignment information table for each alignment period;
    a conversion control module, configured to control, according to the alignment information table of the at least one alignment period, the characters in the speech segment to align with the rhythm points in the background music to obtain aligned audio, and to form rap audio after performing pitch adjustment and special-effect processing on the aligned audio.
  12. A computer device, comprising:
    at least one processor;
    a storage device, configured to store at least one program;
    wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for converting speech into rap music according to any one of claims 1-10.
  13. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method for converting speech into rap music according to any one of claims 1-10.
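The period-level procedures recited in claims 5, 6, and 10 can be illustrated with a short Python sketch. The function names and the list-based representation are hypothetical. The sketch also assumes that the first entry of the list is a complete beat period, and that the rows selected "in reverse order" in claim 6 are the last rows of the table, kept in their original order.

```python
from typing import Any, List


def can_convert(total_chars: int, total_beats: int) -> bool:
    """Claim 10 pre-check: stop when characters outnumber rhythm points."""
    return total_chars <= total_beats


def alignment_periods(beats_per_period: List[int], total_chars: int) -> List[int]:
    """Claim 5: merge beat periods in pairs until one complete period has at
    least as many rhythm points as there are characters.

    beats_per_period lists the rhythm point count of each beat period in
    period-number order and is assumed to be non-empty.
    """
    periods = list(beats_per_period)
    while periods[0] < total_chars and len(periods) > 1:
        # Pairwise merge in order; an odd trailing period is carried over.
        periods = [sum(periods[k:k + 2]) for k in range(0, len(periods), 2)]
    return periods


def tail_table(full_table: List[Any], beats_in_period: int) -> List[Any]:
    """Claim 6: an incomplete period reuses the last rows of the full table."""
    return full_table[-beats_in_period:] if beats_in_period > 0 else []
```

For example, with beat periods of 4, 4, 4, 4, and 2 rhythm points and a 7-character speech segment, one round of merging gives periods of 8, 8, and 2 rhythm points; the first two are complete alignment periods and the last is an incomplete one that would reuse the final two rows of the complete period's table.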
PCT/CN2021/095236 2020-07-16 2021-05-21 Method and apparatus for converting voice into rap music, device, and storage medium WO2022012164A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010688502.3A CN111862913B (en) 2020-07-16 2020-07-16 Method, device, equipment and storage medium for converting voice into rap music
CN202010688502.3 2020-07-16

Publications (1)

Publication Number Publication Date
WO2022012164A1 true WO2022012164A1 (en) 2022-01-20

Family

ID=72984100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095236 WO2022012164A1 (en) 2020-07-16 2021-05-21 Method and apparatus for converting voice into rap music, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111862913B (en)
WO (1) WO2022012164A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114566191A (en) * 2022-02-25 2022-05-31 腾讯音乐娱乐科技(深圳)有限公司 Sound correcting method for recording and related device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862913B (en) * 2020-07-16 2023-09-05 广州市百果园信息技术有限公司 Method, device, equipment and storage medium for converting voice into rap music
CN113823281B (en) * 2020-11-24 2024-04-05 北京沃东天骏信息技术有限公司 Voice signal processing method, device, medium and electronic equipment
CN112669849A (en) * 2020-12-18 2021-04-16 百度国际科技(深圳)有限公司 Method, apparatus, device and storage medium for outputting information
CN112712783B (en) * 2020-12-21 2023-09-29 北京百度网讯科技有限公司 Method and device for generating music, computer equipment and medium
CN112700781B (en) * 2020-12-24 2022-11-11 江西台德智慧科技有限公司 Voice interaction system based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5811707A (en) * 1994-06-24 1998-09-22 Roland Kabushiki Kaisha Effect adding system
CN101399036A (en) * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for conversing voice to be rap music
CN103035235A (en) * 2011-09-30 2013-04-10 西门子公司 Method and device for transforming voice into melody
CN103440862A (en) * 2013-08-16 2013-12-11 北京奇艺世纪科技有限公司 Method, device and equipment for synthesizing voice and music
CN105931625A (en) * 2016-04-22 2016-09-07 成都涂鸦科技有限公司 Rap music automatic generation method based on character input
CN111402843A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Rap music generation method and device, readable medium and electronic equipment
CN111862913A (en) * 2020-07-16 2020-10-30 广州市百果园信息技术有限公司 Method, device, equipment and storage medium for converting voice into rap music

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170464B (en) * 2017-05-25 2020-11-27 厦门美图之家科技有限公司 Voice speed changing method based on music rhythm and computing equipment

Also Published As

Publication number Publication date
CN111862913A (en) 2020-10-30
CN111862913B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
WO2022012164A1 (en) Method and apparatus for converting voice into rap music, device, and storage medium
Temperley The musical language of rock
CN107123415B (en) Automatic song editing method and system
US20180349493A1 (en) Dual sound source audio data processing method and apparatus
CN112382257B (en) Audio processing method, device, equipment and medium
CN110782869A (en) Speech synthesis method, apparatus, system and storage medium
TW201933332A (en) Method and device for composing music
CN112712783B (en) Method and device for generating music, computer equipment and medium
WO2023124472A1 (en) Midi music file generation method, storage medium and terminal
TWI271699B (en) Music editing methods and related devices
Ockelford Zygonic theory: introduction, scope, and prospects
JP2014013340A (en) Music composition support device, music composition support method, music composition support program, recording medium storing music composition support program and melody retrieval device
CN112825244B (en) Music audio generation method and device
KR100762079B1 (en) Automatic musical composition method and system thereof
JP2006106334A (en) Method and apparatus for displaying lyrics
CN112528631B (en) Intelligent accompaniment system based on deep learning algorithm
JP3904012B2 (en) Waveform generating apparatus and method
Köküer et al. Curating and annotating a collection of traditional Irish flute recordings to facilitate stylistic analysis
JP2000276194A (en) Waveform compressing method and waveform generating method
JPH1097249A (en) Playing data converter
Cannon Laughter, Liquor, and Licentiousness: Preservation Through Play in Southern Vietnamese Traditional Music
Fields Morton Feldman's Piano and String Quartet: Analysis, Aesthetics, and Experience of a 20th-century Masterpiece
JP3744247B2 (en) Waveform compression method and waveform generation method
Levy We Are Never New:“Transing” the time of music through the life and works of Beverly Glenn-Copeland
JP4173475B2 (en) Lyric display method and apparatus

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21841954; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 21841954; Country of ref document: EP; Kind code of ref document: A1