CN116757175A - Audio text alignment method, computer device and computer readable storage medium - Google Patents

Audio text alignment method, computer device and computer readable storage medium

Info

Publication number: CN116757175A
Application number: CN202310553682.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: audio, phoneme, text, probability, inter
Other languages: Chinese (zh)
Inventors: 王武城, 龚韬
Current Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an audio text alignment method, a computer device and a computer readable storage medium. The method comprises: acquiring audio and text to be aligned; extracting at least one inter-word jump point from the audio, where an inter-word jump point lies between two adjacent characters in the text content expressed by the audio; adjusting, based on the inter-word jump points, the predicted transition probabilities of the text belonging to the various phoneme states to obtain new transition probabilities, where the adjustment sets the predicted transition probability between two adjacent characters in the text to a preset transition probability, and the preset transition probability characterizes the probability of transitioning between different phoneme states across the two adjacent characters; and aligning the audio and the text based on the new transition probabilities to obtain the aligned audio text. The method speeds up audio text alignment and improves the naturalness and expressiveness of the aligned audio text.

Description

Audio text alignment method, computer device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio text alignment method, a computer device, and a computer readable storage medium.
Background
When a song is played in a music playing application, the playing interface typically displays the lyrics corresponding to the current playback position, which requires time-aligned song text: each frame of the vocal track is aligned with the lyric text at the phoneme level.
In the related art, time-aligned song text is usually produced by manual annotation. However, the song library behind a music playing application grows by a large number of songs every day; if every time-aligned song text had to be obtained through manual annotation, production would be slow and the quality inconsistent.
Disclosure of Invention
In view of the foregoing, there is a need for an audio text alignment method, a computer device, and a computer-readable storage medium that can improve the efficiency and quality of audio/text alignment.
According to a first aspect of an embodiment of the present disclosure, there is provided an alignment method of audio text, including:
acquiring audio and text to be aligned;
extracting at least one inter-word jump point from the audio; the inter-word jump points are located between two adjacent characters in the text content expressed by the audio;
performing, based on the inter-word jump points, probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states, to obtain new transition probabilities; the probability adjustment is used to adjust the predicted transition probability between two adjacent characters in the text to a preset transition probability, and the preset transition probability characterizes the probability of transitioning between different phoneme states across the two adjacent characters;
and aligning the audio and the text based on the new transition probability to obtain an aligned audio text.
In an exemplary embodiment, before the probability adjustment is performed on the predicted transition probabilities of the text belonging to the various phoneme states based on the inter-word jump points, the method further includes:
framing the audio to obtain an audio frame sequence;
performing phoneme conversion processing on the text based on the audio frame sequence to obtain a phoneme sequence; the number of phonemes in the phoneme sequence corresponds to the number of audio frames in the audio frame sequence;
and obtaining the prediction transition probabilities of each phoneme belonging to various phoneme states in the phoneme sequence.
In an exemplary embodiment, the obtaining the predicted transition probabilities of each phoneme in the phoneme sequence belonging to the various phoneme states includes:
obtaining the predicted transition probabilities of each phoneme in the phoneme sequence belonging to various hidden Markov states; the predicted transition probabilities include inter-state transition probabilities, where an inter-state transition probability characterizes the probability that a first hidden Markov state of a previous phoneme in the phoneme sequence transitions to a second hidden Markov state of a subsequent phoneme, the first hidden Markov state being different from the second hidden Markov state;
performing, based on the inter-word jump points, the probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states includes:
performing probability adjustment on the inter-state transition probabilities of part of the phonemes in the phoneme sequence based on the inter-word jump points.
In an exemplary embodiment, the performing probability adjustment on the inter-state transition probabilities of part of the phonemes in the phoneme sequence based on the inter-word jump points includes:
in the audio, determining two adjacent target audio frames corresponding to the position of an inter-word jump point;
determining, in the phoneme sequence, the two target phonemes corresponding to the two target audio frames; and
adjusting the predicted inter-state transition probability of the later of the two target phonemes to the preset transition probability.
In an exemplary embodiment, the extracting at least one inter-word jump point from the audio includes:
performing fundamental frequency detection on the audio and determining first-class jump points in the audio; a first-class jump point characterizes a jump in the fundamental frequency value between two adjacent audio frames in the audio;
performing energy detection on the audio and determining second-class jump points in the audio; a second-class jump point characterizes a jump in the energy value between two adjacent audio frames in the audio; and
extracting at least one jump point from the first-class jump points and the second-class jump points as an inter-word jump point.
In an exemplary embodiment, the determining the first-class jump points in the audio includes:
performing a Fourier transform on the fundamental frequency values of each audio frame in the audio, and determining amplitude spectrum data for the audio;
when the degree of amplitude change between two adjacent audio frames in the amplitude spectrum data exceeds a preset degree, determining that a first-class jump point exists between the two adjacent audio frames; and
collecting the first-class jump points in the audio to obtain a first-class jump point sequence for the audio.
In an exemplary embodiment, the determining the second-class jump points in the audio includes:
among the audio frames of the audio, when the degree of change between the energy values of two adjacent audio frames exceeds a preset degree, determining that a second-class jump point exists between the two adjacent audio frames; and
collecting the second-class jump points in the audio to obtain a second-class jump point sequence for the audio.
In an exemplary embodiment, the extracting at least one jump point from the first-class jump points and the second-class jump points as an inter-word jump point includes:
determining intersection jump points from the first-class jump points of the first-class jump point sequence and the second-class jump points of the second-class jump point sequence, and taking the intersection jump points as the inter-word jump points of the audio; the number of intersection jump points is at least one.
In an exemplary embodiment, the aligning the audio and the text based on the new transition probability to obtain an aligned audio text includes:
acquiring the emission probabilities of the audio belonging to the various phoneme states;
decoding the emission probability and the new transition probability, and determining an optimal phoneme state corresponding to each audio frame in the audio;
determining a time correspondence between each audio frame and each phoneme in the text based on the optimal phoneme state of each audio frame;
and aligning the audio and the text based on the time corresponding relation to obtain an aligned audio text.
According to a second aspect of embodiments of the present disclosure, there is provided an alignment apparatus for audio text, including:
a data acquisition unit configured to acquire the audio and text to be aligned;
a jump point extraction unit configured to extract at least one inter-word jump point from the audio; the inter-word jump points are located between two adjacent characters in the text content expressed by the audio;
a probability adjustment unit configured to perform, based on the inter-word jump points, probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states, to obtain new transition probabilities; the probability adjustment is used to adjust the predicted transition probability between two adjacent characters in the text to a preset transition probability, and the preset transition probability characterizes the probability of transitioning between different phoneme states across the two adjacent characters; and
an alignment unit configured to align the audio and the text based on the new transition probabilities, to obtain the aligned audio text.
According to a third aspect of embodiments of the present disclosure, there is provided a computer device comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the method of audio text alignment as set forth in any one of the preceding claims.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium comprising a computer program which, when executed by a processor of a computer device, enables the computer device to perform a method of alignment of audio text as described in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising program instructions therein, which when executed by a processor of a computer device, enable the computer device to perform the method of alignment of audio text as described in any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
First, the audio and text to be aligned are acquired; then, at least one inter-word jump point is extracted from the audio, where an inter-word jump point lies between two adjacent characters in the text content expressed by the audio; next, based on the inter-word jump points, the predicted transition probabilities of the text belonging to the various phoneme states are adjusted to obtain new transition probabilities, where the adjustment sets the predicted transition probability between two adjacent characters in the text to a preset transition probability that characterizes the probability of transitioning between different phoneme states across the two adjacent characters; finally, the audio and the text are aligned based on the new transition probabilities to obtain the aligned audio text. On the one hand, compared with the prior art, aligning the audio and the text through the transition probabilities of their corresponding phoneme states streamlines the alignment pipeline, speeds up audio text alignment, and reduces labor and time costs. On the other hand, using the inter-word jump points in the audio to adjust the predicted transition probabilities of the phoneme states corresponding to the text, and then aligning on the adjusted probabilities, improves the naturalness and expressiveness of the aligned audio text, so that its quality and display effect are better.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an application environment diagram illustrating a method of alignment of audio text according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of alignment of audio text according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a step of extracting an inter-word jump point, according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating steps for obtaining a predicted transition probability according to an exemplary embodiment.
FIG. 5 is a flow chart illustrating a step of adjusting a predicted transition probability according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating a step of aligning audio and text according to another exemplary embodiment.
Fig. 7 is a flowchart illustrating a method of alignment of audio text according to another exemplary embodiment.
Fig. 8 is a block diagram illustrating a method of alignment of audio text according to another exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a state transition probability map, according to an example embodiment.
Fig. 10 is a block diagram of an audio text alignment apparatus according to an exemplary embodiment.
FIG. 11 is a block diagram of a computer device for audio text alignment, according to an exemplary embodiment.
FIG. 12 is a block diagram of a computer-readable storage medium for audio text alignment, according to an exemplary embodiment.
FIG. 13 is a block diagram of a computer program product for audio text alignment, according to an exemplary embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The term "and/or" in embodiments of the present application is meant to include any and all possible combinations of one or more of the associated listed items. Also described are: as used in this specification, the terms "comprises/comprising" and/or "includes" specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components, and/or groups thereof.
The terms "first," "second," and the like in this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
In addition, although the terms "first," "second," etc. may be used several times in the present application to describe various operations (or various elements, applications, instructions, or data), these operations (or elements, applications, instructions, or data) should not be limited by these terms. These terms are only used to distinguish one operation (or element, application, instruction, or datum) from another. For example, a first hidden Markov state may be referred to as a second hidden Markov state, and similarly a second hidden Markov state may be referred to as a first hidden Markov state, without departing from the scope of the application; the two differ only in what they cover, as both are drawn from the set of hidden Markov states but are not the same state.
The method for aligning the audio text, provided by the embodiment of the application, can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a communication network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server.
In some embodiments, referring to fig. 1, the server 104 first obtains the audio and text to be aligned; then, the server 104 extracts at least one inter-word jump point from the audio, where an inter-word jump point is located between two adjacent characters in the text content expressed by the audio; next, the server 104 performs, based on the inter-word jump points, probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states, obtaining new transition probabilities, where the probability adjustment sets the predicted transition probability between two adjacent characters in the text to a preset transition probability that characterizes the probability of transitioning between different phoneme states across the two adjacent characters; finally, the server 104 aligns the audio and the text based on the new transition probabilities, obtaining the aligned audio text.
In some embodiments, the terminal 102 (e.g., a mobile terminal or a fixed terminal) may be implemented in various forms. As a mobile terminal, the terminal 102 may be a mobile phone, a smart phone, a notebook computer, a portable handheld device, a personal digital assistant (PDA), a tablet computer (PAD), or the like; as a fixed terminal, it may be an automated teller machine (ATM), an automatic all-in-one kiosk, a digital TV, a desktop computer, a fixed computer, or the like. In either form, the terminal is capable of aligning audio and text based on the inter-word jump points of the audio to be aligned and the predicted transition probabilities of the text belonging to the various phoneme states.
In the following, it is assumed that the terminal 102 is a fixed terminal. However, those skilled in the art will appreciate that the configurations according to the disclosed embodiments of the present application can also be applied to a mobile terminal 102, save for any operations or elements intended specifically for mobile use.
In some embodiments, the data processing components running on the server 104 may load any of a variety of additional server applications and/or middle-tier applications, including, for example, HTTP (hypertext transfer protocol), FTP (file transfer protocol), CGI (common gateway interface), RDBMS (relational database management system) applications, and the like.
In some embodiments, the server 104 may be implemented as a stand-alone server or as a server cluster. The server 104 may be adapted to run one or more application services or software components that serve the terminal 102 described in the foregoing disclosure.
In some embodiments, the application services may include a service interface that provides the user with selection of the audio and text to be aligned, together with the corresponding program services, and so on. The software components may include, for example, a software development kit (SDK) or a client application (APP) having the function of aligning audio and text based on the inter-word jump points of the audio to be aligned and the predicted transition probabilities of the text belonging to the various phoneme states.
In some embodiments, the application or client with audio and text alignment capability provided by the server 104 includes a portal that provides one-to-one application services to users in the foreground, and a plurality of business systems in the background for data processing, so as to extend the relevant alignment functionality to the APP or client, letting users use and access the functions associated with audio and text alignment anytime and anywhere.
In some embodiments, the audio and text alignment function of the APP or client may be a computer program running in user mode to accomplish one or more specific tasks; it can interact with the user and has a visual user interface. The APP or client may include two parts: a graphical user interface (GUI) and an engine, which together provide the user with a digitized client system offering various application services in the form of a user interface.
In some embodiments, a user may input corresponding code data or control parameters to the APP or client through a preset input device or an automatic control program to execute application services of a computer program in the server 104 and display application services in a user interface.
In some embodiments, the operating system on which the APP or client runs may include various versions of Microsoft Windows and/or the Linux operating system, various commercial or UNIX-like operating systems (including but not limited to the various GNU/Linux operating systems, Google operating systems, etc.), and/or mobile operating systems, as well as other online or offline operating systems, which are not specifically limited herein.
In some embodiments, as shown in fig. 2, a method for aligning audio text is provided, and the method is applied to the server 104 in fig. 1 for illustration, and the method includes the following steps:
step S11, acquiring the audio and text to be aligned.
In some embodiments, the server obtains, from a terminal application (e.g., on a mobile phone or tablet), the audio to be aligned and the corresponding text to be aligned transmitted by the user account.
The audio to be aligned may be the official version of a released song, or a song recorded locally by the terminal application (e.g., a live song recorded offline or a web song recorded online by the terminal application).
In some embodiments, the audio to be aligned comprises dry vocal audio expressing a lyric sequence, where the dry vocal audio consists of a plurality of consecutive audio frames.
In some embodiments, the server may first obtain the input audio transmitted by the user account (e.g., a live song recorded offline on the terminal), then pass the input audio into a preset vocal-accompaniment separation model (e.g., the spleeter algorithm) to separate the dry vocals from the accompaniment, thereby extracting the dry vocal audio of the input audio, composed of a plurality of consecutive audio frames, and taking it as the audio to be aligned. Lyrics are then extracted from each audio frame of the dry vocal audio through a preset lyric analysis model to obtain the lyric sequence of the input audio, which is taken as the text to be aligned.
As an example, the spleeter algorithm first re-segments the song and projects the segments into a low-dimensional space, reducing dimensionality and compressing the information so as to extract the audio depth features of the song. The audio depth features are then classified by an MLP-based multilayer perceptron into features associated with the dry vocals and features associated with the accompaniment. Finally, the compressed low-dimensional features are restored to dry vocal audio and accompaniment audio of the original dimensions, so that the dry vocal audio, i.e., the audio to be aligned, is extracted from the input audio.
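For illustration, the following is a minimal sketch of this two-stem separation step using the open-source Spleeter library; the file names and output layout are assumptions for the example, not part of the claimed method.

```python
# A minimal sketch of dry-vocal / accompaniment separation with Spleeter.
# "input_song.wav" and "output/" are placeholder paths.
from spleeter.separator import Separator

# The pretrained "spleeter:2stems" model splits a mix into vocals
# (the dry vocal audio used as the audio to be aligned) and accompaniment.
separator = Separator("spleeter:2stems")

# Writes output/input_song/vocals.wav and output/input_song/accompaniment.wav.
separator.separate_to_file("input_song.wav", "output/")
```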
As an example, the lyric analysis model first extracts the text information of each audio frame of the dry vocal audio through a preset speech recognition algorithm. The speech recognition algorithm may be based on dynamic time warping (DTW), on a hidden Markov model with a parametric model, on an artificial neural network, on a hybrid of these, or the like. The lyric analysis model then performs word segmentation on the text information of the dry vocal audio to obtain the words to be recognized. Finally, the words to be recognized are matched against the lyric text of the corresponding song to obtain the lyric sequence of the dry vocal audio, i.e., the text to be aligned.
Step S12, extracting at least one inter-word jump point from the audio.
In one embodiment, the server performs voice activity detection (VAD) on the audio to be aligned to extract at least one inter-word jump point from the audio.
An inter-word jump point is located between two adjacent characters in the text content expressed by the audio; that is, one inter-word jump point lies between every two adjacent characters expressed in the audio to be aligned.
In some embodiments, the voice activity detection includes fundamental frequency detection or energy detection on the audio to be aligned, so as to determine the inter-word jump points between the characters expressed in the audio based on the detected fundamental frequency data or energy data.
Step S13, performing probability adjustment, based on the inter-word jump points, on the predicted transition probabilities of the text belonging to the various phoneme states, to obtain new transition probabilities.
In an embodiment, the probability adjustment is used to adjust the predicted transition probability between two adjacent characters in the text to a preset transition probability.
In one embodiment, the preset transition probabilities characterize the transition probabilities of different phoneme states between two adjacent characters.
The preset transition probability may be an absolute probability, that is, the predicted inter-state transition probability of the later of the two adjacent characters is set to one hundred percent. For example, let two adjacent characters be character 1 and character 2, with character 2 the later one. Before the probability adjustment, the inter-state transition probability of character 2 is 60%; after the probability adjustment, its self-transition probability is 0% and its inter-state transition probability is 100%, so that when character 1 is in phoneme state 1, the phoneme state of character 2 is necessarily a phoneme state different from phoneme state 1. In other embodiments, the preset transition probability may be configured as other probabilities as needed, which is not specifically limited herein.
In some embodiments, before performing probability adjustment on the predicted transition probabilities that the text belongs to various phoneme states, the server needs to obtain the predicted transition probabilities that each phoneme in the phoneme sequence corresponding to the text belongs to various phoneme states, where the method specifically includes: and obtaining the prediction transition probabilities of each phoneme belonging to various hidden Markov states in the phoneme sequence.
In some embodiments, the phoneme states refer to hidden Markov states for each phoneme in the phoneme sequence. Each phoneme may contain at least 3 hidden markov states, namely a pre-phoneme state, an in-phoneme state, a post-phoneme state. The predicted transition probabilities of the phoneme states are the transition probabilities of the phonemes belonging to the various phoneme states respectively.
The predicted transition probabilities include inter-state transition probabilities. An inter-state transition probability characterizes the probability that a first hidden Markov state of a previous phoneme in the phoneme sequence transitions to a second hidden Markov state of the next phoneme, the first hidden Markov state being different from the second. For example, in the embodiment above, when character 1 is in phoneme state 1, the phoneme state of character 2 is a phoneme state different from phoneme state 1.
In some embodiments, the server performs the probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states based on the inter-word jump points, which specifically includes: performing probability adjustment on the inter-state transition probabilities of part of the phonemes in the phoneme sequence corresponding to the text, based on the inter-word jump points.
Specifically, based on the inter-word jump points, the server determines between which phoneme pairs (i.e., a previous phoneme and a next phoneme) in the phoneme sequence corresponding to the text the inter-word jump points lie, and then adjusts the inter-state transition probability of the next phoneme in each such phoneme pair to the preset transition probability.
And step S14, aligning the audio and the text based on the new transition probability to obtain an aligned audio text.
In some embodiments, the server first extracts the audio features of each audio frame in the audio (such as Mel-frequency cepstrum features and perceptual linear prediction features), and then calculates, based on a preset hidden Markov model, the emission probabilities of each frame's audio features belonging to the various hidden Markov states. Further, the server inputs the emission probabilities and the new transition probabilities of the audio into a hidden Markov-Gaussian mixture model (HMM-GMM) to perform forced alignment of the audio text, obtaining the aligned audio text.
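For illustration, a brief sketch of the per-frame feature extraction follows, using the librosa library and the 20 ms frame length and 10 ms frame shift stated elsewhere in this application; the file name, sampling rate, and 13-dimensional MFCC size are assumptions.

```python
# A sketch of per-frame MFCC extraction for the dry vocal audio.
import librosa

y, sr = librosa.load("dry_vocals.wav", sr=16000)   # placeholder file, assumed rate
frame_len = int(0.02 * sr)                          # 20 ms frame length (samples)
hop_len = int(0.01 * sr)                            # 10 ms frame shift (samples)

# Shape (13, n_frames): one 13-dimensional MFCC vector per audio frame,
# which a trained model then maps to per-state emission probabilities.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len)
```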
In the above audio text alignment process, the server first obtains the audio and text to be aligned; then extracts at least one inter-word jump point from the audio, where an inter-word jump point lies between two adjacent characters in the text content expressed by the audio; then, based on the inter-word jump points, adjusts the predicted transition probabilities of the text belonging to the various phoneme states to obtain new transition probabilities, where the adjustment sets the predicted transition probability between two adjacent characters in the text to a preset transition probability that characterizes the probability of transitioning between different phoneme states across the two adjacent characters; and finally aligns the audio and the text based on the new transition probabilities to obtain the aligned audio text. On the one hand, compared with the prior art, aligning the audio and the text through the transition probabilities of their corresponding phoneme states streamlines the alignment pipeline, speeds up alignment, and reduces labor and time costs; on the other hand, using the inter-word jump points in the audio to adjust the predicted transition probabilities of the text's phoneme states before aligning on them improves the naturalness and expressiveness of the aligned audio text, so that its quality and display effect are better.
It will be appreciated by those skilled in the art that the methods disclosed in the above embodiments may be implemented in other specific manners. For example, the above-described embodiment in which the server aligns audio and text based on the new transition probabilities to obtain the aligned audio text is merely illustrative.
Likewise, the manner in which the server extracts at least one inter-word jump point from the audio, and the manner in which the server performs probability adjustment on the predicted transition probabilities of the text based on the inter-word jump points, are merely one possible arrangement; in an actual implementation there may be other divisions: for example, the inter-word jump points of the audio and the predicted transition probabilities of the text may be processed together or integrated into another system, or some features may be omitted or not implemented.
In an exemplary embodiment, referring to fig. 3, fig. 3 is a flowchart of an embodiment of extracting inter-word jump points according to the present application. In step S12, the server may extract at least one inter-word jump point from the audio in the following manner:
Step S121, performing fundamental frequency detection on the audio and determining the first-class jump points in the audio.
In one embodiment, a first-class jump point characterizes a jump in the fundamental frequency value between two adjacent audio frames in the audio.
The first-class jump points are inter-frame jump points between adjacent audio frames.
As an example, let the fundamental frequency value of a first audio frame A1 in the audio be A2, the fundamental frequency value of a second audio frame B1 be B2, and let A1 and B1 be a pair of front-and-back adjacent audio frames. If the distance between fundamental frequency values A2 and B2 is smaller than a preset distance threshold, no first-class jump point exists between A1 and B1; if the distance is greater than or equal to the preset distance threshold, a first-class jump point exists between A1 and B1.
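The following numpy sketch illustrates this check; the F0 values and the distance threshold are invented for the example.

```python
# A simplified check for first-class jump points: flag every frame boundary
# where the fundamental frequency difference reaches the preset threshold.
import numpy as np

f0 = np.array([220.0, 221.5, 219.8, 261.6, 262.1, 0.0])  # per-frame F0 in Hz
threshold = 30.0                                           # preset distance threshold

# jump[i] is True when a first-class jump point lies between frames i and i+1.
jump = np.abs(np.diff(f0)) >= threshold
print(np.nonzero(jump)[0])                                 # -> boundaries 2 and 4
```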
In other embodiments, the server may also determine the first-class jump points in the audio as follows:
Step one: perform a Fourier transform on the fundamental frequency values of each audio frame in the audio, and determine the amplitude spectrum data of the audio.
In some embodiments, the server first performs Fourier transform processing on the fundamental frequency values of each audio frame to determine the complex spectrum of the transformed audio. The spectrum data include phase spectrum data and amplitude spectrum data.
In other embodiments, instead of the Fourier transform, the server may apply a fast Fourier transform, a (modified) discrete cosine transform, a wavelet transform, or the like to the fundamental frequency values of each audio frame, which is not specifically limited herein.
As an example, the m-th audio frame is Fourier transformed to obtain the complex spectrum corresponding to the m-th frame:

X_m(k) = Σ_{i=0}^{K-1} x_m(i) · w(i) · e^{-j·2πik/K}, k = 0, 1, …, K-1

where K denotes the frame length (here the frame length is exactly equal to the number of points of the Fourier transform), X(k) denotes the complex spectrum of the audio frame, i denotes the index of a sample within the frame, w(i) denotes the windowing function, and j denotes the imaginary unit.
Further, the server performs feature extraction on the complex spectrum of the audio frame to obtain the phase spectrum data and the amplitude spectrum data.
As an example, the amplitude spectrum data are characterized by the formula A_X(k) = |X(k)|, where |·| denotes the complex modulus.
The phase spectrum data are characterized by the formula φ_X(k) = atan2(X_i(k), X_r(k)), where X_r(k) and X_i(k) denote the real and imaginary parts of the complex spectrum respectively, namely X(k) = X_r(k) + j·X_i(k), with j the imaginary unit.
Step two: when the degree of amplitude change between two adjacent audio frames in the amplitude spectrum data exceeds a preset degree, determine that a first-class jump point exists between the two adjacent audio frames.
As an example, let the amplitude spectrum data of a first audio frame A1 in the audio be A3, the amplitude spectrum data of a second audio frame B1 be B3, and let A1 and B1 be a pair of front-and-back adjacent audio frames. If the distance between amplitude spectrum data A3 and B3 is smaller than a preset distance threshold, no first-class jump point exists between A1 and B1; if the distance is greater than or equal to the preset distance threshold, a first-class jump point exists between A1 and B1.
In other embodiments, step two may instead be: when the degree of phase change between two adjacent audio frames in the phase spectrum data exceeds a preset degree, determine that a first-class jump point exists between the two adjacent audio frames.
As an example, let the phase spectrum data of a first audio frame A1 in the audio be A4, the phase spectrum data of a second audio frame B1 be B4, and let A1 and B1 be a pair of front-and-back adjacent audio frames. If the distance between phase spectrum data A4 and B4 is smaller than a preset distance threshold, no first-class jump point exists between A1 and B1; if the distance is greater than or equal to the preset distance threshold, a first-class jump point exists between A1 and B1.
Step three: collect the first-class jump points in the audio to obtain a first-class jump point sequence for the audio.
In an embodiment, the server may collect any one of the following as first-class jump points: points where the fundamental frequency values of adjacent audio frames jump, points where the phase spectrum data of adjacent audio frames jump, or points where the amplitude spectrum data of adjacent audio frames jump; it may also collect all three kinds together to obtain the first-class jump point sequence, which is not specifically limited herein.
Step S122, performing energy detection on the audio and determining the second-class jump points in the audio.
In one embodiment, a second-class jump point characterizes a jump in the energy value between two adjacent audio frames in the audio. The second-class jump points are inter-frame jump points between adjacent audio frames.
In one embodiment, the server may determine the second-class jump points in the audio as follows:
Step one: among the audio frames of the audio, when the degree of change between the energy values of two adjacent audio frames exceeds a preset degree, determine that a second-class jump point exists between the two adjacent audio frames.
As an example, let the energy value of a first audio frame C1 in the audio be C2, the energy value of a second audio frame D1 be D2, and let C1 and D1 be a pair of front-and-back adjacent audio frames. If the distance between energy values C2 and D2 is smaller than a preset distance threshold, no second-class jump point exists between C1 and D1; if the distance is greater than or equal to the preset distance threshold, a second-class jump point exists between C1 and D1.
Step two: collect the second-class jump points in the audio to obtain a second-class jump point sequence for the audio.
In one embodiment, the server may collect the points where the energy values of adjacent audio frames jump to obtain the second-class jump point sequence.
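As an illustration, the sketch below computes short-time energy per frame and thresholds the change between adjacent frames; the frame layout, input signal, and threshold are assumptions.

```python
# A sketch of second-class (energy) jump point detection.
import numpy as np

def frame_energy(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Short-time energy of each frame: sum of squared samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[n * hop : n * hop + frame_len] ** 2)
                     for n in range(n_frames)])

x = np.random.randn(16000)                       # stand-in for 1 s of dry vocals
energy = frame_energy(x, frame_len=320, hop=160)

# True at every frame boundary whose energy change reaches the preset degree.
second_class = np.abs(np.diff(energy)) >= 5.0
```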
Step S123, extracting at least one jump point from the first-class jump points and the second-class jump points as an inter-word jump point.
In one embodiment, the server determines the intersection jump points from the first-class jump points of the first-class jump point sequence and the second-class jump points of the second-class jump point sequence, and takes the intersection jump points as the inter-word jump points of the audio.
In one embodiment, an intersection jump point characterizes an audio jump between two adjacent characters in the character content expressed by the audio.
As an example, the first-class jump point sequence includes five jump points a1, a2, a3, a4, and a5; the second-class jump point sequence includes five jump points b1, b2, b3, b4, and b5. If a3 and b3 both denote a jump point between audio frame Z1 and audio frame Z2, and a4 and b5 both denote a jump point between audio frame Z3 and audio frame Z4, then the server uses P1 to denote the pair a3 and b3, uses P2 to denote the pair a4 and b5, and takes P1 and P2 as intersection jump points; that is, P1 and P2 are both inter-word jump points of the audio.
The number of intersection jump points is at least one; when the number of intersection jump points is 0, the server reselects the first-class jump points and the second-class jump points of the audio and redetermines the intersection jump points.
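Since each jump point can be identified by the frame boundary it sits on, the intersection in this step reduces to a set intersection of boundary indices, as the sketch below illustrates with invented indices.

```python
# A sketch of intersecting the two jump point sequences by frame boundary index.
first_class = {3, 17, 42, 80}    # boundaries found by fundamental frequency detection
second_class = {3, 25, 42, 81}   # boundaries found by energy detection

inter_word_points = sorted(first_class & second_class)   # -> [3, 42]
if not inter_word_points:
    # Empty intersection: reselect jump points and redetermine, as described above.
    pass
```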
In an exemplary embodiment, referring to fig. 4, fig. 4 is a flowchart of an embodiment of obtaining the predicted transition probabilities according to the present application. Before performing, in step S13, the probability adjustment on the predicted transition probabilities of the text belonging to the various phoneme states based on the inter-word jump points, the server may specifically perform the following:
Step a1, performing a framing operation on the audio to obtain an audio frame sequence.
In one embodiment, the server first performs a framing operation on the audio and then performs a windowing operation to obtain the audio frame sequence.
The framing operation may be expressed as x_n(i) = x(n·M + i), where n denotes the n-th frame, M denotes the frame shift, i denotes the index of a sample within the n-th frame, and i ranges over 0, 1, 2, …, L-1, with L the frame length. The present application uses a frame length t_frm = 0.02 s (seconds) and a frame shift t_frmhop = 0.01 s (seconds).
The windowing operation is expressed as xw_n(i) = x_n(i) · w(i), where w(i) denotes the window function and i denotes the i-th sample; in one example, a Hanning window may be used.
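A compact numpy sketch of these two operations follows, using the 20 ms / 10 ms frame layout stated above; the sampling rate and input signal are assumptions.

```python
# A sketch of the framing and Hanning windowing operations.
import numpy as np

sr = 16000                   # assumed sampling rate
L = int(0.02 * sr)           # frame length in samples
M = int(0.01 * sr)           # frame shift in samples
w = np.hanning(L)            # Hanning window w(i)

x = np.random.randn(sr)      # stand-in for the dry vocal signal
n_frames = 1 + (len(x) - L) // M

# xw[n, i] = x(n*M + i) * w(i), matching the formulas above.
xw = np.stack([x[n * M : n * M + L] * w for n in range(n_frames)])
```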
And a2, performing phoneme conversion processing on the text based on the audio frame sequence to obtain a phoneme sequence.
In an embodiment, the server inputs the text to be aligned into a preset phoneme recognition network for phoneme segmentation, or a music editor manually segments the text to be aligned into phonemes, to obtain the corresponding multi-segment phoneme fragments.
In an embodiment, the number of phonemes in the phoneme sequence corresponds to the number of audio frames in the audio frame sequence. That is, the phoneme recognition network (or the music editor) segments the phonemes in the phoneme sequence according to the number of audio frames in the audio frame sequence, obtaining phoneme fragments equal in number to the audio frames in the audio frame sequence.
And a3, obtaining the prediction transition probabilities of each phoneme belonging to various phoneme states in the phoneme sequence.
In some embodiments, the server performs probabilistic prediction on each phoneme in the phoneme sequence by using a probabilistic prediction model to obtain a prediction transition probability that each phoneme belongs to each phoneme state. The probabilistic predictive model may be, for example, a hidden markov-gaussian mixture model (HMM-GMM model), so as to output the predictive transition probabilities that each phoneme belongs to each phoneme state through a series of algorithm optimization and iteration of the HMM-GMM model.
In some embodiments, the phoneme states refer to hidden Markov states for each phoneme in the phoneme sequence. Each phoneme may contain at least 3 hidden markov states, namely a pre-phoneme state, an in-phoneme state, a post-phoneme state. The predicted transition probabilities of the phoneme states are the transition probabilities of the phonemes belonging to the various phoneme states respectively.
As an example, in actual speech, people do not pronounce each word in isolation; many pronunciations are linked together, which is especially common in English. Therefore, even for the same phoneme, if the preceding and following phonemes differ, the final pronunciation of that phoneme will differ. Since the pronunciation of a phoneme is affected by the phonemes before and after it, a phoneme can be combined with its preceding and following phonemes, and three states used to represent it. For example, the phoneme en with a preceding zh and a following h may be represented by zh-en and en-h. Thus, in some embodiments, when the language of the text to be aligned and the audio to be aligned is Chinese, the triphone splitting may split the characters in the text into triphone expressions. In some embodiments, the triphone splitting may be performed on the basis of the monophone splitting; for example, after the character "jin" in the text to be aligned is split into the monophones "j in", j may further be split, based on that result, into the triphone expression "sil-j j j-in". After the triphone expressions are obtained, the triphones can be clustered by unsupervised learning so that identical or similar triphones fall into one class. Similar triphones are triphones with similar pronunciations; for example, in the two syllables "ning" and "ming", because the pronunciation of the initial "n" is similar to that of "m", the triphone "n-ing" is pronounced similarly to "m-ing", and the two can be grouped into one class. After the triphones are clustered, identical or similar triphones are merged, which reduces the number of features to be processed in the audio text alignment process and speeds up alignment.
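The toy sketch below illustrates the monophone-to-triphone expansion described above, using a common left-center-right notation with "sil" marking silence at the edges; the notation and phone names are illustrative assumptions, not the patent's inventory.

```python
# A toy monophone-to-triphone expansion with left/right context.
mono = ["j", "in"]                       # monophone split of one character
padded = ["sil"] + mono + ["sil"]        # silence context at both edges

tri = [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
       for i in range(1, len(padded) - 1)]
# -> ['sil-j+in', 'j-in+sil']
```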
In some embodiments, the predicted transition probabilities include self-transition probabilities and inter-state transition probabilities. A self-transition probability characterizes a state transitioning, with a certain probability, to itself; an inter-state transition probability characterizes a state transitioning, with a certain probability, to another state.
As an example, let audio frames A1 and A2 be a pair of adjacent audio frames, phonemes B1 and B2 a pair of adjacent phonemes, with A1 corresponding to B1 and A2 corresponding to B2 in their respective sequences (i.e., the audio frame sequence and the phoneme sequence). The probability that B1 and B2 share the same phoneme state is the self-transition probability (i.e., B1 is in state 1 at frame A1 and B2 is in state 1 at frame A2), and the probability that B1 and B2 are in different phoneme states is the inter-state transition probability (i.e., B1 is in state 1 at frame A1, B2 is in state 2 at frame A2, and states 1 and 2 differ).
In an exemplary embodiment, referring to fig. 5, fig. 5 is a flowchart of an embodiment of adjusting the predicted transition probabilities according to the present application. In step S13, the server may perform the probability adjustment on the inter-state transition probabilities of part of the phonemes in the phoneme sequence based on the inter-word jump points as follows:
Step S131, in the audio, determining the two adjacent target audio frames corresponding to the position of an inter-word jump point.
Step S132, determining, in the phoneme sequence, the two target phonemes corresponding to the two target audio frames.
Step S133, adjusting the predicted inter-state transition probability of the later of the two target phonemes to the preset transition probability.
As an example, suppose there are inter-word jump points X1, X2, and X3 in the audio. The server first determines, in the audio frame sequence corresponding to the audio, audio frames a1 and a2 corresponding to X1, audio frames a3 and a4 corresponding to X2, and audio frames a5 and a6 corresponding to X3. Then, in the phoneme sequence, the server determines phonemes b1 and b2 corresponding to a1 and a2, phonemes b3 and b4 corresponding to a3 and a4, and phonemes b5 and b6 corresponding to a5 and a6. Finally, the server adjusts the predicted inter-state transition probabilities of phonemes b2, b4, and b6 to the preset transition probability.
The preset transition probability may be an absolute probability; that is, the predicted inter-state transition probability of the later target phoneme of the two is one hundred percent. For example, before the probability adjustment, phoneme b6 has a self-transition probability of 60% and an inter-state transition probability of 40%; after the probability adjustment, its self-transition probability is 0% and its inter-state transition probability is 100%.
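The sketch below illustrates this adjustment on a toy per-frame table of self-transition and inter-state transition probabilities; the table layout and jump positions are assumptions for the example.

```python
# A sketch of forcing the preset (absolute) transition probability at
# inter-word jump points: self-transition 0, inter-state transition 1.
import numpy as np

# trans[t] = [self-transition prob, inter-state transition prob] for frame t.
trans = np.tile([0.6, 0.4], (8, 1))

jump_boundaries = [2, 5]        # inter-word jump points between frames b and b+1
for b in jump_boundaries:
    trans[b + 1] = [0.0, 1.0]   # the later target phoneme must change state
```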
In an exemplary embodiment, referring to fig. 6, fig. 6 is a flowchart of an embodiment of the present application for aligning audio and text. In step S14, the server may align the audio and the text based on the new transition probabilities to obtain the aligned audio text as follows:
Step S141, acquiring the emission probabilities of the audio belonging to the various phoneme states.
In some embodiments, the server extracts the MFCC (Mel-frequency cepstrum coefficient) features of each audio frame of the audio, then integrates the emission probabilities of each frame's MFCC features belonging to the various hidden Markov states, to obtain the emission probability matrix P over the audio frames.
The emission probability matrix is characterized as P = (p_ij), i = 1, …, n; j = 1, …, m, where p_ij denotes the probability that the i-th frame belongs to the j-th state, n is the number of audio frames, and m is the number of states.
In other embodiments, the server may instead extract the PLP (perceptual linear predictive) features of each audio frame of the audio, then integrate the emission probabilities of each frame's PLP features belonging to the various hidden Markov states, to obtain the emission probability matrix P over the audio frames.
Step S142, decoding the emission probability and the new transition probability, and determining the optimal phoneme state corresponding to each audio frame in the audio.
In some embodiments, the server performs Viterbi decoding on the emission probability matrix and the new transition probability matrix through the GMM-HMM framework to obtain the optimal hidden Markov state corresponding to each audio frame. The optimal hidden Markov state is the optimal solution found for the audio frame in the text space.
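A generic log-space Viterbi pass over the emission matrix and the transition matrix might look as follows (a sketch only; the application does not spell out its GMM-HMM decoder):

```python
import numpy as np

def viterbi(log_P, log_A, log_pi):
    """Return the most probable state index for every frame.

    log_P  -- (n_frames, n_states) log emission probabilities
    log_A  -- (n_states, n_states) log transition probabilities
    log_pi -- (n_states,) log initial state probabilities
    """
    n_frames, n_states = log_P.shape
    delta = np.full((n_frames, n_states), -np.inf)    # best score per state
    back = np.zeros((n_frames, n_states), dtype=int)  # backpointers
    delta[0] = log_pi + log_P[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_A  # scores[from_state, to_state]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_P[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_frames - 2, -1, -1):  # trace the optimal path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path
```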
Step S143, based on the optimal phoneme state of each audio frame, a time correspondence between each audio frame and each phoneme in the text is determined.
And S144, aligning the audio and the text based on the time corresponding relation to obtain an aligned audio text.
In some embodiments, the server converts the optimal phoneme state corresponding to each audio frame in the audio into a phoneme sequence and then into a word sequence, thereby determining the time correspondence between each audio frame and each phoneme in the text; finally, the server aligns each phoneme with each audio frame according to this time correspondence to obtain the aligned audio text. The time correspondence between audio frames and phonemes is used to characterize the time stamps of the text.
To describe the audio text alignment method provided by the embodiments of the present disclosure more clearly, a specific embodiment of the method is set out below. In an exemplary embodiment, referring to fig. 7 and 8, fig. 7 is a flowchart of an audio text alignment method according to another exemplary embodiment, and fig. 8 is a block diagram of an audio text alignment method according to another exemplary embodiment, where the audio text alignment method is used in the server 104 and specifically includes the following:
step S21, obtaining voice audio in song audio and lyric text corresponding to the voice audio.
Step S22, carrying out frame division processing on the voice audio to obtain an audio frame sequence, and extracting audio characteristics from each audio frame in the audio frame sequence.
Wherein the audio features are the MFCC features corresponding to the voice audio.
Step S23, inputting the MFCC characteristics of each audio frame into an acoustic model DNN for probability prediction to obtain state emission probabilities of the MFCC characteristics of each audio frame belonging to various HMM states respectively.
Step S24, the server integrates the state emission probabilities that the MFCC features of each audio frame belong to the various HMM states, respectively, to obtain the emission probability matrix P corresponding to the audio frames.
The emission probability matrix $P$ can be written as

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m} \\ p_{21} & p_{22} & \cdots & p_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nm} \end{pmatrix}$$

where $p_{ij}$ denotes the probability that the $i$-th audio frame belongs to the $j$-th state ($i = 1, \dots, n$; $j = 1, \dots, m$).
Step S25, firstly converting the lyric text into a phoneme sequence, and then converting the phoneme sequence into a state sequence belonging to various HMM states.
Wherein the phoneme sequence is characterized based on ph1, ph2, …, phn (where n is the phoneme sequence length); the state sequence is characterized based on S1, S2, …, Sn (where n is the state sequence length).
Wherein each state in the sequence of states includes a self transition probability and an inter-state transition probability.
The self-transition probability characterizes that a state transitions to itself with a certain probability; the inter-state transition probability characterizes that a state transitions to another state with a certain probability.
Wherein there is a preset total number of HMM state types (e.g., 26 types); each phoneme corresponds to a plurality of audio frames (each audio frame corresponds to one state), and the state of each frame is determined according to its probabilities with respect to all the states.
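As an illustration of step S25 (the pronunciation lexicon and the three-state topology below are assumptions made for the sketch, not taken from the application):

```python
# Hypothetical pronunciation lexicon mapping words to phonemes.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
STATES_PER_PHONEME = 3  # assumed left-to-right HMM topology per phoneme

def text_to_states(words):
    """Expand words -> phoneme sequence ph1..phn -> HMM state sequence S1..Sn."""
    phonemes = [ph for w in words for ph in LEXICON[w]]
    states = [f"{ph}_{k}" for ph in phonemes for k in range(STATES_PER_PHONEME)]
    return phonemes, states

phonemes, states = text_to_states(["hello", "world"])
print(phonemes)    # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(states[:4])  # ['HH_0', 'HH_1', 'HH_2', 'AH_0']
```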
Step S26, a state transition matrix S corresponding to each audio frame is generated according to the state sequence.
Wherein the state transition matrix S is characterized based on a state transition probability map. As shown in fig. 9, fig. 9 is a schematic diagram of a state transition probability map according to an exemplary embodiment. The horizontal axis of the map is the audio frame number or the phoneme order (i.e., the sequence T = (1, 2, 3, 4, 5, 6, 7) characterizes the frame numbers in the audio frame sequence, and the sequence X = (x1, x2, x3, x4, x5, x6, x7) characterizes the phoneme numbers in the phoneme sequence); the vertical axis covers all state types (e.g., 26 states, schematically illustrated with the 5 states S = (S0, S1, S2, S3, S4)). The sequence a1 = (a11, a22, a33) represents the self-transition probability of each phoneme, and the sequence a2 = (a01, a12, a23, a34) represents the inter-state transition probability of each phoneme. The bold line in the figure is the decoded optimal probability path, i.e., the optimal state of each frame is determined based on the optimal probability path.
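A minimal sketch of step S26, assuming the left-to-right topology described above (states may either repeat or advance by one; the banded layout is this sketch's assumption):

```python
import numpy as np

def build_transition_matrix(self_probs, inter_probs):
    """Assemble a left-to-right transition matrix S in which state i may
    stay (S[i, i] = self_probs[i]) or advance to state i + 1
    (S[i, i + 1] = inter_probs[i])."""
    n = len(self_probs)
    S = np.zeros((n, n))
    for i in range(n):
        S[i, i] = self_probs[i]
        if i + 1 < n:
            S[i, i + 1] = inter_probs[i]
    S[-1, -1] = 1.0  # the final state absorbs any remaining frames
    return S
```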
Step S27, detecting the fundamental frequency of the voice audio, and determining a first class trip point sequence in the audio frame sequence.
Wherein the fundamental frequency detection is carried out based on an autocorrelation function (or the YIN algorithm):

$$R_i(k) = \sum_{m} x_i(m)\, x_i(m+k)$$

where $R_i(k)$ characterizes the trip point between two adjacent audio frames in the voice audio; $x_i(m)$ is the signal of the $i$-th frame; $k$ is the lag (candidate period) of $x_i(m)$, and when $k$ is a multiple of the period of $x_i(m)$, $R_i(k)$ takes its maximum value, which corresponds to the jump point value between two adjacent audio frames. The first class of trip point sequences is characterized based on (V1, V2, …, Vi, …, Vn).
A hopping point indicates that the corresponding audio frame has a jump relative to the HMM state of the previous audio frame, i.e., on the state transition probability map its vertical-axis value is one position higher than that of the previous audio frame. For example, x1 and x2 in fig. 9 are in the same state, while x3 and x4 are in different states, i.e., there is a state transition between x3 and x4.
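A simplified autocorrelation-based version of step S27 could be sketched as follows (the frame segmentation, the pitch search range, and the 20% jump threshold are assumptions; frames are assumed long enough to cover one period at the lowest pitch):

```python
import numpy as np

def frame_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate one frame's fundamental frequency from its autocorrelation
    R(k) = sum_m x(m) * x(m + k); a simplification of the YIN idea."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    kmin, kmax = int(sr / fmax), int(sr / fmin)
    k = kmin + np.argmax(r[kmin:kmax])  # lag with maximal autocorrelation
    return sr / k

def f0_hop_points(frames, sr, rel_jump=0.2):
    """Indices i where the pitch jumps between frames i and i+1 (first class)."""
    f0 = np.array([frame_f0(f, sr) for f in frames])
    jump = np.abs(np.diff(f0)) / np.maximum(f0[:-1], 1e-9)
    return np.where(jump > rel_jump)[0]
```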
And step S28, short-time energy detection is carried out on the audio frame sequence, and a second class of hopping point sequence in the audio frame sequence is determined.
Wherein short-time energy detection is used to calculate an energy value for each audio frame.
When the change between the energy values of two adjacent audio frames exceeds a threshold value, the voice audio is considered to be at an energy jump point currently; the second class of trip point sequences is characterized based on (D1, D2, …, di, …, dn).
Wherein both the fundamental frequency detection and the energy detection are, in one mode, forms of VAD (voice activity detection).
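Step S28 admits a similarly small sketch (the relative-change threshold is an assumption; `frames` is a list of per-frame sample arrays):

```python
import numpy as np

def energy_hop_points(frames, rel_change=0.5):
    """Indices i where the short-time energy changes by more than
    `rel_change` between frames i and i+1 (second class)."""
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    change = np.abs(np.diff(energy)) / np.maximum(energy[:-1], 1e-9)
    return np.where(change > rel_change)[0]
```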
Step S29, taking the intersection of the first class of hopping point sequence and the second class of hopping point sequence, and determining a third class of hopping point sequence in the audio frame sequence.
Wherein the third class of trip point sequences is characterized based on (K1, K2, …, ki, …, kn).
The first class and the second class of hopping point sequences are the sets of hopping points between the pairs of audio frames where a hop exists; the third class of hopping point sequence is the intersection of the two, and this intersection set is used as the set of hopping points between every two words.
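The intersection in step S29 reduces to a set operation (a sketch; any tolerance for near-coincident hop indices would be an additional design choice):

```python
def inter_word_hop_points(first_class, second_class):
    """The inter-word hopping points are the frame indices present in both
    candidate sequences."""
    return sorted(set(first_class) & set(second_class))

print(inter_word_hop_points([3, 10, 25, 40], [10, 25, 31]))  # [10, 25]
```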
Step S30, the state transition probability of the audio frame corresponding to each hopping point in the third type of hopping point sequence is modified to be 1, and a new state transition matrix S' is obtained.
Wherein a state transition probability of 1 between audio frames characterizes that two audio frames between two words must have state transitions. The new state transition matrix S' is characterized based on the new state transition probability map.
Wherein, there may or may not be a state transition between audio frames within each word.
And S31, carrying out Viterbi decoding on the emission probability matrix P and the new transition probability matrix S' through the GMM-HMM framework to obtain the optimal HMM state corresponding to each audio frame.
The optimal HMM state is an optimal solution for searching the audio frame on the lyric text space.
And step S32, determining a decoded state sequence according to the optimal HMM state.
The decoded state sequence characterizes the optimal state path on the state transition probability map, i.e., the state path recognized from the state transition probability map.
Step S33, converting the state sequence into a phoneme sequence and then into a word sequence to determine the time corresponding relation between each audio frame in the human voice audio and each lyric in the lyric sequence.
Wherein the temporal correspondence between audio frames and lyrics is used to characterize the time stamp of the lyrics.
Step S34, according to the time corresponding relation between each audio frame and each lyric in the lyric sequence, aligning each audio frame in the voice audio with each lyric in the lyric sequence, and obtaining the aligned audio text.
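Steps S33 and S34 can be illustrated together by collapsing the decoded per-frame state path into word-level time spans (`state_to_word` is a hypothetical mapping from each HMM state index to the lyric word it belongs to; the 10 ms hop is an assumption):

```python
def word_timestamps(state_path, state_to_word, hop_seconds=0.01):
    """Collapse the decoded per-frame state path into
    (word, start_time, end_time) triples in seconds."""
    if len(state_path) == 0:
        return []
    spans, current, start = [], None, 0
    for i, s in enumerate(state_path):
        w = state_to_word[s]
        if w != current:  # a new word begins at frame i
            if current is not None:
                spans.append((current, start * hop_seconds, i * hop_seconds))
            current, start = w, i
    spans.append((current, start * hop_seconds, len(state_path) * hop_seconds))
    return spans
```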
On the one hand, compared with the prior art, the method aligns the audio and the text through the transition probability of the phoneme states corresponding to the audio and the text to be aligned to obtain the aligned audio text, thereby optimizing the alignment flow of the audio text, accelerating the alignment speed of the audio text and reducing the consumption of manpower and time cost; on the other hand, the prediction transition probability of the phoneme state corresponding to the text is adjusted by utilizing the inter-word jumping points in the audio so as to align the audio with the text based on the transition probability, thereby improving the naturalness and expressive force of the aligned audio text and ensuring that the quality and the display effect of the aligned audio text are better.
It should be understood that, although the steps in the flowcharts of fig. 2-9 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least a portion of the steps of fig. 2-9 may include multiple steps or stages that are not necessarily performed at the same time but may be performed at different times; these steps or stages need not be performed sequentially, and may be performed in turn or alternately with other steps, or with at least a portion of the steps or stages in other steps.
It should be understood that the same or similar parts of the method embodiments described above may be referred to one another; each embodiment focuses on its differences from the other embodiments, and for the common parts reference may be made to the descriptions of the other method embodiments.
Fig. 10 is a block diagram of an audio text alignment apparatus according to an embodiment of the present application. Referring to fig. 10, the audio text alignment apparatus 10 includes: a data acquisition unit 11, a trip point extraction unit 12, a probability adjustment unit 13, and an alignment unit 14.
Wherein the data acquisition unit 11 is configured to perform acquisition of audio and text to be aligned;
wherein the trip point extraction unit 12 is configured to perform at least one inter-word trip point extraction from the audio; the inter-word hopping points are located between two adjacent characters in the text content expressed by the audio;
wherein, the probability adjustment unit 13 is configured to execute probability adjustment on the predicted transition probabilities that the text belongs to various phoneme states respectively based on the inter-word hopping points, so as to obtain new transition probabilities; the probability adjustment is used for adjusting the predicted transition probability between two adjacent characters in the text to be a preset transition probability, and the preset transition probability represents the transition probability of different phoneme states between the two adjacent characters.
Wherein the alignment unit 14 is configured to perform an alignment of the audio and the text based on the new transition probability, resulting in an aligned audio text.
In some embodiments, before said probability adjustment of the predicted transition probabilities of the text belonging to the respective phoneme states based on the inter-word hopping points, the probability adjustment unit 13 is specifically further configured to:
framing the audio to obtain an audio frame sequence;
performing phoneme conversion processing on the text based on the audio frame sequence to obtain a phoneme sequence; the number of phonemes in the phoneme sequence corresponds to the number of audio frames in the audio frame sequence;
and obtaining the prediction transition probabilities of each phoneme belonging to various phoneme states in the phoneme sequence.
In some embodiments, in terms of said obtaining the predicted transition probabilities of the respective phoneme states of the sequence of phonemes, the probability adjusting unit 13 is specifically further configured to:
obtaining the prediction transition probabilities of each phoneme belonging to various hidden Markov states in the phoneme sequence; the predictive transition probabilities include inter-state transition probabilities that characterize a probability in the phoneme sequence that a first hidden Markov state of a previous phoneme transitions to a second hidden Markov state of a subsequent phoneme, and the first hidden Markov state is different from the second hidden Markov state;
Based on the inter-word hopping points, the probability adjustment is performed on the predicted transition probabilities of the texts belonging to various phoneme states respectively, and the method comprises the following steps:
and carrying out probability adjustment on the transition probability between states of part of phonemes in the phoneme sequence based on the inter-word hopping points.
In some embodiments, in terms of said probability adjustment of the inter-state transition probabilities of the partial phonemes in the phoneme sequence based on the inter-word hopping point, the probability adjustment unit 13 is specifically further configured to:
in the audio, determining two adjacent target audio frames corresponding to the positions of the inter-word hopping points;
respectively determining two target phonemes corresponding to the two target audio frames in the phoneme sequence;
and adjusting the transition probability between the predicted states of the next target phoneme in the two target phonemes to be a preset transition probability.
In some embodiments, in terms of said extracting at least one inter-word trip point from said audio, the trip point extracting unit 12 is specifically further configured to:
detecting the fundamental frequency of the audio frequency, and determining a first class trip point in the audio frequency; the first class trip point characterizes that a trip of a base frequency value exists between two adjacent audio frames corresponding to the audio;
Performing energy detection on the audio frequency, and determining a second class of jump points in the audio frequency; the second class of hopping points represent hopping of energy values between two adjacent audio frames in the audio;
and extracting at least one hopping point from the first class of hopping points and the second class of hopping points as an inter-word hopping point.
In some embodiments, in the aspect of determining the first class of trip points in the audio, the trip point extracting unit 12 is specifically further configured to:
performing Fourier transform on respective fundamental frequency values of each audio frame in the audio respectively, and determining amplitude spectrum data aiming at the audio;
when the amplitude variation degree between two adjacent audio frames in the amplitude spectrum data exceeds a preset degree, determining that a first type of jump points exist between the two adjacent audio frames;
and collecting the first class of hopping points in the audio frequency to obtain a first class of hopping point sequence aiming at the audio frequency.
In some embodiments, in terms of said determining the second class of hops in said audio, the hopping point extracting unit 12 is specifically further configured to:
in each audio frame of the audio, when the degree of change between the energy values of two adjacent audio frames exceeds a preset degree, determining that a second class of jump points exist between the two adjacent audio frames;
And collecting the second class of hopping points in the audio frequency to obtain a second class of hopping point sequence aiming at the audio frequency.
In some embodiments, in terms of said extracting at least one hopping point from said first class of hopping points and said second class of hopping points as an inter-word hopping point, the hopping point extracting unit 12 is specifically further configured to:
determining intersection hopping points from the first class of hopping points in the first class of hopping point sequence and the second class of hopping points in the second class of hopping point sequence, and taking the intersection hopping points as the inter-word hopping points of the audio; the number of intersection hopping points is at least one.
In some embodiments, in the aspect of said aligning said audio and said text based on said new transition probability, resulting in aligned audio text, the alignment unit 14 is specifically further configured to:
acquiring the emission probabilities of the audios respectively belonging to various phoneme states;
decoding the emission probability and the new transition probability, and determining an optimal phoneme state corresponding to each audio frame in the audio;
determining a time correspondence between each audio frame and each phoneme in the text based on the optimal phoneme state of each audio frame;
And aligning the audio and the text based on the time corresponding relation to obtain an aligned audio text.
Fig. 11 is a block diagram of a computer device 20 provided in an embodiment of the present application. For example, the computer device 20 may be an electronic device, an electronic component, or a server array, or the like. Referring to fig. 11, the computer device 20 includes a processor 21, where the processor 21 may be a processor set including one or more processors, and memory resources represented by a memory 22, on which a computer program, such as an application program, is stored. The computer program stored in the memory 22 may include one or more modules, each corresponding to a set of executable instructions. Further, the processor 21 is configured to implement the audio text alignment method described above when executing the executable instructions.
In some embodiments, computer device 20 is an electronic device in which a computing system may run one or more operating systems, including any of the operating systems discussed above as well as any commercially available server operating systems. The computer device 20 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP (hypertext transfer protocol) servers, FTP (file transfer protocol) servers, CGI (common gateway interface) servers, super servers, database servers, and the like. Exemplary database servers include, but are not limited to, those commercially available from IBM (International Business Machines) and the like.
In some embodiments, processor 21 generally controls the overall operation of computer device 20, such as operations associated with display, data processing, data communication, and recording operations. The processor 21 may comprise one or more processor components to execute computer programs to perform all or part of the steps of the methods described above. Further, the processor component may include one or more modules that facilitate interactions between the processor component and other components. For example, the processor component may include a multimedia module to facilitate interaction between the user and the computer device 20 through multimedia components.
In some embodiments, the processor components in the processor 21 may also be referred to as CPUs (Central Processing Units). The processor component may be an electronic chip with signal processing capabilities. The processor may also be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor components may be collectively implemented by an integrated circuit chip.
In some embodiments, memory 22 is configured to store various types of data to support operations at computer device 20. Examples of such data include instructions, acquisition data, messages, pictures, video, and the like for any application or method operating on computer device 20. The memory 22 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
In some embodiments, the memory 22 may be a memory stick, a TF card, etc., and may store all information in the computer device 20, including the input raw data, the computer programs, and the intermediate and final running results. In some embodiments, the memory 22 stores and retrieves information based on the location specified by the processor. With the memory 22, the computer device 20 has memory capabilities that ensure proper operation. In some embodiments, the memory 22 of the computer device 20 may be divided by purpose into main memory (internal memory) and auxiliary memory (external memory); another classification divides it into external memory and internal memory. The external memory is usually a magnetic medium, an optical disk, or the like, and can store information for a long period of time. The internal memory refers to the storage component on the motherboard that holds the data and programs currently being executed; it is used only for the temporary storage of programs and data, which are lost when the power is turned off.
In some embodiments, the computer device 20 may further comprise: a power supply assembly 23 configured to perform power management of the computer device 20, a wired or wireless network interface 24 configured to connect the computer device 20 to a network, and an input output (I/O) interface 25. The computer device 20 may operate based on an operating system stored in the memory 22, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In some embodiments, power supply component 23 provides power to the various components of computer device 20. The power supply components 23 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the computer device 20.
In some embodiments, the wired or wireless network interface 24 is configured to facilitate communication between the computer device 20 and other devices, either wired or wireless. The computer device 20 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof.
In some embodiments, the wired or wireless network interface 24 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the wired or wireless network interface 24 also includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In some embodiments, input output (I/O) interface 25 provides an interface between processor 21 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
Fig. 12 is a block diagram of a computer-readable storage medium 30 provided by an embodiment of the present application. The computer readable storage medium 30 has stored thereon a computer program 31, wherein the computer program 31, when executed by a processor, implements the method of alignment of audio text as described above.
The units integrated from the functional units in the various embodiments of the present application may be stored in the computer-readable storage medium 30 if implemented in the form of software functional units and sold or used as separate products. Based on this understanding, the technical solution of the present application, in essence, in part, or in whole, may be embodied in the form of a software product; the computer-readable storage medium 30 stores a computer program 31 including several instructions for causing a computer device (which may be a personal computer, a system server, a network device, etc.), an electronic device (such as an MP3 or MP4 player, a smart terminal such as a mobile phone, tablet computer, or wearable device, or a desktop computer), or a processor to perform all or part of the steps of the methods according to the embodiments of the present application.
Fig. 13 is a block diagram of a computer program product 40 provided by an embodiment of the present application. The computer program product 40 comprises program instructions 41, which are executable by a processor of the computer device 20 to implement the audio text alignment method described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as an audio text alignment method, an audio text alignment apparatus 10, a computer device 20, a computer-readable storage medium 30, or a computer program product 40. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product 40 embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods of aligning audio text, apparatus 10 for aligning audio text, computer device 20, computer-readable storage medium 30, or computer program product 40 according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program product 40. These computer program products 40 may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the program instructions 41, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program products 40 may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the program instructions 41 stored in the computer program product 40 produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These program instructions 41 may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the program instructions 41 which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that the descriptions of the above methods, apparatuses, electronic devices, computer-readable storage media, computer program products and the like according to the method embodiments may further include other implementations, and specific implementations may refer to descriptions of related method embodiments, which are not described herein in detail.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method of alignment of audio text, the method comprising:
acquiring audio and text to be aligned;
extracting at least one inter-word trip point from the audio; the inter-word hopping points are located between two adjacent characters in the text content expressed by the audio;
based on the inter-word hopping points, probability adjustment is carried out on the predicted transition probabilities of the texts belonging to various phoneme states respectively, so that new transition probabilities are obtained; the probability adjustment is used for adjusting the predicted transition probability between two adjacent characters in the text to be preset transition probability, and the preset transition probability represents the transition probability of different phoneme states between the two adjacent characters;
And aligning the audio and the text based on the new transition probability to obtain an aligned audio text.
2. The method of claim 1, further comprising, prior to said probability adjustment of the predicted transition probabilities of the respective phoneme states of the text based on the inter-word hopping points:
framing the audio to obtain an audio frame sequence;
performing phoneme conversion processing on the text based on the audio frame sequence to obtain a phoneme sequence; the number of phonemes in the phoneme sequence corresponds to the number of audio frames in the audio frame sequence;
and obtaining the prediction transition probabilities of each phoneme belonging to various phoneme states in the phoneme sequence.
3. The method of claim 2, wherein said obtaining the predicted transition probabilities of each phoneme in the sequence of phonemes belonging to each phoneme state comprises:
obtaining the prediction transition probabilities of each phoneme belonging to various hidden Markov states in the phoneme sequence; the predictive transition probabilities include inter-state transition probabilities that characterize a probability in the phoneme sequence that a first hidden Markov state of a previous phoneme transitions to a second hidden Markov state of a subsequent phoneme, and the first hidden Markov state is different from the second hidden Markov state;
Based on the inter-word hopping points, the probability adjustment is performed on the predicted transition probabilities of the texts belonging to various phoneme states respectively, and the method comprises the following steps:
and carrying out probability adjustment on the transition probability between states of part of phonemes in the phoneme sequence based on the inter-word hopping points.
4. A method according to claim 3, wherein said probability adjustment of state transition probabilities of partial phones in the phone sequence based on the inter-word hopping points comprises:
in the audio, determining two adjacent target audio frames corresponding to the positions of the inter-word hopping points;
respectively determining two target phonemes corresponding to the two target audio frames in the phoneme sequence;
and adjusting the transition probability between the predicted states of the next target phoneme in the two target phonemes to be a preset transition probability.
5. The method of claim 1, wherein the extracting at least one inter-word trip point from the audio comprises:
detecting the fundamental frequency of the audio frequency, and determining a first class trip point in the audio frequency; the first class trip point characterizes that a trip of a base frequency value exists between two adjacent audio frames corresponding to the audio;
Performing energy detection on the audio frequency, and determining a second class of jump points in the audio frequency; the second class of hopping points represent hopping of energy values between two adjacent audio frames in the audio;
and extracting at least one hopping point from the first class of hopping points and the second class of hopping points as an inter-word hopping point.
6. The method of claim 5, wherein the determining a first type of trip point in the audio comprises:
performing Fourier transform on respective fundamental frequency values of each audio frame in the audio respectively, and determining amplitude spectrum data aiming at the audio;
when the amplitude variation degree between two adjacent audio frames in the amplitude spectrum data exceeds a preset degree, determining that a first type of jump points exist between the two adjacent audio frames;
and collecting the first class of hopping points in the audio frequency to obtain a first class of hopping point sequence aiming at the audio frequency.
7. The method of claim 5, wherein the determining the second class of trip points in the audio comprises:
in each audio frame of the audio, when the degree of change between the energy values of two adjacent audio frames exceeds a preset degree, determining that a second class of jump points exist between the two adjacent audio frames;
And collecting the second class of hopping points in the audio frequency to obtain a second class of hopping point sequence aiming at the audio frequency.
8. The method according to claim 6 or 7, wherein said extracting at least one hopping point from said first class of hopping points and said second class of hopping points as an inter-word hopping point comprises:
determining intersection hopping points from the first class of hopping points in the first class of hopping point sequence and the second class of hopping points in the second class of hopping point sequence, and taking the intersection hopping points as the inter-word hopping points of the audio; the number of intersection hopping points is at least one.
9. The method of claim 1, wherein aligning the audio and the text based on the new transition probabilities results in aligned audio text, comprising:
acquiring the emission probabilities of the audios respectively belonging to various phoneme states;
decoding the emission probability and the new transition probability, and determining an optimal phoneme state corresponding to each audio frame in the audio;
determining a time correspondence between each audio frame and each phoneme in the text based on the optimal phoneme state of each audio frame;
And aligning the audio and the text based on the time corresponding relation to obtain an aligned audio text.
10. A computer device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the method of alignment of audio text as claimed in any one of claims 1 to 9.
11. A computer readable storage medium comprising program data, wherein the program data, when executed by a processor of a computer device, enables the computer device to perform the method of alignment of audio text as claimed in any one of claims 1 to 9.
CN202310553682.8A 2023-05-16 2023-05-16 Audio text alignment method, computer device and computer readable storage medium Pending CN116757175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310553682.8A CN116757175A (en) 2023-05-16 2023-05-16 Audio text alignment method, computer device and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN116757175A (en) 2023-09-15

Family

ID=87950426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310553682.8A Pending CN116757175A (en) 2023-05-16 2023-05-16 Audio text alignment method, computer device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116757175A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination