CN115050351A - Method and device for generating timestamp and computer equipment - Google Patents

Method and device for generating timestamp and computer equipment

Info

Publication number
CN115050351A
Authority
CN
China
Prior art keywords
language
audio
target audio
frame
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210310807.XA
Other languages
Chinese (zh)
Inventor
王武城
赵伟峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210310807.XA
Publication of CN115050351A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The application discloses a method and a device for generating a timestamp and computer equipment, applied to the field of computer technologies. The method comprises the following steps: extracting acoustic features from a target audio, wherein the target audio corresponds to a first language; acquiring a phoneme sequence corresponding to the target audio, wherein the phoneme sequence is from a phoneme set of a second language; inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio; and determining the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determining a timestamp for each speech word accordingly. By this method, timestamps for dialect lyrics can be generated automatically.

Description

Method and device for generating timestamp and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a timestamp, a computer device, and a computer-readable storage medium.
Background
Automatic lyric timestamping takes a song's audio and the corresponding text content as input and uses an automatic alignment algorithm to obtain the start time and end time of each sung character in the audio. Automatic alignment of lyrics greatly reduces the labor cost of timestamp annotation and simplifies a musician's production process.
Common song languages include Mandarin Chinese, English, and Cantonese, all of which have strict, standardized, and unified phonetic-symbol or phoneme dictionaries that can be used to train an automatic alignment model and thereby generate lyric timestamps automatically. However, dialect songs, for example in Minnan (Hokkien), Sichuanese, and Henan dialect, are also an important part of Chinese songs. These dialects differ from Mandarin: the Mandarin phoneme dictionary cannot be used directly, no pronunciation dictionaries exist for the relevant dialects, and the same character is pronounced differently in different dialects without any uniform rule. How to generate timestamps for dialect lyrics automatically is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a computer device, and a computer-readable storage medium for generating a timestamp, by which timestamps for dialect lyrics can be generated automatically.
In one aspect, an embodiment of the present application provides a method for generating a timestamp, where the method includes:
extracting acoustic features from a target audio, wherein the target audio corresponds to a first language;
acquiring a phoneme sequence corresponding to the target audio, wherein the phoneme sequence is from a phoneme set of a second language;
inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio;
and determining the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determining timestamps of the speech words accordingly.
In one aspect, an embodiment of the present application provides an apparatus for generating a timestamp, where the apparatus includes a processing unit and a determining unit:
the processing unit is used for extracting acoustic features from target audio, and the target audio corresponds to a first language;
the processing unit is further used for acquiring a phoneme sequence corresponding to the target audio, wherein the phoneme sequence is from a phoneme set of a second language;
the processing unit is further used for inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio;
and the determining unit is used for determining the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determining timestamps of the speech words accordingly.
In one aspect, embodiments of the present application provide a computer device, which includes a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the above method for generating a timestamp.
In one aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program is read and executed by a processor of a computer device, the computer program causes the computer device to perform the above method for generating a timestamp.
In one aspect, embodiments of the present application provide a computer program product. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above method for generating a timestamp.
In the method provided by the application, acoustic features are first extracted from a target audio, where the target audio corresponds to a first language, and a phoneme sequence corresponding to the target audio is acquired, where the phoneme sequence is from a phoneme set of a second language; then, the phoneme sequence corresponding to the target audio and the acoustic features of the target audio are input into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio; finally, the speech words corresponding to each frame of the target audio are determined based on the state corresponding to each frame of audio, and timestamps of the speech words are determined accordingly. In a scenario where the first language has no corresponding phoneme set, the phoneme sequence corresponding to the target audio is represented based on the phoneme set of the second language, so that the phoneme sequence of the target audio is converted into a state sequence and the speech timestamps of the target audio are obtained. Based on the method described in the application, even when the first language corresponding to the target audio has no phoneme set, the speech timestamps of the target audio can be generated automatically, which not only reduces the workload of annotators but also yields more accurate speech timestamps.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a system for generating timestamps according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a method for generating a timestamp according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a phoneme sequence and a state sequence provided by an embodiment of the present application;
fig. 4 is a schematic diagram of the states corresponding to time frames provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of another method for generating a timestamp according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of another method for generating a timestamp according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for generating a timestamp according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that references to "first", "second", etc. in the embodiments of the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a technical feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
In the embodiments of the present application, Artificial Intelligence (AI) technology is involved. Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
The method provided by the application belongs to the speech processing technology within artificial intelligence. Automatic lyric timestamping takes a song's audio and the corresponding text content as input and uses an automatic alignment algorithm to obtain the start time and end time of each word in the audio. Generating lyric timestamps automatically through a lyric auto-alignment algorithm reduces the cost of manual annotation and simplifies the audio production process for musicians.
Common song languages include Mandarin Chinese, English, and Cantonese, all of which have strict, standardized, and unified phonetic-symbol or phoneme dictionaries that can be used to train an automatic alignment model and thereby generate lyric timestamps automatically. However, dialect songs, for example in Minnan (Hokkien), Sichuanese, and Henan dialect, are also an important part of Chinese songs. These dialects differ from Mandarin, the Mandarin phoneme dictionary cannot be used directly, and the same characters are pronounced differently in different dialects without any uniform rule. Training an auto-alignment model for dialect songs is therefore a huge challenge.
In a specific implementation, the above method of generating a timestamp may be performed by a computer device, which may be a terminal device or a server. The terminal device may be, for example, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or a smart vehicle, but is not limited thereto; the server may be, for example, an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, and big data and artificial intelligence platforms.
Alternatively, the above timestamp generation method may be performed by the terminal device and the server together. For example, see the illustration in FIG. 1: the terminal device 101 may acquire the target audio and send it to the server 102. Correspondingly, after receiving the target audio, the server 102 may extract acoustic features and then obtain the phoneme sequence corresponding to the target audio; the server 102 then inputs the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio; finally, the server 102 determines the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, determines the timestamps of the speech words accordingly, and sends the timestamps of the speech words to the terminal device.
For the scenario in which the first language corresponding to the target audio has no phoneme set, the method maps the speech in the target audio to a phoneme sequence based on the phoneme set of the second language, and generates the speech timestamps of the target audio based on that phoneme sequence. Based on this method, even when the first language corresponding to the target audio has no phoneme set, the speech timestamps of the target audio can be generated automatically, which not only reduces the workload of annotators but also yields more accurate speech timestamps.
It should be understood that the system architecture diagram described in the embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not limit it. As those of ordinary skill in the art know, with the evolution of system architectures and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is equally applicable to similar technical problems.
Based on the above description, the method for generating a timestamp according to the embodiment of the present application is further described below with reference to the flowchart shown in fig. 2. In the embodiment of the present application, the method for generating a timestamp by the aforementioned computer device is mainly taken as an example for explanation. Referring to fig. 2, the method for generating a timestamp may specifically include steps S201 to S204:
S201: the computer device extracts acoustic features from the target audio, where the target audio corresponds to a first language.
In the embodiment of the present application, the target audio refers to audio containing speech, such as a song or a recording. That the target audio corresponds to the first language means that the speech in the target audio is in the first language; the first language may be a foreign language, a dialect, and the like. An acoustic feature is a physical quantity that represents the acoustic characteristics of speech, a general term for the acoustic representation of the elements of sound, and can be used to analyze the characteristics of the target audio; optionally, the acoustic features may be Mel-frequency cepstral coefficients (MFCC). The target audio may be audio data local to the computer device, or audio data sent to the computer device by another device, which is not limited in this embodiment of the application.
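For illustration, the following is a minimal Python sketch of this step. It assumes the librosa library and a local audio file; the sampling rate, frame length, hop length, and file name are illustrative choices, not values taken from this disclosure.

    import librosa

    def extract_acoustic_features(path: str, n_mfcc: int = 13):
        # Load the target audio as a mono waveform at 16 kHz.
        y, sr = librosa.load(path, sr=16000)
        # Compute MFCCs frame by frame; 25 ms windows with a 10 ms hop.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)
        return mfcc, sr  # mfcc has shape (n_mfcc, num_frames)

    mfcc, sr = extract_acoustic_features("target_audio.wav")  # hypothetical file
    print(mfcc.shape)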
S202: the computer device obtains a phoneme sequence corresponding to the target audio, where the phoneme sequence is from a phoneme set of a second language.
In the embodiment of the present application, a phoneme is the minimum speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, the Chinese syllable ā ("ah") has only one phoneme, ài ("love") has two phonemes, dài ("generation") has three phonemes, and so on. The second language is a language different from the first language whose pronunciation is similar to that of the first language; it is usually a common language with a common phoneme dictionary, for example Mandarin Chinese. The phoneme set of the second language refers to the set of phonemes used to express pronunciation in the second language; for example, if the second language is Mandarin Chinese, the phoneme set of the second language may be pinyin, and if the second language is English, the phoneme set of the second language may be phonetic symbols. It should be noted that the phoneme set may have forms other than pinyin, which is not limited in this embodiment. Using the phoneme set of the second language to represent the phoneme sequence corresponding to the target audio effectively solves the problem that the first language corresponding to the target audio has no phoneme set. Illustratively, the target audio is audio of speaker Xiao A saying "shoes" in Sichuanese; the Sichuanese pronunciation of "shoes" is close to the Mandarin pronunciation of "children" (hái zi), so according to the Mandarin phoneme set the phoneme sequence corresponding to the target audio may be represented as "hái zi". As a further example, the target audio is audio of speaker Xiao A saying "eat a meal" in Shanghainese; the Shanghainese pronunciation of "eat" is close to a Mandarin syllable such as "qiā", and "meal" is pronounced the same as the Mandarin "fàn", so according to the Mandarin phoneme set the phoneme sequence corresponding to the target audio may be represented as "qiā fàn".
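As a toy illustration of this representation, the Python sketch below stores dialect words with pronunciations transcribed in the second language's phoneme set. The data structure and the initial/final-with-tone-number split are assumptions; the entries are the patent's own examples.

    # Dialect words transcribed with Mandarin phonemes (initial/final units with
    # tone numbers, a common convention; the exact split here is an assumption).
    dialect_pronunciations = {
        "shoes (Sichuanese)": ["h", "ai2", "z", "i"],           # like Mandarin "hái zi"
        "eat a meal (Shanghainese)": ["q", "ia1", "f", "an4"],  # approx. "qiā fàn"
    }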
In a possible implementation, the computer device obtains the phoneme sequence corresponding to the target audio as follows: the computer device inputs the acoustic features of the target audio into a second language identification model to obtain the phoneme sequence corresponding to the target audio. The second language identification model is used to convert the speech in audio into a phoneme sequence. Illustratively, the target audio is audio of speaker Xiao A saying "shoes" in Sichuanese, so the first language corresponding to the target audio is Sichuanese; since the Sichuanese pronunciation of "shoes" is similar to the Mandarin pronunciation of "children", the second language identification model recognizes the target audio as "hái zi". Based on this implementation, staff can generate the phoneme sequence of the target audio automatically through the second language identification model, which simplifies the operation and improves working efficiency.
In another possible implementation, the computer device obtains the phoneme sequence corresponding to the target audio as follows: the computer device acquires the speech text of the target audio, and then determines the phoneme sequence corresponding to the target audio based on the speech text of the target audio and the pronunciation dictionary of the first language. The speech text is the words corresponding to the speech in the target audio; it may be a text stored locally by the computer device or sent to the computer device by another device. The pronunciation dictionary of the first language represents the correspondence between words of the first language and phonemes; it may be constructed from a plurality of sample audios and the speech texts corresponding to those sample audios, and the specific construction process is described in the embodiment corresponding to FIG. 5, which is not repeated here. For example, if the text corresponding to the target audio is "shoes" and the first language corresponding to the target audio is Sichuanese, the phoneme sequence corresponding to the target audio may be represented as "hái zi" according to the correspondence between words and phonemes in the pronunciation dictionary of the first language. Based on this implementation, the computer device can obtain the phoneme sequence corresponding to the target audio from the speech text alone, and the resulting phoneme sequence is more accurate.
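A minimal sketch of such a dictionary lookup, assuming a dictionary keyed by words; all names here are hypothetical:

    def text_to_phonemes(words, pronunciation_dict):
        # Concatenate the phoneme sequence of each word of the speech text.
        phonemes = []
        for w in words:
            phonemes.extend(pronunciation_dict[w])
        return phonemes

    pron_dict = {"shoes": ["h", "ai2", "z", "i"]}
    print(text_to_phonemes(["shoes"], pron_dict))  # ['h', 'ai2', 'z', 'i']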
S203: the computer device inputs the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio.
In the embodiment of the application, states are the audio representation of minimum granularity: one phoneme may correspond to one or more states, and one frame of acoustic features in the audio data corresponds to one state. For example, as shown in FIG. 3, the text corresponding to the speech in the target audio is "I and you", whose phoneme sequence may be represented as "wǒ hé nǐ"; the total length of the target audio corresponds to 10 frames, "w" corresponds to two frames, and "ǒ" also corresponds to two frames. The two frames corresponding to "w" are in different states: the state corresponding to the 1st frame is state 1, and the state corresponding to the 2nd frame is state 2.
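A sketch of the phoneme-to-state expansion implied here; the two-states-per-phoneme layout follows the FIG. 3 example, while real systems often use other layouts, so this is an assumption:

    def phonemes_to_states(phonemes, states_per_phoneme=2):
        # Each phoneme expands into a fixed number of named HMM states.
        return [f"{p}_s{k + 1}" for p in phonemes
                for k in range(states_per_phoneme)]

    print(phonemes_to_states(["w", "o3"]))  # ['w_s1', 'w_s2', 'o3_s1', 'o3_s2']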
In one possible implementation, the computer device inputs the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio as follows: the computer device calls the language model in the first language alignment model to process the phoneme sequence and obtain the state sequence corresponding to the target audio; the computer device inputs the state sequence corresponding to the target audio and the acoustic features of the target audio into the acoustic model of the first language alignment model to determine the probabilities of the multiple states corresponding to each frame of audio; and the computer device obtains the state corresponding to each frame of audio of the target audio based on the preset state transition probability and the probabilities of the multiple states corresponding to each frame of audio in the target audio.
The first language alignment model comprises two models: a language model and an acoustic model. The language model in the first language alignment model is used to convert the phoneme sequence into a state sequence; for example, the language model may be a Hidden Markov Model (HMM), or another model such as an end-to-end singing voice recognition model, which is not limited in this embodiment. The acoustic model in the first language alignment model is used to determine the probability of the state corresponding to each frame of audio of the target audio according to the acoustic features and the state sequence of the target audio. The acoustic model may be a Gaussian Mixture Model (GMM) or a Deep Neural Network (DNN) model. Alternatively, the acoustic model may be obtained by training an acoustic model adapted to the second language on samples of the first language, where the second language acoustic model is used to determine the probability of the state corresponding to each frame of audio in second-language audio data according to the acoustic features and state sequence of that audio data; a specific implementation is described in the embodiment corresponding to FIG. 6.
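For concreteness, the following is a minimal PyTorch sketch of a DNN-style acoustic model that maps one frame of acoustic features to scores over states. The layer sizes and state count are assumptions, not values from this disclosure:

    import torch.nn as nn

    class DNNAcousticModel(nn.Module):
        def __init__(self, feat_dim: int = 13, num_states: int = 120):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, num_states),  # one score (logit) per HMM state
            )

        def forward(self, frames):  # frames: (num_frames, feat_dim)
            # Returns per-frame state logits; apply log_softmax to obtain the
            # per-frame state log-probabilities used during decoding.
            return self.net(frames)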
It should be noted that the preset state transition probability refers to the probability that each state in the target audio transitions to another state, or remains in the same state, in the next frame. For example, given that the first frame is in state 1, the probability that the next frame is still in state 1 may be 20%, and the probability that the next frame is in state 2 may be 80%.
Illustratively, as shown in FIG. 4, the target audio is a recording of "I and you" whose total length corresponds to 10 frames. The computer device may determine the state sequence corresponding to the target audio according to the previous steps; from that state sequence, the 10 frames of the target audio may correspond to 5 states, namely state 1, state 2, state 3, state 4, and state 5. Through the acoustic model, it can be determined that in the first frame the probability of state 1 is 50% and the probability of state 2 is 20%, while in the second frame the probability of state 1 is 10% and the probability of state 2 is 0%, and so on. According to the preset state transition probability, if the first frame is known to be in state 1, the probability that the second frame is in state 1 is 60%, and the probability that the second frame is in state 2 is 20%. A decoding graph is formed from the probabilities of the multiple states corresponding to each frame of audio in the target audio and the state transition probability of each state, and a globally optimal decoding path is searched on the decoding graph with the Viterbi decoding algorithm, thereby determining the state corresponding to each frame of audio of the target audio. Based on this implementation, the speech timestamps can be generated more accurately.
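A minimal sketch of Viterbi decoding over such a graph, assuming per-frame state log-probabilities (e.g. from the acoustic model) and a log-domain transition matrix as hypothetical inputs:

    import numpy as np

    def viterbi(log_emit, log_trans):
        # log_emit: (T, S) per-frame state log-probs; log_trans: (S, S).
        T, S = log_emit.shape
        delta = np.full((T, S), -np.inf)  # best path score ending in each state
        backptr = np.zeros((T, S), dtype=int)
        delta[0] = log_emit[0]
        for t in range(1, T):
            # Score of reaching state j at time t from every predecessor i.
            scores = delta[t - 1][:, None] + log_trans
            backptr[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_emit[t]
        # Trace the globally optimal decoding path backwards.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return path[::-1]  # one state index per frame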
Optionally, the method further comprises: the computer device trains a first language association model based on a plurality of sample texts, where the sample texts correspond to the first language; and the computer device determines the preset state transition probability based on the pronunciation dictionary of the first language and the first language association model. The first language association model may be an N-gram model; because the language habits of the first language differ from those of the second language, the probability of each phrase appearing in sample texts of the first language also differs from that in sample texts of the second language. The N-gram model counts the occurrence probability of every group of N words. For example, when N is 3, the frequency of trigrams such as "I-and-you" in the lyric text is counted, from which the probability of each triple can be obtained; the larger N is, the richer the context information of the words. The association model of the first language is used to determine the preset state transition probability corresponding to the first language, which makes the state corresponding to each frame of audio of the target audio, and hence the generated speech timestamps, more accurate.
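A small sketch of estimating trigram probabilities from first-language sample texts, as described above; tokenization and smoothing are omitted and all names are assumptions:

    from collections import Counter

    def trigram_probs(sentences):
        tri, bi = Counter(), Counter()
        for tokens in sentences:
            for i in range(len(tokens) - 2):
                tri[tuple(tokens[i:i + 3])] += 1
                bi[tuple(tokens[i:i + 2])] += 1
        # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
        return {g: c / bi[g[:2]] for g, c in tri.items()}

    probs = trigram_probs([["I", "and", "you"], ["I", "and", "you"]])
    print(probs[("I", "and", "you")])  # 1.0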
S204: the computer device determines the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determines the timestamps of the speech words accordingly.
In the embodiment of the application, after the state corresponding to each frame of audio is determined, the computer device can determine the phoneme corresponding to each frame of audio according to the correspondence between states and phonemes, and then determine the speech words corresponding to each frame according to the correspondence between phonemes and speech text. The time period corresponding to each word in the target audio can thus be determined from the speech words corresponding to each frame, yielding the speech timestamps of the target audio.
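A minimal sketch of this final step: group consecutive frames whose states map to the same word, then convert frame indices to times. The state-to-word mapping and the 10 ms frame hop are assumptions for illustration:

    def word_timestamps(frame_states, state_to_word, hop_seconds=0.010):
        stamps, start, current = [], 0, None
        for i, s in enumerate(frame_states + [None]):  # sentinel flushes the last word
            word = state_to_word.get(s) if s is not None else None
            if word != current:
                if current is not None:
                    # Close the segment for the previous word.
                    stamps.append((current, start * hop_seconds, i * hop_seconds))
                current, start = word, i
        return stamps  # [(word, start_time_s, end_time_s), ...]

    mapping = {"w_s1": "I", "w_s2": "I", "o3_s1": "I", "o3_s2": "I"}
    print(word_timestamps(["w_s1", "w_s2", "o3_s1", "o3_s2"], mapping))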
Based on the method described in the embodiment of the present application, in a scenario where the first language has no corresponding phoneme set, the phoneme sequence corresponding to the target audio can be represented with the phoneme set of the second language, so that the phoneme sequence of the target audio is converted into a state sequence and the speech timestamps of the target audio are obtained. Based on this method, even when the first language corresponding to the target audio has no phoneme set, the speech timestamps of the target audio can be generated automatically, which not only reduces the workload of annotators but also yields more accurate speech timestamps.
Referring to FIG. 5, which is a flowchart of another method for generating a timestamp disclosed in an embodiment of the present invention, the method may be executed by a computer device, which may specifically be the server 102 in the timestamp generation system. This embodiment mainly explains the process of constructing the pronunciation dictionary of the first language. The method may specifically include steps S501 to S506. Wherein:
S501: the computer device extracts acoustic features from a first sample audio, where the first sample audio corresponds to the first language.
In this embodiment of the present application, the number of first sample audios may be one or more, which is not limited here; the first sample audio corresponds to the first language and is used to train the pronunciation dictionary of the first language.
S502: the computer device inputs the acoustic features of the first sample audio into the second language identification model to obtain the phoneme sequence corresponding to the first sample audio.
The second language identification model is the same as the one described in step S202 and is used to convert the speech in audio into a phoneme sequence, which is not repeated here.
In one possible implementation, the second language identification model may be a hybrid model composed of a DNN model and an HMM model. The specific implementation is as follows: the computer device inputs the acoustic features of the first sample audio into the DNN model of the second language identification model to obtain the probabilities of the multiple states corresponding to each frame of audio of the first sample audio; the computer device inputs those probabilities and the preset state transition probability into the HMM model of the second language identification model to obtain the state sequence corresponding to the first sample audio; and the computer device can then determine the phoneme sequence corresponding to the first sample audio according to the correspondence between the states of the second language and the phonemes of the second language. Based on this implementation, when the first language has no corresponding phoneme set or pronunciation dictionary, the phoneme sequence of the first sample audio can be represented through the second language identification model based on the phoneme set of the second language, which facilitates the construction of a more accurate pronunciation dictionary of the first language.
S503: the computer device constructs the pronunciation dictionary of the first language based on the phoneme sequence corresponding to the first sample audio and the speech text of the first sample audio.
In this embodiment, the computer device may construct the pronunciation dictionary of the first language according to the correspondence between the phoneme sequence corresponding to the first sample audio and the speech text of the first sample audio. For example, if the first sample audio may be represented as "wǒ hé nǐ" and the speech text of the first sample is "I and you", then the phoneme sequence corresponding to "I and you" may be recorded as "wǒ hé nǐ" in the pronunciation dictionary of the first language.
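A sketch of this construction step, assuming paired (recognized phoneme sequence, speech text) samples; conflict handling between samples with the same text is omitted:

    def build_pronunciation_dict(samples):
        pron_dict = {}
        for phonemes, text in samples:
            # Record the second-language phoneme sequence under the first-language text.
            pron_dict.setdefault(text, phonemes)
        return pron_dict

    samples = [(["w", "o3", "h", "e2", "n", "i3"], "I and you")]
    print(build_pronunciation_dict(samples))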
S504: the computer device extracts acoustic features from the target audio and obtains the phoneme sequence corresponding to the target audio, where the target audio corresponds to the first language and the phoneme sequence is from the phoneme set of the second language.
S505: the computer device inputs the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio.
S506: the computer device determines the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determines the timestamps of the speech words accordingly.
The implementation of steps S504 to S506 is the same as that of steps S201 to S204 and is not repeated here.
Referring to FIG. 6, which is a flowchart of another method for generating a timestamp disclosed in an embodiment of the present invention, the method may be executed by a computer device, which may specifically be the server 102 in the timestamp generation system. This embodiment mainly explains the process of training the acoustic model. The method may specifically include steps S601 to S607. Wherein:
S601: the computer device extracts acoustic features from a second sample audio, where the second sample audio corresponds to the first language.
In this embodiment, the number of the second sample audios may be one or more, which is not limited in this embodiment.
S602: the computer device determines the phoneme sequence corresponding to the second sample audio according to the pronunciation dictionary of the first language and the speech text of the second sample audio.
In the embodiment of the present application, the pronunciation dictionary of the first language is constructed as shown in FIG. 5 and represents the correspondence between words of the first language and phonemes, so the computer device may determine the phonemes corresponding to the words in the speech text of the second sample audio, and thereby determine the phoneme sequence corresponding to the second sample audio.
S603: the computer device calls the language model in the first language alignment model to convert the phoneme sequence into a first state sequence, and inputs the acoustic features of the second sample audio into the acoustic model of the first language alignment model to obtain a second state sequence.
In the embodiment of the present application, the language model in the first language alignment model is used to convert the phoneme sequence into a state sequence, and the acoustic model of the first language alignment model is used to convert the acoustic features into a state sequence; both are described in step S203 and are not repeated here. Optionally, before training, the acoustic model may be a trained acoustic model adapted to the second language, since no acoustic model adapted to the first language exists yet.
S604: the computer device trains the acoustic model of the first language alignment model based on the first state sequence and the second state sequence.
In this embodiment of the present application, because the first state sequence is derived from the speech text of the second sample audio, it may be regarded as an initialization label; the parameters of the acoustic model of the first language alignment model are then trained and adjusted continuously so that the second state sequence approaches the first state sequence, yielding a final acoustic model suited to the first language. Based on this implementation, the obtained acoustic model fits the first language better, so the subsequently generated timestamps are more accurate.
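A minimal PyTorch training sketch of this adaptation step: the first state sequence serves as per-frame labels and the model is tuned so its predicted states approach them. The model is assumed to return per-frame logits (as in the earlier sketch); all names and hyperparameters are assumptions:

    import torch
    import torch.nn as nn

    def adapt_acoustic_model(model, frames, first_state_seq, epochs=10, lr=1e-4):
        # frames: (num_frames, feat_dim) tensor; first_state_seq: (num_frames,)
        # tensor of state ids derived from the second sample audio's speech text.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(frames)                   # per-frame state scores
            loss = loss_fn(logits, first_state_seq)  # pull predictions toward labels
            loss.backward()
            optimizer.step()
        return model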
S605: the computer device extracts acoustic features from the target audio and obtains the phoneme sequence corresponding to the target audio, where the target audio corresponds to the first language and the phoneme sequence is from the phoneme set of the second language.
S606: the computer device inputs the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio.
S607: the computer device determines the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determines the timestamps of the speech words accordingly.
The implementation of steps S605 to S607 is the same as that of steps S201 to S204 and is not repeated here.
Based on the above timestamp generation method, an embodiment of the present application provides a timestamp generation apparatus. Referring to FIG. 7, which is a schematic structural diagram of a timestamp generation apparatus provided in an embodiment of the present application, the timestamp generation apparatus 700 may run the following units:
the processing unit 701 is configured to extract an acoustic feature from a target audio, where the target audio corresponds to a first language;
a processing unit 701, configured to obtain a phoneme sequence corresponding to a target audio, where the phoneme sequence is from a phoneme set of a second language;
the processing unit 701 is further configured to input a phoneme sequence corresponding to the target audio and an acoustic feature of the target audio into the first language alignment model to obtain a state corresponding to each frame of audio of the target audio;
the determining unit 702 is configured to determine, based on the state corresponding to each frame of audio, a speech word corresponding to each frame of target audio, and determine a timestamp of the speech word according to the speech word.
In an embodiment, when the phoneme sequence corresponding to the target audio and the acoustic features of the target audio are input into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio, the processing unit 701 may be specifically configured to: call the language model in the first language alignment model to process the phoneme sequence and obtain the state sequence corresponding to the target audio; input the state sequence corresponding to the target audio and the acoustic features of the target audio into the acoustic model of the first language alignment model to determine the probabilities of the multiple states corresponding to each frame of audio; and obtain the state corresponding to each frame of audio of the target audio based on the preset state transition probability and the probabilities of the multiple states corresponding to each frame of audio in the target audio.
In one embodiment, the processing unit 701 is further configured to: training based on a plurality of sample texts to obtain a first language association model, wherein the sample texts correspond to a first language; the preset state transition probability is determined based on a pronunciation dictionary of the first language and the first language association model.
In one embodiment, when determining the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, the processing unit 701 may be specifically configured to determine the speech words corresponding to each frame of the target audio based on the pronunciation dictionary of the first language and the state corresponding to each frame of audio.
In one embodiment, the processing unit 701 is further configured to: extracting acoustic features from a first sample audio, the first sample audio corresponding to a first language; inputting the acoustic characteristics of the first sample audio into a second language identification model to obtain a phoneme sequence corresponding to the first sample audio; and constructing a pronunciation dictionary of the first language based on the phoneme sequence corresponding to the first sample audio and the phonetic text of the first sample audio.
In one embodiment, the processing unit 701 is further configured to: extracting acoustic features from a second sample audio, the second sample audio corresponding to the first language; determining a phoneme sequence corresponding to the second sample audio according to the pronunciation dictionary of the first language and the voice text of the second sample audio; calling a language model in the first language alignment model to convert the phoneme sequence into a first state sequence; inputting the acoustic features of the second sample audio to the acoustic model of the first language alignment model to obtain a second state sequence; an acoustic model of the first language alignment model is trained based on the first sequence of states and the second sequence of states.
In an embodiment, when acquiring the phoneme sequence corresponding to the target audio, the processing unit 701 is specifically configured to: and inputting the acoustic characteristics of the target audio into the second language identification model to obtain a phoneme sequence corresponding to the target audio.
In an embodiment, when acquiring the phoneme sequence corresponding to the target audio, the processing unit 701 is specifically configured to: acquire the speech text of the target audio; and determine the phoneme sequence corresponding to the target audio based on the speech text of the target audio and the pronunciation dictionary of the first language.
According to another embodiment of the present application, the units in the timestamp generation apparatus shown in FIG. 7 may be combined, individually or all together, into one or several other units, or one of the units may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the timestamp generation apparatus may also include other units, and in practical applications these functions may be realized with the assistance of other units or through the cooperation of multiple units.
According to another embodiment of the present application, the timestamp generation apparatus shown in FIG. 7 may be constructed, and the timestamp generation method of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in FIG. 2 on a general-purpose computing device, such as a computer, that includes a processing element such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read-only storage medium (ROM), and other storage elements. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via the computer-readable recording medium.
In the embodiment of the application, acoustic features are extracted from the target audio, where the target audio corresponds to a first language; a phoneme sequence corresponding to the target audio is acquired, where the phoneme sequence is from the phoneme set of a second language; the phoneme sequence corresponding to the target audio and the acoustic features of the target audio are input into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio; and the speech words corresponding to each frame of the target audio are determined based on the state corresponding to each frame of audio, and the timestamps of the speech words are determined accordingly. In a scenario where the first language has no corresponding phoneme set, the phoneme sequence corresponding to the target audio is represented based on the phoneme set of the second language, so that the speech timestamps of the target audio are obtained. Based on the method described in the application, even when the first language corresponding to the target audio has no phoneme set, the speech timestamps of the target audio can be generated automatically, which not only reduces the workload of annotators but also yields more accurate speech timestamps.
Based on the above method and apparatus embodiments, an embodiment of the present application further provides a computer device. Referring to FIG. 8, the computer device 800 includes at least a processor 801, a communication interface 802, and a computer storage medium 803, which may be connected by a bus or in other ways. The computer storage medium 803 may be located in the memory 804 of the computer device 800 and is used to store a computer program comprising program instructions; the processor 801 is used to execute the program instructions stored in the computer storage medium 803. The processor 801 (or CPU) is the computing core and control core of the computer device; it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions to implement the corresponding method flow or function.
In an embodiment, the processor 801 of the embodiment of the present application may be configured to implement the method for generating a timestamp, specifically including: extracting acoustic features from a target audio, where the target audio corresponds to a first language; acquiring a phoneme sequence corresponding to the target audio, where the phoneme sequence is from a phoneme set of a second language; inputting the phoneme sequence corresponding to the target audio into a language model to obtain a state sequence corresponding to the target audio; inputting the state sequence corresponding to the target audio and the acoustic features of the target audio into an acoustic model to obtain the state corresponding to each frame of audio of the target audio; determining the speech timestamps of the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio; and the like.
An embodiment of the present application further provides a computer storage medium (Memory), which is a Memory device in a computer device and is used to store programs and data. It is understood that the computer storage medium herein may include both built-in storage media in the computer device and, of course, extended storage media supported by the computer device. Computer storage media provide storage space that stores an operating system for a computer device. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 801. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by a processor to implement the corresponding steps of the methods described above with respect to the embodiments of the method of generating timestamps shown in FIG. 2, FIG. 5, or FIG. 6; in particular implementations, one or more instructions in the computer storage medium are loaded by processor 801 and perform the following steps:
extracting acoustic features from a target audio, wherein the target audio corresponds to a first language;
acquiring a phoneme sequence corresponding to the target audio, wherein the phoneme sequence is from a phoneme set of a second language;
inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio;
and determining the speech words corresponding to each frame of the target audio based on the state corresponding to each frame of audio, and determining the timestamps of the speech words accordingly.
In one embodiment, when inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into the first language alignment model to obtain the state corresponding to each frame of audio of the target audio, the one or more instructions may be loaded by the processor and specifically executed to: call the language model in the first language alignment model to process the phoneme sequence and obtain the state sequence corresponding to the target audio; input the state sequence corresponding to the target audio and the acoustic features of the target audio into the acoustic model of the first language alignment model to determine the probabilities of the multiple states corresponding to each frame of audio; and obtain the state corresponding to each frame of audio of the target audio based on the preset state transition probability and the probabilities of the multiple states corresponding to each frame of audio in the target audio.
In one embodiment, the one or more instructions may further be loaded and executed by the processor to perform the following steps: training a first language association model based on a plurality of sample texts, wherein the sample texts correspond to the first language; and determining the preset state transition probability based on the pronunciation dictionary of the first language and the first language association model.
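As an illustration of how such transition probabilities might be estimated, the sketch below derives simple bigram relative frequencies from sample texts expanded through the first-language pronunciation dictionary. The names lexicon and sample_texts are assumptions, and the application does not commit to a bigram model:

    from collections import Counter, defaultdict

    # Sketch only. Assumptions: lexicon maps each first-language word to its
    # second-language phoneme sequence; sample_texts stands for the corpus
    # behind the first language association model.
    def estimate_transitions(sample_texts, lexicon):
        bigrams, unigrams = Counter(), Counter()
        for text in sample_texts:
            phones = [p for w in text.split() for p in lexicon.get(w, [])]
            unigrams.update(phones[:-1])           # denominator: bigram left contexts
            bigrams.update(zip(phones, phones[1:]))
        trans = defaultdict(dict)
        for (a, b), n in bigrams.items():
            trans[a][b] = n / unigrams[a]          # estimate of P(b | a)
        return trans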
In one embodiment, when determining the speech word corresponding to each frame in the target audio based on the state corresponding to each frame of audio, the one or more instructions may be loaded and specifically executed by the processor to: determine the speech word corresponding to each frame in the target audio based on the pronunciation dictionary of the first language and the state corresponding to each frame of audio.
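Once each frame carries a state, word timestamps follow from the frame hop. A sketch of that collapse step, assuming a hypothetical state_to_word mapping and a 10 ms hop (neither fixed by the application):

    # Sketch: collapse the per-frame state path into word-level timestamps.
    HOP_SECONDS = 0.010  # assumed 10 ms frame hop

    def words_with_timestamps(frame_states, state_to_word):
        spans, cur, start = [], None, 0
        for t, s in enumerate(frame_states):
            word = state_to_word[s]
            if word != cur:                        # a new word begins at frame t
                if cur is not None:
                    spans.append((cur, start * HOP_SECONDS, t * HOP_SECONDS))
                cur, start = word, t
        if cur is not None:                        # close the final word
            spans.append((cur, start * HOP_SECONDS, len(frame_states) * HOP_SECONDS))
        return spans                               # [(word, start_s, end_s), ...]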
In one embodiment, the one or more instructions may further be loaded and executed by the processor to perform the following steps: extracting acoustic features from a first sample audio, the first sample audio corresponding to the first language; inputting the acoustic features of the first sample audio into a second language recognition model to obtain a phoneme sequence corresponding to the first sample audio; and constructing a pronunciation dictionary of the first language based on the phoneme sequence corresponding to the first sample audio and the speech text of the first sample audio.
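One plausible reading of this construction step is sketched below: pair each transcript word with the phoneme subsequence the second-language model produced for it, then keep the most frequent pronunciation per word. The per-word segmentation of the recognizer output is assumed to be available, a step the application does not detail:

    from collections import Counter, defaultdict

    # Sketch of dictionary construction. Assumption: sample_items yields
    # (word, phoneme_sequence) pairs, i.e. the second-language recognizer's
    # phoneme output has already been segmented per transcript word.
    def build_lexicon(sample_items):
        variants = defaultdict(Counter)
        for word, phones in sample_items:
            variants[word][tuple(phones)] += 1
        # Keep the most frequent pronunciation per word; real lexica often
        # retain multiple variants instead.
        return {w: list(c.most_common(1)[0][0]) for w, c in variants.items()}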
In one embodiment, the one or more instructions may further be loaded and executed by the processor to perform the following steps: extracting acoustic features from a second sample audio, wherein the second sample audio corresponds to the first language; determining a phoneme sequence corresponding to the second sample audio according to the pronunciation dictionary of the first language and the speech text of the second sample audio; calling the language model in the first language alignment model to convert the phoneme sequence into a first state sequence; inputting the acoustic features of the second sample audio into the acoustic model of the first language alignment model to obtain a second state sequence; and training the acoustic model of the first language alignment model based on the first state sequence and the second state sequence.
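The training objective is left open; a common realization is frame-level cross-entropy between the acoustic model's state predictions (the second state sequence) and the reference states from the dictionary and language-model route (the first state sequence). A PyTorch sketch under that assumption, with the feature dimension (39) and state count (1200) chosen arbitrarily:

    import torch
    import torch.nn as nn

    # Sketch only. Assumptions: features is a float tensor of shape (T, 39)
    # from the second sample audio; ref_states is a long tensor of shape (T,)
    # holding the first state sequence; the acoustic model is a toy frame
    # classifier over 1200 states. Architecture and sizes are illustrative.
    model = nn.Sequential(nn.Linear(39, 256), nn.ReLU(), nn.Linear(256, 1200))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(features, ref_states):
        optimizer.zero_grad()
        logits = model(features)            # (T, 1200): second-state-sequence scores
        loss = loss_fn(logits, ref_states)  # pull predicted states toward references
        loss.backward()
        optimizer.step()
        return loss.item()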
In one embodiment, when acquiring the phoneme sequence corresponding to the target audio, the one or more instructions may be loaded and specifically executed by the processor to: input the acoustic features of the target audio into the second language recognition model to obtain the phoneme sequence corresponding to the target audio.
In one embodiment, when acquiring the phoneme sequence corresponding to the target audio, the one or more instructions may alternatively be loaded and specifically executed by the processor to: acquire the speech text of the target audio; and determine the phoneme sequence corresponding to the target audio based on the speech text of the target audio and the pronunciation dictionary of the first language.
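This second route is a straightforward dictionary lookup. A sketch, assuming the speech text (e.g., lyrics) is already segmented into words (itself non-trivial for unsegmented scripts) and that lexicon is the first-language pronunciation dictionary built earlier; both names are assumptions:

    # Sketch of the dictionary-based route to the phoneme sequence.
    def text_to_phonemes(transcript, lexicon):
        phones = []
        for word in transcript.split():
            if word not in lexicon:
                raise KeyError(f"no pronunciation recorded for {word!r}")
            phones.extend(lexicon[word])
        return phones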
In the embodiment of the present application, acoustic features are extracted from a target audio that corresponds to a first language; a phoneme sequence corresponding to the target audio is acquired, the phoneme sequence being from a phoneme set of a second language; the phoneme sequence and the acoustic features of the target audio are input into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio; and the speech word corresponding to each frame in the target audio is determined based on the state corresponding to each frame of audio, with the timestamp of each speech word determined accordingly. In a scenario where the first language has no phoneme set of its own, the phoneme sequence corresponding to the target audio is expressed using the phoneme set of the second language, and the speech timestamps of the target audio are obtained on that basis. With the method described in this application, speech timestamps can be generated automatically even when the first language corresponding to the target audio lacks a phoneme set, which both reduces manual workload and yields more accurate speech timestamps.
It is to be understood that the system architecture diagrams described in the embodiments of the present application are intended to illustrate the technical solutions of these embodiments more clearly and do not limit them; as a person of ordinary skill in the art will appreciate, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems as system architectures evolve and new service scenarios emerge.

Claims (10)

1. A method of generating a timestamp, the method comprising:
extracting acoustic features from a target audio, wherein the target audio corresponds to a first language;
acquiring a phoneme sequence corresponding to the target audio, wherein the phoneme sequence is from a phoneme set of a second language;
inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio;
and determining the speech word corresponding to each frame in the target audio based on the state corresponding to each frame of audio, and determining the timestamp of each speech word accordingly.
2. The method of claim 1, wherein the inputting the phoneme sequence corresponding to the target audio and the acoustic features of the target audio into a first language alignment model to obtain the state corresponding to each frame of audio of the target audio comprises:
calling a language model in the first language alignment model to process the phoneme sequence to obtain a state sequence corresponding to the target audio;
inputting the state sequence corresponding to the target audio and the acoustic features of the target audio into an acoustic model of the first language alignment model to determine probabilities of a plurality of states corresponding to each frame of audio in the target audio;
and obtaining the state corresponding to each frame of audio of the target audio based on a preset state transition probability and the probabilities of the plurality of states corresponding to each frame of audio in the target audio.
3. The method of claim 2, further comprising:
extracting acoustic features from a first sample audio, the first sample audio corresponding to the first language;
inputting the acoustic features of the first sample audio into a second language recognition model to obtain a phoneme sequence corresponding to the first sample audio;
and constructing a pronunciation dictionary of the first language based on the phoneme sequence corresponding to the first sample audio and the speech text of the first sample audio.
4. The method of claim 3, further comprising:
training a first language association model based on a plurality of sample texts, wherein the sample texts correspond to the first language;
determining the preset state transition probability based on a pronunciation dictionary of the first language and the first language association model.
5. The method of claim 3, wherein the determining the speech word corresponding to each frame in the target audio based on the state corresponding to each frame of audio comprises:
determining the speech word corresponding to each frame in the target audio based on the pronunciation dictionary of the first language and the state corresponding to each frame of audio.
6. The method of claim 2, further comprising:
extracting acoustic features from a second sample audio, the second sample audio corresponding to the first language;
determining a phoneme sequence corresponding to the second sample audio according to the pronunciation dictionary of the first language and the speech text of the second sample audio;
calling a language model in the first language alignment model to convert the phoneme sequence into a first state sequence;
inputting the acoustic features of the second sample audio to the acoustic model of the first language alignment model to obtain a second state sequence;
and training the acoustic model of the first language alignment model based on the first state sequence and the second state sequence.
7. The method according to claim 3, wherein the acquiring a phoneme sequence corresponding to the target audio comprises:
inputting the acoustic features of the target audio into the second language recognition model to obtain the phoneme sequence corresponding to the target audio.
8. The method according to claim 3, wherein the acquiring a phoneme sequence corresponding to the target audio comprises:
acquiring a speech text of the target audio;
and determining the phoneme sequence corresponding to the target audio based on the speech text of the target audio and the pronunciation dictionary of the first language.
9. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of generating a timestamp according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more computer programs, the one or more computer programs being adapted to be loaded by a processor to perform the method of generating a timestamp according to any one of claims 1 to 8.
CN202210310807.XA 2022-03-28 2022-03-28 Method and device for generating timestamp and computer equipment Pending CN115050351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210310807.XA CN115050351A (en) 2022-03-28 2022-03-28 Method and device for generating timestamp and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210310807.XA CN115050351A (en) 2022-03-28 2022-03-28 Method and device for generating timestamp and computer equipment

Publications (1)

Publication Number Publication Date
CN115050351A (en) 2022-09-13

Family

ID=83157410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210310807.XA Pending CN115050351A (en) 2022-03-28 2022-03-28 Method and device for generating timestamp and computer equipment

Country Status (1)

Country Link
CN (1) CN115050351A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482809A (en) * 2022-09-19 2022-12-16 北京百度网讯科技有限公司 Keyword search method, keyword search device, electronic equipment and storage medium
CN115482809B (en) * 2022-09-19 2023-08-11 北京百度网讯科技有限公司 Keyword retrieval method, keyword retrieval device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US10559299B1 (en) Reconciliation between simulator and speech recognition output using sequence-to-sequence mapping
EP3879525B1 (en) Training model for speech synthesis
CN104143327B (en) A kind of acoustic training model method and apparatus
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
CN111243599B (en) Speech recognition model construction method, device, medium and electronic equipment
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN112580340A (en) Word-by-word lyric generating method and device, storage medium and electronic equipment
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
US11615787B2 (en) Dialogue system and method of controlling the same
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN113393830B (en) Hybrid acoustic model training and lyric timestamp generation method, device and medium
CN114708848A (en) Method and device for acquiring size of audio and video file
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115424616A (en) Audio data screening method, device, equipment and computer readable medium
CN113470612A (en) Music data generation method, device, equipment and storage medium
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
Srun et al. Development of speech recognition system based on cmusphinx for khmer language
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
CN116645957B (en) Music generation method, device, terminal, storage medium and program product
US11804225B1 (en) Dialog management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination