CN114242108A - Information processing method and related equipment - Google Patents

Information processing method and related equipment

Info

Publication number
CN114242108A
CN114242108A (application CN202111562845.6A)
Authority
CN
China
Prior art keywords
human voice
audio
voice
audio signal
timestamp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111562845.6A
Other languages
Chinese (zh)
Inventor
王武城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111562845.6A priority Critical patent/CN114242108A/en
Publication of CN114242108A publication Critical patent/CN114242108A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H — ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 — Details of electrophonic musical instruments
    • G10H1/0008 — Associated control or indicating means
    • G10H1/0025 — Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H — ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 — Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 — Music composition or musical creation; Tools or processes therefor
    • G10H2210/111 — Automatic composing, i.e. using predefined musical rules

Abstract

The embodiment of the application discloses an information processing method and related equipment. The method processes an audio signal and the text content corresponding to the audio signal with an alignment model to obtain a voice timestamp; determines the transition points between human voice and non-human voice in the audio signal with a human voice detection model; and adjusts the voice timestamp according to the transition points between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp. According to the embodiment of the application, the voice timestamp obtained from the alignment model can be locally adjusted by using the transition points, so that a more accurate voice timestamp is obtained.

Description

Information processing method and related equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information processing method and a related device.
Background
Automatic lyric timestamping processes the audio of an input song and the corresponding text content through an alignment model to obtain the start time and the end time, within the audio, of each sung word in the text content. However, in the automatic lyric timestamp obtained from the alignment model, the tail of a vocal segment may be truncated, or the timestamp may include some useless silence. How to obtain a more accurate lyric timestamp is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides an information processing method and related equipment, by which a more accurate voice timestamp can be obtained.
On one hand, the embodiment of the application discloses an information processing method, which comprises the following steps:
processing an audio signal and the corresponding text content by using an alignment model to obtain a voice timestamp, wherein the voice timestamp comprises the start time and the end time, in the audio signal, of each corresponding word in the text content;
determining a transition point between human voice and non-human voice in the audio signal by using a human voice detection model;
and adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp.
In an alternative embodiment, the human voice detection model is obtained by training with a first Mel Frequency Cepstrum Coefficient (MFCC) feature extracted from human voice audio and a second MFCC feature extracted from non-human voice audio.
In an alternative embodiment, the determining, by using a human voice detection model, a transition point between human voice and non-human voice in the audio signal includes:
dividing the audio signal to obtain N frames of audio;
detecting the audio signal according to the human voice detection model to obtain a detection result; the detection result comprises a result that each frame in the N frames of audio belongs to human voice or non-human voice audio;
and determining a transition point between human voice and non-human voice in the audio signal according to the detection result, wherein the transition points between human voice and non-human voice include a transition point from human voice to non-human voice and a transition point from non-human voice to human voice.
In an optional implementation manner, detecting the audio signal according to the human voice detection model to obtain a detection result includes:
for each frame of the N frames of audio, calculating, by using the human voice detection model, the maximum likelihood probability value that the frame is human voice and the maximum likelihood probability value that the frame is non-human voice;
if the maximum likelihood probability value that the frame is human voice is greater than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a human voice frame;
and if the maximum likelihood probability value that the frame is human voice is smaller than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a non-human voice frame.
In an optional implementation manner, the adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp includes:
adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice.
In an optional implementation manner, the adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp includes:
adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice.
In an optional implementation, the processing the audio signal and the corresponding text content by using the alignment model to obtain the voice timestamp includes:
extracting a third MFCC feature from the audio signal;
performing content conversion processing on text content corresponding to the audio signal to obtain a Hidden Markov Model (HMM) state sequence;
inputting the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp.
In an alternative embodiment, said inputting said third MFCC feature and said HMM state sequence into said alignment model to obtain said voice timestamp comprises:
obtaining, by using the alignment model, a probability value of the hidden Markov state corresponding to each frame of the third MFCC feature, and obtaining the voice timestamp according to the probability value of the hidden Markov state corresponding to each frame.
On the other hand, an embodiment of the present application discloses an information processing apparatus, including:
a processing unit, configured to process an audio signal and the corresponding text content to obtain a voice timestamp;
and an adjusting unit, configured to adjust the voice timestamp to obtain an adjusted voice timestamp.
In an alternative embodiment, the processing unit is configured to process the audio signal and the corresponding text content using the alignment model to obtain the voice timestamp.
In an alternative embodiment, the processing unit is further configured to determine a transition point between human voice and non-human voice in the audio signal by using a human voice detection model.
In an optional implementation manner, when adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal, the adjusting unit is specifically configured to obtain an adjusted voice timestamp.
In an alternative embodiment, the human voice detection model used by the processing unit to determine the transition point between human voice and non-human voice in the audio signal is obtained by training with a first Mel-frequency cepstral coefficient (MFCC) feature extracted from human voice audio and a second MFCC feature extracted from non-human voice audio.
In an optional implementation manner, when determining a transition point between human voice and non-human voice in the audio signal by using a human voice detection model, the processing unit is specifically configured to: dividing the audio signal to obtain N frames of audio;
detecting the audio signal according to the human voice detection model to obtain a detection result; the detection result comprises a result that each frame in the N frames of audio belongs to human voice or non-human voice audio;
and determining a transition point between human voice and non-human voice in the audio signal according to the detection result, wherein the transition points between human voice and non-human voice include a transition point from human voice to non-human voice and a transition point from non-human voice to human voice.
In an optional implementation manner, when detecting the audio signal according to the human voice detection model to obtain a detection result, the processing unit is specifically configured to: for each frame of the N frames of audio, calculating, by using the human voice detection model, the maximum likelihood probability value that the frame is human voice and the maximum likelihood probability value that the frame is non-human voice;
if the maximum likelihood probability value that the frame is human voice is greater than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a human voice frame;
and if the maximum likelihood probability value that the frame is human voice is smaller than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a non-human voice frame.
In an optional implementation manner, when adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp, the adjusting unit is specifically configured to: adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice.
In an optional implementation manner, when adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp, the adjusting unit is specifically configured to: adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice.
In an optional implementation manner, when processing the audio signal and the corresponding text content by using the alignment model to obtain the voice timestamp, the processing unit is specifically configured to: extracting a third MFCC feature from the audio signal;
performing content conversion processing on text content corresponding to the audio signal to obtain a Hidden Markov Model (HMM) state sequence;
inputting the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp.
In an optional implementation manner, when inputting the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp, the processing unit is specifically configured to: calculating a probability value of the hidden Markov state corresponding to each frame of the third MFCC feature by using the alignment model; obtaining the HMM state sequence corresponding to the third MFCC feature according to the probability value of the hidden Markov state corresponding to each frame; and performing content conversion on the HMM state sequence to obtain the voice timestamp.
The embodiment of the present application also discloses an information processing apparatus, including:
the information processing method comprises a memory and a processor, wherein an information processing program is stored in the memory, and the information processing program is executed by the processor to execute the information processing method provided by the embodiment of the application.
The embodiment of the application also discloses a computer readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program executes the information processing method.
Accordingly, the present application also discloses a computer program product or a computer program, which includes computer instructions, which are stored in a computer readable storage medium. The processor of the information processing apparatus reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the information processing apparatus executes the information processing method described above.
Therefore, in the information processing method provided by the application, the audio signal and the corresponding text content can be processed by using the alignment model to obtain a voice timestamp, and the human voice detection model is used to determine the transition points between human voice and non-human voice in the audio signal; the voice timestamp is then adjusted according to the transition points between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp. In this way, local fine adjustment can be performed on the basis of the voice timestamp obtained from the alignment model, so that a more accurate voice timestamp can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a lyric timestamp disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of a network architecture disclosed in an embodiment of the present application;
FIG. 3 is a flow chart of an information processing method disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram of a method for determining an alignment model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method for determining a human voice detection model disclosed in an embodiment of the present application;
FIG. 6 is a flow chart illustrating another information processing method disclosed in an embodiment of the present application;
FIG. 7 is a schematic diagram of a division of an audio signal according to an embodiment of the present application;
fig. 8 is a schematic diagram of a transition point between human voice and non-human voice disclosed in an embodiment of the present application;
FIG. 9 is a schematic diagram of an adjustment of a voice timestamp disclosed in an embodiment of the present application;
FIG. 10 is a schematic diagram of another adjustment of a voice timestamp disclosed in an embodiment of the present application;
fig. 11 is a schematic structural diagram of an information processing apparatus disclosed in an embodiment of the present application;
fig. 12 is a schematic structural diagram of an information processing apparatus disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, although the lyric timestamp obtained by processing song audio and the corresponding text content with an alignment model is the globally optimal alignment result, the tail of a vocal segment may still be cut off, or a vocal segment in the lyric timestamp may include some useless silence.
For example, an alignment model is used to process a piece of song audio and its corresponding text content "who am I" (sung in the order "I", "am", "who") to obtain the lyric timestamp shown in fig. 1, where time a1 to time b1 in the lyric timestamp is the audio corresponding to "I" in the text content; time b1 to time c1 is the audio corresponding to a blank area in the text content, that is, a silent part; time c1 to time d1 is the audio corresponding to "am" in the text content; and time d1 to time e1 is the audio corresponding to "who" in the text content. However, the end time of "I" in the song audio is not time b1 but time b2 in fig. 1, so the vocal "I" corresponding to time a1 to time b1 in the lyric timestamp is truncated early; in addition, since the start time of "am" in the song audio is time c2 shown in fig. 1 rather than time c1, the vocal "am" corresponding to time c1 to time d1 in the lyric timestamp includes some useless silence.
Therefore, how to improve the accuracy of the voice timestamp is an urgent problem to be solved.
The embodiment of the application provides an information processing method. In the information processing method, a terminal device processes an input audio signal and the text content corresponding to the audio signal by using an alignment model to obtain a voice timestamp; the terminal device determines the transition points between human voice and non-human voice in the audio signal by using a human voice detection model, and then adjusts the obtained voice timestamp according to the transition points between human voice and non-human voice in the audio signal, thereby obtaining a more accurate voice timestamp.
The voice timestamp includes the start time and the end time, in the audio signal, of each corresponding word in the text content. When the audio signal is a song, the voice timestamp may also be referred to as a lyric timestamp.
Optionally, the information processing method provided in this embodiment of the application may also be executed by a server. The server may obtain the audio signal and the text content corresponding to the audio signal from the terminal device, or the terminal device may report the audio signal and the corresponding text content to the server; the server can then process the audio signal and the corresponding text content to obtain a voice timestamp, determine the transition points between human voice and non-human voice in the audio signal by using a human voice detection model, and adjust the voice timestamp according to the transition points between human voice and non-human voice in the audio signal, thereby obtaining a more accurate voice timestamp.
For example, the information processing method may be applied to the network architecture shown in fig. 2. Referring to fig. 2, fig. 2 is a schematic diagram of a network architecture provided in an embodiment of the present application; the network architecture may include a terminal device 201 and a server 202. The server 202 obtains the audio signal and the text content corresponding to the audio signal from the terminal device 201, and the server 202 may then execute the information processing method according to the embodiment of the present application to obtain a more accurate voice timestamp.
It should be noted that the terminal device may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart in-car speaker, and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a schematic flowchart of an information processing method according to an embodiment of the present application, where the information processing method shown in fig. 3 is described from the perspective of a terminal device, and the method may include, but is not limited to, the following steps:
s301, the terminal device processes the audio signal and the text content corresponding to the audio signal by using the alignment model to obtain a text timestamp.
In this embodiment of the present application, referring to fig. 4, fig. 4 is a schematic diagram of determining an alignment model provided in this embodiment of the present application. As shown in fig. 4, the alignment model may be obtained by iterative convergence on a training data set through the expectation-maximization (EM) algorithm. The alignment model may be a Gaussian mixture model-hidden Markov model (GMM-HMM): the relevant parameters of the alignment model are obtained through iterative convergence on the training data set and then stored, and an initial voice timestamp can be obtained from the trained alignment model.
The training data set input to the EM algorithm may be obtained by extracting Mel-frequency cepstral coefficient (MFCC) features from the audio signal and converting the text content corresponding to the audio signal into an HMM state sequence.
The MFCC feature accurately describes the shape of the vocal tract as reflected in the envelope of the short-time power spectrum of speech; that is, the MFCC feature can accurately characterize the phonemes (phones) that produce the text content corresponding to the audio signal.
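For illustration, the MFCC extraction step might look like the following sketch; the use of librosa, the 16 kHz sample rate, and the 25 ms/10 ms frame configuration are assumptions for the example, not details specified by the patent.

```python
# Illustrative sketch only: librosa and the frame configuration are assumptions,
# the patent does not name a specific library or parameter values.
import librosa

def extract_mfcc(audio_path, n_mfcc=13, frame_length=0.025, hop_length=0.010):
    """Return an (n_frames, n_mfcc) matrix of MFCC features for one audio signal."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_length * sr), hop_length=int(hop_length * sr))
    return mfcc.T  # one row per analysis frame
```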
Optionally, the terminal device may extract an acoustic feature of the audio signal, such as an MFCC feature, and perform conversion processing on the text content corresponding to the audio signal to obtain an HMM state sequence; the terminal device can then input the acoustic feature and the HMM state sequence into the trained alignment model to obtain an initial voice timestamp.
In an optional implementation manner, the processing, by the terminal device, the audio signal and the text content corresponding to the audio signal by using the alignment model to obtain the voice timestamp may include: the terminal device extracts a third MFCC feature from the audio signal; performs content conversion processing on the text content corresponding to the audio signal to obtain an HMM state sequence; and inputs the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp.
Optionally, the content conversion processing performed by the terminal device on the text content corresponding to the audio signal may include: the terminal equipment performs phoneme mapping processing on the text content corresponding to the audio signal to obtain a phoneme sequence; and the terminal equipment carries out conversion processing on the phoneme sequence to obtain an HMM state sequence.
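A minimal sketch of this conversion is shown below; the tiny pronunciation lexicon and the three-states-per-phoneme layout are hypothetical and serve only to illustrate mapping text to phonemes and then expanding the phonemes into an HMM state sequence.

```python
# Hedged sketch: the lexicon entries and the 3-states-per-phoneme layout are
# assumptions used only to illustrate text -> phoneme -> HMM state conversion.
LEXICON = {"I": ["AY"], "am": ["AE", "M"], "who": ["HH", "UW"]}  # hypothetical entries

def text_to_hmm_states(words, states_per_phone=3):
    """Map text to a phoneme sequence, then expand each phoneme into HMM states."""
    phones = [p for w in words for p in LEXICON[w]]
    # e.g. phoneme "AY" -> states "AY_0", "AY_1", "AY_2"
    return [f"{p}_{s}" for p in phones for s in range(states_per_phone)]

print(text_to_hmm_states(["I", "am", "who"]))
```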
In an alternative embodiment, the terminal device inputting the third MFCC feature and the hidden Markov state sequence into the alignment model to obtain the voice timestamp may include: calculating a probability value of the hidden Markov state corresponding to each frame of the third MFCC feature by using the alignment model; obtaining the HMM state sequence corresponding to the third MFCC feature according to the probability value of the hidden Markov state corresponding to each frame; and performing content conversion on the HMM state sequence to obtain the voice timestamp.
For example, the terminal device may input the third MFCC feature into the alignment model, obtain a plurality of probability values of the HMM states corresponding to each frame of the third MFCC feature, and determine, through Viterbi decoding, the HMM state with the maximum probability value among the plurality of probability values; the terminal device can align each frame of the third MFCC feature with its maximum-probability HMM state one by one; the terminal device may then convert the HMM states into phonemes, map the phonemes back to the text content, and obtain the voice timestamp.
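The following sketch illustrates only the last part of this step: turning a per-frame word assignment, as would be produced after Viterbi decoding, into start and end times. The 10 ms frame hop and the example frame counts are assumed values.

```python
# Sketch under assumptions: Viterbi decoding is done elsewhere; only the mapping
# from per-frame word labels to word-level timestamps is shown here.
def states_to_word_timestamps(frame_words, hop_seconds=0.010):
    """frame_words: per-frame word index (one entry per MFCC frame) after alignment.
    Returns (word_index, start_time, end_time) tuples, i.e. the voice timestamp."""
    timestamps, start = [], 0
    for i in range(1, len(frame_words) + 1):
        if i == len(frame_words) or frame_words[i] != frame_words[start]:
            timestamps.append((frame_words[start], start * hop_seconds, i * hop_seconds))
            start = i
    return timestamps

# Frames 0-59 belong to word 0, frames 60-99 to word 1, frames 100-149 to word 2.
print(states_to_word_timestamps([0] * 60 + [1] * 40 + [2] * 50))
```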
S302, the terminal device determines a transition point between human voice and non-human voice in the audio signal by using a human voice detection model.
In this embodiment of the application, the transition points between human voice and non-human voice may include: a point of transition from human voice to non-human voice in the audio signal, a point of transition from non-human voice to human voice in the audio signal, and so on. Human voice may refer to the sound produced by the vibration of human vocal cords, and non-human voice may refer to silence, device noise, recording background noise, song background music, and so on.
In an alternative embodiment, the human voice detection model may be obtained by training using a first MFCC feature, which may be extracted from human voice audio, and a second MFCC feature, which may be extracted from non-human voice audio.
Fig. 5 is a schematic diagram of determining a human voice detection model provided in an embodiment of the present application. As shown in fig. 5, the human voice detection model may be a GMM obtained by fitting, on a training data set, the probability distribution of human voice features and the probability distribution of non-human voice features. The human voice detection model may also be a deep neural network. The relevant parameters of the human voice detection model are obtained through iterative convergence on the training data set and then stored, and the transition points between human voice and non-human voice in the audio signal can be determined with the trained human voice detection model.
Wherein the training data set may be derived from the first MFCC features and the second MFCC features.
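A sketch of such training is given below, assuming scikit-learn Gaussian mixtures; the number of mixture components and the diagonal covariance are illustrative choices, not values taken from the patent.

```python
# Hedged sketch: scikit-learn GaussianMixture is an assumption; the patent only says
# the detection model fits the distributions of vocal and non-vocal MFCC features.
from sklearn.mixture import GaussianMixture

def train_voice_detector(vocal_mfcc, nonvocal_mfcc, n_components=8, seed=0):
    """vocal_mfcc / nonvocal_mfcc: (n_frames, n_mfcc) arrays of first / second MFCC features."""
    gmm_voice = GaussianMixture(n_components=n_components, covariance_type="diag",
                                random_state=seed).fit(vocal_mfcc)
    gmm_other = GaussianMixture(n_components=n_components, covariance_type="diag",
                                random_state=seed).fit(nonvocal_mfcc)
    return gmm_voice, gmm_other
```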
S303, the terminal device adjusts the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp.
In this embodiment of the application, the terminal device adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal may include: adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice to obtain the adjusted voice timestamp; or adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice to obtain the adjusted voice timestamp.
Through this optimization of local boundaries, the information processing method can obtain a more accurate voice timestamp even when the human voice is mixed with background music, when noise exists in the recording, or when pronunciation is inaccurate because of the singer's vocal cords or accent.
Referring to fig. 6, fig. 6 is a schematic flow chart of another information processing method according to an embodiment of the present application, where the information processing method includes, but is not limited to, the following steps:
s601, the terminal device processes the audio signal and the text content corresponding to the audio signal by using the alignment model to obtain the voice time stamp.
In this embodiment of the application, the terminal device processes the audio signal and the text content corresponding to the audio signal by using the alignment model, and the step of obtaining the voice timestamp may refer to the content of the voice timestamp obtained in S301, which is not described herein again.
S602, the terminal equipment divides the audio signal to obtain N frames of audio.
In this embodiment, the terminal device may sample and divide the audio signal with Q as the frame period to obtain N frames of audio. It should be noted that the terminal device may also divide the audio signal in other manners, which is not limited in this application.
For example, referring to fig. 7, fig. 7 is a schematic diagram of dividing an audio signal provided in this embodiment of the present application. As shown in fig. 7, assuming that the terminal device divides a segment of the audio signal corresponding to the text content "who am I" (sung in the order "I", "am", "who"), the audio signal may be divided into four frames: time a1 to time b1 may be divided into a first frame, time b1 to time c1 into a second frame, time c1 to time d1 into a third frame, and time d1 to time e1 into a fourth frame.
S603, the terminal equipment detects the audio signal according to the human voice detection model to obtain a detection result; the detection result comprises a result that each frame in the N frames of audio belongs to human voice or non-human voice audio.
In this embodiment of the application, the terminal device detecting the audio signal according to the human voice detection model to obtain the detection result may include: for each frame of the N frames of audio, calculating, by using the human voice detection model, the maximum likelihood probability value that the frame is human voice and the maximum likelihood probability value that the frame is non-human voice; if the maximum likelihood probability value that the frame is human voice is greater than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a human voice frame; and if the maximum likelihood probability value that the frame is human voice is smaller than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a non-human voice frame.
For example, referring to fig. 7, if the terminal device, using the human voice detection model, calculates that the maximum likelihood probability value of the first frame (time a1 to time b1) being a human voice frame is greater than the maximum likelihood probability value of it being a non-human voice frame, it determines that the first frame is a human voice frame; if the maximum likelihood probability value of the first frame being a human voice frame is calculated to be smaller than the maximum likelihood probability value of it being a non-human voice frame, it determines that the first frame is a non-human voice frame. It should be noted that the terminal device may determine, by the same method, whether each other frame in the audio signal is a human voice frame or a non-human voice frame, which is not limited in this application.
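A corresponding sketch of the per-frame decision, reusing the two Gaussian mixtures from the training sketch above, might look as follows; the comparison of log-likelihoods mirrors the comparison described in S603.

```python
# Sketch only: `gmm_voice` / `gmm_other` come from the hedged training sketch above.
def classify_frames(mfcc_frames, gmm_voice, gmm_other):
    """Return a boolean array: True where a frame is detected as human voice."""
    ll_voice = gmm_voice.score_samples(mfcc_frames)   # log-likelihood per frame
    ll_other = gmm_other.score_samples(mfcc_frames)
    return ll_voice > ll_other
```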
S604, the terminal device determines the transition points between human voice and non-human voice in the audio signal according to the detection result, wherein the transition points between human voice and non-human voice include transition points from human voice to non-human voice and transition points from non-human voice to human voice.
In the embodiment of the application, a transition point from human voice to non-human voice may be a time point in the audio signal at which human voice precedes and non-human voice follows; a transition point from non-human voice to human voice may be a time point in the audio signal at which non-human voice precedes and human voice follows.
For example, referring to fig. 8, fig. 8 is a schematic diagram of a transition point between human voice and non-human voice provided in an embodiment of the present application, and as shown in fig. 8, it is assumed that N frames of audio obtained by dividing the audio signal by the terminal device are: the first frame is time a1 to time b1, the second frame is time b1 to time c1, the third frame is time c1 to time d1, the fourth frame is time d1 to time e1, and so on.
Time a1 to time b1 may be the audio corresponding to "I" in the text content; time b1 to time c1 may be the audio corresponding to a blank area (i.e., a silent section) in the text content, so time b1 may initially be regarded as the transition point from human voice to non-human voice; time c1 to time d1 may be the audio corresponding to "am" in the text content, so time c1 may initially be regarded as the transition point from non-human voice to human voice; and time d1 to time e1 may be the audio corresponding to "who" in the text content. The terminal device detects the audio signal according to the human voice detection model, and the obtained detection result includes: the transition between the first frame and the second frame occurs at time b2, that is, the transition point from human voice (the audio corresponding to "I") to non-human voice (the audio of the silent section) is at time b2; the transition between the second frame and the third frame occurs at time c2, that is, the transition point from non-human voice (the audio of the silent section) to human voice (the audio corresponding to "am") is at time c2.
Therefore, the terminal device can determine the transition points between human voice and non-human voice in the audio signal according to the detection result of the human voice detection model on the audio signal.
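A minimal sketch of deriving the transition points from the per-frame detection result could look as follows; the frame period (the "Q" of step S602) is assumed to be 10 ms here.

```python
# Minimal sketch: converts the per-frame detection result of S603 into transition
# points; hop_seconds is an assumed frame period, not a value from the patent.
def find_transition_points(is_voice, hop_seconds=0.010):
    """Return (time, kind) pairs, kind is 'voice->non-voice' or 'non-voice->voice'."""
    points = []
    for i in range(1, len(is_voice)):
        if is_voice[i] != is_voice[i - 1]:
            kind = "voice->non-voice" if is_voice[i - 1] else "non-voice->voice"
            points.append((i * hop_seconds, kind))
    return points

print(find_transition_points([True, True, False, False, True]))
# [(0.02, 'voice->non-voice'), (0.04, 'non-voice->voice')]
```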
The terminal device can detect each frame of audio in the audio signal by using the trained human voice detection model and determine the transition points between human voice and non-human voice in the audio signal, and can then perform step S605 of adjusting the voice timestamp according to the transition points between human voice and non-human voice in the audio signal to obtain the adjusted voice timestamp.
S605, the terminal device adjusts the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice to obtain the adjusted voice timestamp; and/or the terminal device adjusts the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice to obtain the adjusted voice timestamp.
In the embodiment of the application, the terminal device adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp, and/or adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp, may include: the terminal device moves the end time of the human voice and/or the start time of the non-human voice in the voice timestamp to the transition point from human voice to non-human voice; and/or the terminal device moves the end time of the non-human voice and/or the start time of the human voice in the voice timestamp to the transition point from non-human voice to human voice.
For example, referring to fig. 9, fig. 9 is a schematic diagram of an adjustment of a voice timestamp provided in an embodiment of the present application. As shown in fig. 9, it is assumed that the N frames of audio obtained by dividing the audio signal by the terminal device are: the first frame is time a1 to time b1, the second frame is time b1 to time c1, the third frame is time c1 to time d1, the fourth frame is time d1 to time e1, and so on.
Time a1 to time b1 may be the audio corresponding to "I" in the text content; time b1 to time c1 may be the audio corresponding to a blank area (i.e., a silent section) in the text content; time c1 to time d1 may be the audio corresponding to "am" in the text content; and time d1 to time e1 may be the audio corresponding to "who" in the text content. The transition point from human voice (the audio corresponding to "I") to non-human voice (the audio of the silent section) is at time b2, so the terminal device may adjust time b1 to time b2, obtaining the adjusted voice timestamp.
For example, referring to fig. 10, fig. 10 is a schematic diagram of another adjustment of a voice timestamp according to an embodiment of the present application. As shown in fig. 10, it is assumed that the N frames of audio obtained by dividing the audio signal by the terminal device are: the first frame is time a1 to time b1, the second frame is time b1 to time c1, the third frame is time c1 to time d1, the fourth frame is time d1 to time e1, and so on.
Time a1 to time b1 may be the audio corresponding to "I" in the text content; time b1 to time c1 may be the audio corresponding to a blank area (i.e., a silent section) in the text content; time c1 to time d1 may be the audio corresponding to "am" in the text content; and time d1 to time e1 may be the audio corresponding to "who" in the text content. The transition point from non-human voice (the audio of the silent section) to human voice (the audio corresponding to "am") is at time c2, so the terminal device may adjust time c1 to time c2, obtaining the adjusted voice timestamp.
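The local adjustment can be pictured with the following sketch, which snaps each boundary in the voice timestamp to the nearest detected transition point within a tolerance window; the tolerance value and the example times are assumptions made for the illustration.

```python
# Hedged sketch of the local adjustment in S605: boundaries are moved to the nearest
# detected transition point if one lies within an assumed tolerance window.
def adjust_timestamps(word_times, transition_times, tolerance=0.3):
    """word_times: list of (word, start, end); transition_times: detected points in seconds."""
    def snap(t):
        candidates = [p for p in transition_times if abs(p - t) <= tolerance]
        return min(candidates, key=lambda p: abs(p - t)) if candidates else t
    return [(w, snap(s), snap(e)) for (w, s, e) in word_times]

# Hypothetical numbers: "I" originally ends at b1 = 1.20 s, while the detected
# voice->non-voice transition point b2 is at 1.35 s.
print(adjust_timestamps([("I", 0.50, 1.20)], [1.35]))  # [('I', 0.5, 1.35)]
```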
In this application, the transition points between human voice and non-human voice in the audio signal are determined through the human voice detection model and are used to judge whether the corresponding boundaries in the voice timestamp are accurate; if there is an error, the voice timestamp is adjusted, thereby improving the accuracy of the voice timestamp.
Based on the above method embodiment, the embodiment of the present application further provides an information processing apparatus. Fig. 11 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application. The information processing apparatus 1000 shown in fig. 11 may include: a processing unit 1002, configured to process an audio signal and the corresponding text content;
an adjusting unit 1003, configured to adjust the voice timestamp.
In an alternative embodiment, the processing unit 1002 is configured to process the audio signal and the corresponding text content to obtain a voice timestamp, where the voice timestamp includes the start time and the end time, in the audio signal, of each corresponding word in the text content.
In an alternative embodiment, the processing unit 1002 is further configured to determine a transition point between human voice and non-human voice in the audio signal by using a human voice detection model.
In an optional implementation manner, the adjusting unit 1003 is configured to adjust the voice timestamp according to a transition point between human voice and non-human voice in the audio signal, so as to obtain an adjusted voice timestamp.
In an optional implementation manner, the human voice detection model used by the processing unit 1002 to determine the transition point between human voice and non-human voice in the audio signal is obtained by training with a first Mel-frequency cepstral coefficient (MFCC) feature extracted from human voice audio and a second MFCC feature extracted from non-human voice audio.
In an alternative embodiment, the processing unit 1002, when determining the transition point between the human voice and the non-human voice in the audio signal by using the human voice detection model, is specifically configured to:
dividing the audio signal to obtain N frames of audio;
detecting the audio signal according to the human voice detection model to obtain a detection result; the detection result comprises a result that each frame in the N frames of audio belongs to human voice or non-human voice audio;
and determining a transition point between human voice and non-human voice in the audio signal according to the detection result, wherein the transition points between human voice and non-human voice include a transition point from human voice to non-human voice and a transition point from non-human voice to human voice.
In an optional implementation manner, when detecting the audio signal according to the human voice detection model to obtain a detection result, the processing unit 1002 is specifically configured to: for each frame of the N frames of audio, calculating, by using the human voice detection model, the maximum likelihood probability value that the frame is human voice and the maximum likelihood probability value that the frame is non-human voice;
if the maximum likelihood probability value that the frame is human voice is greater than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a human voice frame;
and if the maximum likelihood probability value that the frame is human voice is smaller than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a non-human voice frame.
In an optional implementation manner, when adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp, the adjusting unit 1003 is specifically configured to: adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice;
and adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice.
In an optional implementation manner, when processing the audio signal and the corresponding text content by using the alignment model to obtain the voice timestamp, the processing unit 1002 is specifically configured to: extracting a third MFCC feature from the audio signal;
performing phoneme mapping on text content corresponding to the audio signal to obtain a phoneme sequence corresponding to the text content, and obtaining a hidden Markov state sequence according to the phoneme sequence;
inputting the third MFCC feature and the hidden Markov state sequence into the alignment model to obtain the voice timestamp.
In an optional implementation manner, when inputting the third MFCC feature and the hidden Markov state sequence into the alignment model to obtain the voice timestamp, the processing unit 1002 is specifically configured to: obtaining the probability value of the hidden Markov state corresponding to each frame of the third MFCC feature by using the alignment model, and obtaining the voice timestamp according to the probability value of the hidden Markov state corresponding to each frame.
According to an embodiment of the present application, the steps involved in the information processing methods shown in fig. 3 and fig. 6 may be performed by the units of the information processing apparatus shown in fig. 11. For example, steps S301 and S302 of the information processing method shown in fig. 3 may be performed by the processing unit 1002 of the information processing apparatus shown in fig. 11, and step S303 may be performed by the adjusting unit 1003; steps S601 to S604 of the information processing method shown in fig. 6 may be performed by the processing unit 1002, and step S605 may be performed by the adjusting unit 1003.
According to the embodiment of the present application, the units in the information processing apparatus shown in fig. 11 may be respectively or entirely combined into one or several other units to form the unit, or some unit(s) may be further split into multiple units with smaller functions to form the unit(s), which may achieve the same operation without affecting the achievement of the technical effect of the embodiment of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the information processing apparatus may also include other units, and in practical applications, these functions may also be implemented by being assisted by other units, and may be implemented by cooperation of a plurality of units.
According to the embodiment of the present application, the information processing apparatus shown in fig. 11 may be constructed by running, on a general-purpose computing device such as a computer that includes a processing element such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM), a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 3 and fig. 6, thereby realizing the information processing method of the embodiment of the present application. The computer program may, for example, be recorded on a computer-readable storage medium and loaded into and executed by the above computing device via the computer-readable storage medium.
In this embodiment of the application, the processing unit 1002 processes the input audio signal and the corresponding text content to obtain a voice timestamp, and the adjusting unit 1003 adjusts the voice timestamp. With this method, the timestamp of each word can be finely adjusted on the basis of the globally optimal alignment result, so that a more accurate voice timestamp is obtained.
Based on the method and the device embodiment, the embodiment of the application provides an information processing device. Fig. 12 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application. The information processing apparatus 1100 shown in fig. 12 includes at least a processor 1101, an input interface 1102, an output interface 1103, a computer storage medium 1104, and a memory 1105. The processor 1101, the input interface 1102, the output interface 1103, the computer storage medium 1104, and the memory 1105 may be connected by a bus or other means.
The computer storage medium 1104 may be located in the memory 1105 of the information processing apparatus 1100 and is used to store a computer program comprising program instructions; the processor 1101 is used to execute the program instructions stored in the computer storage medium 1104. The processor 1101 (or CPU) is the computing core and control core of the information processing apparatus 1100 and is adapted to implement one or more instructions, in particular to load and execute one or more computer instructions to implement the corresponding method flow or function.
An embodiment of the present application also provides a computer storage medium (Memory), which is a Memory device in the information processing device 1100 and is used to store programs and data. It is to be understood that the computer storage medium herein may include both a built-in storage medium in the information processing apparatus 1100 and, of course, an extended storage medium supported by the information processing apparatus 1100. The computer storage medium provides a storage space that stores an operating system of the information processing apparatus 1100. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 1101. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, the computer storage medium may be loaded with one or more instructions by processor 1101 and executed to implement the steps of the information processing method described above with respect to fig. 3 and 6. In particular implementations, one or more instructions in the computer storage medium are loaded by processor 1101 and perform the following steps:
processing an audio signal and the corresponding text content by using an alignment model to obtain a voice timestamp, wherein the voice timestamp comprises the start time and the end time, in the audio signal, of each corresponding word in the text content;
determining a transition point between human voice and non-human voice in the audio signal by using a human voice detection model;
and adjusting the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp.
In one possible implementation, the processor 1101 processes an audio signal and the corresponding text content using an alignment model to obtain a voice timestamp, where the voice timestamp includes the start time and the end time, in the audio signal, of each corresponding word in the text content;
determines a transition point between human voice and non-human voice in the audio signal by using a human voice detection model;
and adjusts the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp.
In one possible implementation, the human voice detection model used by the processor 1101 to determine the transition point between human voice and non-human voice in the audio signal is obtained by training with a first Mel-frequency cepstral coefficient (MFCC) feature and a second MFCC feature, wherein the first MFCC feature is extracted from human voice audio and the second MFCC feature is extracted from non-human voice audio.
In one possible implementation, the processor 1101 determines a transition point between human voice and non-human voice in the audio signal by using a human voice detection model, including:
dividing the audio signal to obtain N frames of audio;
detecting the audio signal according to the human voice detection model to obtain a detection result; the detection result comprises a result that each frame in the N frames of audio belongs to human voice or non-human voice audio;
and determining a transition point between human voice and non-human voice in the audio signal according to the detection result, wherein the transition points between human voice and non-human voice include a transition point from human voice to non-human voice and a transition point from non-human voice to human voice.
In a possible implementation manner, the processor 1101 performs detection on the audio signal according to the human voice detection model to obtain a detection result, including:
for each frame of the N frames of audio, calculating, by using the human voice detection model, the maximum likelihood probability value that the frame is human voice and the maximum likelihood probability value that the frame is non-human voice;
if the maximum likelihood probability value that the frame is human voice is greater than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a human voice frame;
and if the maximum likelihood probability value that the frame is human voice is smaller than the maximum likelihood probability value that the frame is non-human voice, determining that the frame is a non-human voice frame.
In one possible implementation manner, the processor 1101 adjusts the voice timestamp according to a transition point between human voice and non-human voice in the audio signal to obtain an adjusted voice timestamp, including:
adjusting the end time of the human voice and/or the start time of the non-human voice in the voice timestamp according to a transition point from human voice to non-human voice;
and adjusting the end time of the non-human voice and/or the start time of the human voice in the voice timestamp according to a transition point from non-human voice to human voice.
In one possible implementation, the processor 1101 processes the audio signal and the corresponding text content by using an alignment model to obtain a voice timestamp, including:
extracting a third MFCC feature from the audio signal;
performing phoneme mapping on text content corresponding to the audio signal to obtain a phoneme sequence corresponding to the text content, and obtaining a hidden Markov state sequence according to the phoneme sequence;
inputting the third MFCC features and the hidden Markov state sequence into the alignment model to obtain the phonetic time stamp.
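The phoneme mapping and the construction of the hidden Markov state sequence are not tied to a particular lexicon or toolkit in this application. The sketch below assumes a hypothetical pronunciation lexicon and the common convention of three emitting HMM states per phoneme; a production system would typically rely on a forced-alignment toolkit such as Kaldi.

    # Sketch of building the hidden Markov state sequence from text content.
    # The lexicon entries and the 3-states-per-phoneme layout are assumptions.
    LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}  # hypothetical

    def text_to_state_sequence(text, states_per_phone=3):
        phonemes = [p for word in text.lower().split() for p in LEXICON[word]]
        # each phoneme expands to a run of HMM states, e.g. HH_0, HH_1, HH_2
        return [f"{p}_{s}" for p in phonemes for s in range(states_per_phone)]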
In one possible implementation, the processor 1101 inputs the third MFCC feature and the hidden Markov state sequence into the alignment model to obtain the voice timestamp as follows:
identifying, with the alignment model, the probability value of the hidden Markov state corresponding to each frame in the third MFCC feature, and obtaining the voice timestamp according to the probability value of the hidden Markov state corresponding to each frame.
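After the alignment model has assigned the most probable hidden Markov state, and therefore the owning word, to every frame, the voice timestamp follows from the frame indices at which the word changes. A minimal sketch, assuming a per-frame word list produced by a Viterbi-style alignment and a 10 ms frame hop:

    # Sketch of turning per-frame state decisions into a word-level voice timestamp.
    # frame_states: list mapping each frame to the word owning its most probable
    # HMM state, or None for non-speech frames (assumed to come from alignment).
    def states_to_timestamps(frame_states, hop_s=0.01):
        timestamps, current, start = [], None, 0
        for i, word in enumerate(frame_states + [None]):   # sentinel flushes the last word
            if word != current:
                if current is not None:
                    timestamps.append((current, start * hop_s, i * hop_s))
                current, start = word, i
        return timestamps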
In the embodiments of the present application, the processor 1101 processes the audio signal and the corresponding text content by using the alignment model to obtain the voice timestamp, determines the transition point between the human voice and the non-human voice in the audio signal by using the human voice detection model, and adjusts the voice timestamp according to the transition point between the human voice and the non-human voice in the audio signal to obtain the adjusted voice timestamp. With this information processing manner, the voice timestamp can be adjusted to obtain an adjusted, and therefore more accurate, voice timestamp.
Embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor 1101 reads the computer instructions from the computer-readable storage medium and executes them, so that the information processing apparatus 1100 performs the information processing method shown in fig. 3 and fig. 6.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; the division of the above-described modules is merely a logical division, and other divisions may be used in practice. For example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An information processing method, the method comprising:
processing an audio signal and text content corresponding to the audio signal by using an alignment model to obtain a voice timestamp, wherein the voice timestamp comprises a starting time and an ending time of each word in the text content corresponding to the audio signal;
determining a transition point between human voice and non-human voice in the audio signal by using a human voice detection model;
and adjusting the voice timestamp according to the transition point between the human voice and the non-human voice in the audio signal to obtain an adjusted voice timestamp.
2. The method of claim 1, wherein the human voice detection model is obtained by training with a first Mel Frequency Cepstrum Coefficient (MFCC) feature extracted from human voice audio and a second MFCC feature extracted from non-human voice audio.
3. The method of claim 1, wherein the determining a transition point between human voice and non-human voice in the audio signal by using a human voice detection model comprises:
dividing the audio signal to obtain N frames of audio;
detecting the audio signal according to the human voice detection model to obtain a detection result, wherein the detection result comprises a result that each frame of audio in the N frames of audio belongs to human voice audio or non-human voice audio;
and determining the transition point between human voice and non-human voice in the audio signal according to the detection result, wherein the transition point between human voice and non-human voice comprises a transition point from human voice to non-human voice and a transition point from non-human voice to human voice.
4. The method according to claim 3, wherein the detecting the audio signal according to the human voice detection model to obtain a detection result comprises:
for each frame of audio in the N frames of audio, calculating, with the human voice detection model, the maximum likelihood probability value that the frame of audio is human voice and the maximum likelihood probability value that the frame of audio is non-human voice;
if the maximum likelihood probability value that the frame of audio is human voice is greater than the maximum likelihood probability value that the frame of audio is non-human voice, determining that the frame of audio is a human voice frame;
and if the maximum likelihood probability value that the frame of audio is human voice is smaller than the maximum likelihood probability value that the frame of audio is non-human voice, determining that the frame of audio is a non-human voice frame.
5. The method according to claim 3 or 4, wherein the adjusting the voice timestamp according to the transition point between the human voice and the non-human voice in the audio signal to obtain an adjusted voice timestamp comprises:
adjusting the ending time of the human voice and/or the starting time of the non-human voice in the voice timestamp according to the transition point from human voice to non-human voice, so as to obtain the adjusted voice timestamp.
6. The method according to claim 3 or 4, wherein the adjusting the voice timestamp according to the transition point between the human voice and the non-human voice in the audio signal to obtain an adjusted voice timestamp further comprises:
adjusting the ending time of the non-human voice and/or the starting time of the human voice in the voice timestamp according to the transition point from non-human voice to human voice, so as to obtain the adjusted voice timestamp.
7. The method of claim 1, wherein the processing an audio signal and text content corresponding to the audio signal by using an alignment model to obtain a voice timestamp comprises:
extracting a third MFCC feature from the audio signal;
performing content conversion processing on the text content corresponding to the audio signal to obtain a Hidden Markov Model (HMM) state sequence;
and inputting the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp.
8. The method of claim 7, wherein the inputting the third MFCC feature and the HMM state sequence into the alignment model to obtain the voice timestamp comprises:
calculating, with the alignment model, a probability value of the hidden Markov state corresponding to each frame feature in the third MFCC feature;
obtaining the HMM state sequence corresponding to the third MFCC feature according to the probability value of the hidden Markov state corresponding to each frame feature;
and performing content conversion on the HMM state sequence to obtain the voice timestamp.
9. An information processing apparatus, characterized by comprising:
a memory and a processor, wherein the memory stores an information processing program which, when executed by the processor, implements the steps of the information processing method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, implements the steps of the information processing method according to any one of claims 1 to 8.
CN202111562845.6A 2021-12-20 2021-12-20 Information processing method and related equipment Pending CN114242108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111562845.6A CN114242108A (en) 2021-12-20 2021-12-20 Information processing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111562845.6A CN114242108A (en) 2021-12-20 2021-12-20 Information processing method and related equipment

Publications (1)

Publication Number Publication Date
CN114242108A (en) 2022-03-25

Family

ID=80759279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111562845.6A Pending CN114242108A (en) 2021-12-20 2021-12-20 Information processing method and related equipment

Country Status (1)

Country Link
CN (1) CN114242108A (en)

Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN110211565B (en) Dialect identification method and device and computer readable storage medium
EP2700071B1 (en) Speech recognition using multiple language models
US8818813B2 (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
US10147418B2 (en) System and method of automated evaluation of transcription quality
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
CN110706690A (en) Speech recognition method and device
US11450313B2 (en) Determining phonetic relationships
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
US7921014B2 (en) System and method for supporting text-to-speech
US20110218805A1 (en) Spoken term detection apparatus, method, program, and storage medium
Loscos et al. Low-delay singing voice alignment to text
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
CN112580340A (en) Word-by-word lyric generating method and device, storage medium and electronic equipment
JP2019008120A (en) Voice quality conversion system, voice quality conversion method and voice quality conversion program
Martinčić-Ipšić et al. Croatian large vocabulary automatic speech recognition
CN114242108A (en) Information processing method and related equipment
US11735178B1 (en) Speech-processing system
CN107924677B (en) System and method for outlier identification to remove poor alignment in speech synthesis
Slaney et al. Pitch-gesture modeling using subband autocorrelation change detection.
CN114512121A (en) Speech synthesis method, model training method and device
KR101890303B1 (en) Method and apparatus for generating singing voice
US8024191B2 (en) System and method of word lattice augmentation using a pre/post vocalic consonant distinction
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
Ishaq Voice activity detection and garbage modelling for a mobile automatic speech recognition application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination