WO2021227308A1 - Method and device for generating video resources - Google Patents

Method and device for generating video resources

Info

Publication number
WO2021227308A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
audio
feature
lip
phoneme
Prior art date
Application number
PCT/CN2020/112683
Other languages
English (en)
French (fr)
Inventor
柳毅恒
何文峰
王胜慧
Original Assignee
完美世界(北京)软件科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 完美世界(北京)软件科技发展有限公司 filed Critical 完美世界(北京)软件科技发展有限公司
Publication of WO2021227308A1 publication Critical patent/WO2021227308A1/zh

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages

Definitions

  • The present disclosure relates to the computer field, and in particular to a method and device for generating video resources.
  • In the related art, generating video resources (such as lip-sync animation) from audio resources requires first obtaining the subtitle file corresponding to the audio resource and then using the audio resource together with the subtitle file to generate the lip animation; moreover, different configuration files must be provided for audio in different languages.
  • The process of this way of generating video resources is therefore relatively complicated, and the operation is relatively cumbersome.
  • The present disclosure provides a method and device for generating video resources, so as to at least solve the technical problem in the related art that the process of generating video resources is relatively complex.
  • Embodiments of the present disclosure provide a method for generating video resources, including: extracting a target audio feature corresponding to a target audio frame in an audio resource; determining a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes; fusing a plurality of lip-shape models corresponding to the plurality of phonemes according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature; and using the target lip-shape model to generate a video resource corresponding to the audio resource.
  • the embodiments of the present disclosure also provide a device for generating video resources, including:
  • the first extraction module is configured to extract the target audio feature corresponding to the target audio frame in the audio resource
  • the first determining module is configured to determine a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes;
  • a fusion module configured to fuse multiple lip models corresponding to the multiple phonemes according to the target lip features to obtain a target lip model corresponding to the target audio features
  • the generating module is configured to use the target lip model to generate the video resource corresponding to the audio resource.
  • an embodiment of the present disclosure also provides a storage medium, the storage medium includes a stored program, and the above-mentioned method is executed when the program is running.
  • the embodiments of the present disclosure also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor executes the above-mentioned method through the computer program.
  • The beneficial effects of the present disclosure include at least the following. A target audio feature corresponding to a target audio frame in an audio resource is extracted; a target lip-shape feature corresponding to the target audio feature is determined, where the target lip-shape feature indicates the probability that the target audio feature belongs to each phoneme of a plurality of phonemes; the plurality of lip-shape models corresponding to the plurality of phonemes are fused according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature; and the target lip-shape model is used to generate the video resource corresponding to the audio resource. By determining the target lip-shape feature corresponding to the target audio feature extracted from the audio resource, the probability that the target audio feature belongs to each of the plurality of phonemes is obtained; these phonemes correspond to a plurality of lip-shape models, which are fused according to those probabilities to obtain the target lip-shape model, and the target lip-shape model is then used to generate the video resource corresponding to the audio resource. This achieves the purpose of generating video resources directly from audio resources, thereby reducing the complexity of the video-resource generation process and solving the technical problem in the related art that this process is relatively complex.
  • Fig. 1 is a schematic diagram of a hardware environment of a method for generating video resources according to an embodiment of the present disclosure;
  • Fig. 2 is a flowchart of an optional method for generating video resources according to an embodiment of the present disclosure;
  • Fig. 3 is a schematic diagram of an optional framing process according to an embodiment of the present disclosure;
  • Fig. 4 is a schematic diagram of an optional feature detection model according to an optional embodiment of the present disclosure;
  • Fig. 5 is a schematic diagram of configuration parameters of an optional feature detection model according to an optional embodiment of the present disclosure;
  • Fig. 6 is a schematic diagram of an optional lip-shape model corresponding to a phoneme group according to an embodiment of the present disclosure;
  • Fig. 7 is a schematic diagram of an optional device for generating video resources according to an embodiment of the present disclosure;
  • Fig. 8 is a structural block diagram of a terminal according to an embodiment of the present disclosure.
  • an embodiment of a method for generating video resources is provided.
  • the foregoing method for generating video resources may be applied to the hardware environment formed by the terminal 101 and the server 103 as shown in FIG. 1.
  • the server 103 is connected to the terminal 101 through the network, and can be used to provide services (such as game services, application services, etc.) for the terminal or the client installed on the terminal.
  • the database can be set on the server or independently of the server. It is used to provide data storage services for the server 103.
  • the above-mentioned network includes but is not limited to: a wide area network, a metropolitan area network or a local area network, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, and the like.
  • the method for generating video resources in the embodiments of the present disclosure may be executed by the server 103, may also be executed by the terminal 101, or may be executed jointly by the server 103 and the terminal 101.
  • The method for generating video resources of the embodiments of the present disclosure that is executed by the terminal 101 may also be executed by a client installed on the terminal.
  • Fig. 2 is a flowchart of an optional method for generating video resources according to an embodiment of the present disclosure. As shown in Fig. 2, the method may include the following steps:
  • Step S202: Extract a target audio feature corresponding to a target audio frame in an audio resource;
  • Step S204: Determine a target lip-shape feature corresponding to the target audio feature, where the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes;
  • Step S206: Fuse a plurality of lip-shape models corresponding to the plurality of phonemes according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature;
  • Step S208: Use the target lip-shape model to generate a video resource corresponding to the audio resource.
  • Through the above steps S202 to S208, the probability that the target audio feature belongs to each of the multiple phonemes is obtained by determining the target lip-shape feature corresponding to the target audio feature extracted from the audio resource. The multiple phonemes correspond to multiple lip-shape models; according to the probability that the target audio feature belongs to each phoneme, these models are fused to obtain the target lip-shape model corresponding to the target audio feature, and the target lip-shape model is then used to generate the video resource corresponding to the audio resource. This achieves the purpose of generating video resources directly from audio resources, thereby reducing the complexity of the video-resource generation process and solving the technical problem in the related art that this process is relatively complex.
  • The aforementioned audio resources may include, but are not limited to: recording files, music files, dubbing files, live audio streams, voice-call audio, and so on.
  • the target audio frame may be, but is not limited to, each audio frame in the audio resource or a key frame in the audio resource.
  • The target audio feature may use, but is not limited to, one of or a combination of several of: Mel-frequency cepstral coefficients, the first-order difference coefficients of the Mel-frequency cepstral coefficients, and the second-order difference coefficients of the Mel-frequency cepstral coefficients.
  • the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of the plurality of phonemes.
  • For example, the dimension of the target lip-shape feature may be, but is not limited to, the number of the multiple phonemes, and the target lip-shape feature may be, but is not limited to, multiple probability values arranged in the order of the multiple phonemes.
  • The multiple phonemes may be, but are not limited to, phonemes extracted according to any standard. For example, 40 general phonemes may be extracted with reference to the International Phonetic Alphabet, including: IH, AE, AA, AO, EY, AY, AW, L, IY, EH, M, B, P, AH, UH, W, UW, OY, OW, Z, S, CH, ZH, SH, ER, R, Y, N, NG, J, DH, D, G, T, K, TH, HH, F, V and silent.
  • the elements in the target lip-shape feature may, but are not limited to, be arranged in the aforementioned order or other preset order.
  • Alternatively, no order may be set for the per-phoneme probabilities, and the target lip-shape feature is expressed in the form of key-value pairs, with the phoneme as the key and the probability as the value.
  • A corresponding lip-shape model may be created for each phoneme of the multiple phonemes to display the viseme of that phoneme. It is also possible to divide the multiple phonemes into N phoneme groups according to the mouth shape used when they are pronounced: each phoneme group includes one phoneme or several phonemes with similar visemes, one lip-shape model is created for each phoneme group, and all phonemes included in a group correspond to that one lip-shape model.
  • The probability, indicated by the target lip-shape feature, that the target audio feature belongs to each of the multiple phonemes may be used, but is not limited to being used, as the fusion weight with which the multiple lip-shape models corresponding to the multiple phonemes are fused, so as to obtain the target lip-shape model corresponding to the target audio feature.
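As an illustration of this weighted fusion, here is a minimal Python sketch that blends per-phoneme lip-shape models using the probabilities from the target lip-shape feature as weights. The representation of a lip-shape model as a vector of blendshape weights, and the function name `fuse_lip_models`, are assumptions of the sketch rather than details given by the disclosure.

```python
import numpy as np

def fuse_lip_models(phoneme_probs: dict[str, float],
                    lip_models: dict[str, np.ndarray]) -> np.ndarray:
    """Blend per-phoneme lip-shape models into one target lip-shape model,
    using each phoneme's probability as its fusion weight.

    A lip-shape model is represented here as a vector of blendshape weights
    (or vertex offsets); this representation is an assumption of the sketch.
    """
    total = sum(phoneme_probs.values()) or 1.0
    fused = np.zeros_like(next(iter(lip_models.values())), dtype=np.float64)
    for phoneme, prob in phoneme_probs.items():
        fused += (prob / total) * lip_models[phoneme]
    return fused

# Example with two toy "models": 0.7 * AA-shape + 0.3 * M-shape.
models = {"AA": np.array([1.0, 0.0]), "M": np.array([0.0, 1.0])}
print(fuse_lip_models({"AA": 0.7, "M": 0.3}, models))   # -> [0.7 0.3]
```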
  • The obtained target lip-shape models are spliced together in a certain order to obtain the video resource corresponding to the audio resource.
  • As an optional embodiment, extracting the target audio feature corresponding to the target audio frame from the audio resource includes:
  • S11 Frame the audio resource to obtain multiple audio frames;
  • S12 Extract the audio feature corresponding to each audio frame of the multiple audio frames;
  • S13 Generate the target audio feature according to the audio feature corresponding to each audio frame.
  • the manner of framing the audio resource may be, but is not limited to, dividing the audio resource into audio segments with a preset frame length as audio frames according to a preset frame shift.
  • Figure 3 is a schematic diagram of an optional framing process according to an embodiment of the present disclosure.
  • As shown in Fig. 3, the audio is framed with a frame length of 20 ms and a frame shift of 10 ms: the first frame covers 0 ms to 20 ms, the second frame covers 10 ms to 30 ms, and so on, until the audio resource is divided into multiple audio frames.
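Purely for illustration, a minimal NumPy sketch of this framing step is shown below; the function name and the handling of the final partial frame are assumptions, not part of the disclosure.

```python
import numpy as np

def frame_audio(samples: np.ndarray, sample_rate: int,
                frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a mono waveform into overlapping frames.

    With the 20 ms / 10 ms example from the disclosure, frame 0 covers
    0-20 ms, frame 1 covers 10-30 ms, and so on.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift_len = int(sample_rate * shift_ms / 1000)   # samples per shift
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift_len)
    frames = np.stack([samples[i * shift_len: i * shift_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# Example: 1 s of silence at 16 kHz -> 99 frames of 320 samples each.
if __name__ == "__main__":
    audio = np.zeros(16000, dtype=np.float32)
    print(frame_audio(audio, 16000).shape)
```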
  • As an optional embodiment, extracting the audio feature corresponding to each audio frame of the multiple audio frames includes:
  • S21 Calculate the Mel-spectrum cepstral coefficients of each audio frame;
  • S22 Obtain the first-order difference coefficients and the second-order difference coefficients of the Mel-spectrum cepstral coefficients;
  • S23 Determine the Mel-spectrum cepstral coefficients, the first-order difference coefficients and the second-order difference coefficients as the audio feature corresponding to each audio frame.
  • Optionally, in order to better reflect the dynamic characteristics of the audio resource, after the Mel-spectrum cepstral coefficients of each audio frame are calculated, the first-order difference coefficients and second-order difference coefficients of those coefficients may also be obtained; the difference coefficients describe the dynamic characteristics of the audio resource, that is, how the acoustic features change between adjacent frames. The Mel-spectrum cepstral coefficients, the first-order difference coefficients and the second-order difference coefficients are jointly determined as the audio feature corresponding to each audio frame.
  • For example, after the framing above, 13 Mel-frequency cepstral coefficients are computed for each audio frame, and their first-order and second-order difference coefficients are then obtained, giving a 13x3-dimensional Mel-frequency cepstral feature for each audio frame that describes the spectral envelope of the current frame and the change information of its acoustic features.
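A hedged sketch of this feature extraction, assuming the librosa library is acceptable for computing MFCCs and their difference coefficients, might look as follows; the parameter choices mirror the 20 ms / 10 ms framing example above.

```python
import librosa
import numpy as np

def mfcc_features(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return a (n_frames, 39) matrix: 13 MFCCs plus their first- and
    second-order difference coefficients for each audio frame."""
    hop = int(sample_rate * 0.010)   # 10 ms frame shift
    win = int(sample_rate * 0.020)   # 20 ms frame length
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=13,
                                n_fft=win, hop_length=hop, win_length=win)
    d1 = librosa.feature.delta(mfcc, order=1)    # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)    # second-order differences
    feats = np.concatenate([mfcc, d1, d2], axis=0)   # (39, n_frames)
    return feats.T                                    # (n_frames, 13 * 3)
```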
  • As an optional embodiment, generating the target audio feature according to the audio feature corresponding to each audio frame includes:
  • S31 Determine the target audio frame from the multiple audio frames;
  • S32 Merge the audio features corresponding to a first number of audio frames before the target audio frame, the audio feature corresponding to the target audio frame, and the audio features corresponding to a second number of audio frames after the target audio frame, to obtain the target audio feature corresponding to the target audio frame.
  • the features of multiple consecutive audio frames may be combined into one target audio feature. For example: firstly determine the target audio frame from multiple audio frames, and then merge the audio features of the target audio frame and a certain number of audio frames around it into the target audio feature.
  • Optionally, a first number of audio frames may be taken before the target audio frame and a second number of audio frames after it, with the target audio frame as the center; that is, the first number plus one may equal the second number, or the first number may equal the second number plus one. For example, 7 audio frames are taken before and 8 after the target audio frame (or 8 before and 7 after), so that the Mel-frequency cepstral features of the target audio frame and the frames around it, 16 frames in total, are selected as the current target audio feature.
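The merging of a target frame with its surrounding frames can be sketched as below; padding the window with edge frames near the start and end of the resource is an assumption made for the sketch.

```python
import numpy as np

def context_window(frame_feats: np.ndarray, center: int,
                   n_before: int = 7, n_after: int = 8) -> np.ndarray:
    """Merge the features of the target frame and its neighbours
    (7 before + target + 8 after = 16 frames in the example) into one
    target audio feature, repeating edge frames near the boundaries."""
    n = len(frame_feats)
    idx = np.clip(np.arange(center - n_before, center + n_after + 1), 0, n - 1)
    return frame_feats[idx]          # shape: (16, feature_dim)
```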
  • As an optional embodiment, determining the target audio frame from the multiple audio frames includes:
  • S41 Extract phoneme frames on the audio resource;
  • S42 Determine the audio frame corresponding to each phoneme frame on the audio resource as the target audio frame.
  • the target audio frame may be, but is not limited to, determined according to the position of the phoneme frame in the audio resource.
  • The manner of extracting phoneme frames from the audio resource may be, but is not limited to, dividing each 1 s of the audio resource into 30 phoneme frames, that is, every 1/30 s is one phoneme frame.
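Assuming the phoneme frames are laid out at 30 frames per second and the audio frames at a 10 ms shift as in the examples above, the correspondence from a phoneme frame to its target audio frame could be computed roughly as follows; mapping by nearest start time is an assumption of this sketch.

```python
def audio_frame_for_phoneme_frame(phoneme_idx: int,
                                  phoneme_fps: float = 30.0,
                                  frame_shift_ms: float = 10.0) -> int:
    """Map a phoneme-frame index (30 phoneme frames per second in the
    example) to the index of the audio frame covering the same time."""
    t_ms = phoneme_idx * 1000.0 / phoneme_fps     # start time of the phoneme frame
    return int(round(t_ms / frame_shift_ms))      # nearest 10 ms audio frame

# Phoneme frame 15 starts at 500 ms and maps to audio frame 50.
assert audio_frame_for_phoneme_frame(15) == 50
```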
  • determining the target lip feature corresponding to the target audio feature includes:
  • S51 Input the target audio feature into a target feature detection model, where the target feature detection model is obtained by training an initial feature detection model using audio feature samples annotated with lip features;
  • S52 Acquire the target lip-shape feature output by the target feature detection model.
  • Optionally, the target feature detection model used to determine the target lip-shape feature corresponding to the target audio feature may be, but is not limited to, a model obtained by training an initial feature detection model with audio feature samples annotated with lip-shape features.
  • As an optional embodiment, inputting the target audio feature into the target feature detection model includes:
  • S61 Input the target audio feature into the input layer of the target feature detection model;
  • obtaining the target lip-shape feature output by the target feature detection model includes:
  • S62 Acquire the target lip-shape feature output by the output layer of the target feature detection model;
  • wherein the target feature detection model includes the input layer, a spectrum analysis network, a co-articulation network and the output layer, which are connected in sequence; the input layer is used to receive audio features and to perform an initial transformation on the received audio features through a convolutional layer;
  • the spectrum analysis network is used to process the features output by the input layer in the spectral-feature dimension;
  • the co-articulation network is used to process the features output by the spectrum analysis network in the time domain;
  • the output layer is used to map the features output by the co-articulation network into lip-shape features through a convolution kernel.
  • Optionally, the target feature detection model may include, but is not limited to, an input layer, a spectrum analysis network, a co-articulation network and an output layer that are connected in sequence.
  • The input layer is used to receive audio features and to perform an initial transformation on the received audio features through a convolutional layer; the spectrum analysis network is used to process the features output by the input layer in the spectral-feature dimension; the co-articulation network is used to process the features output by the spectrum analysis network in the time domain; and the output layer is used to map the features output by the co-articulation network into lip-shape features through a convolution kernel.
  • The target audio feature is input into the input layer of the target feature detection model, the target feature detection model automatically analyses the target audio feature, and the output information of the output layer of the target feature detection model is finally used as the target lip-shape feature.
  • FIG. 4 is a schematic diagram of an optional feature detection model according to an optional embodiment of the present disclosure.
  • As shown in Fig. 4, the target feature detection model includes an input layer for receiving audio features, a spectrum analysis network, a co-articulation network, and an output layer for outputting the target lip-shape feature.
  • the input layer receives the target audio feature, and performs an initial transformation on the received target audio feature through a convolutional layer that does not contain an activation function.
  • the spectrum analysis network analyzes the target audio features in the spectrum feature dimension.
  • Next, the co-articulation network further analyses the extracted target audio features in the time domain.
  • the output layer maps the target audio features into target lip features through a 1x1 convolution kernel.
  • Optionally, the target feature detection model may be, but is not limited to being, constructed according to the actual usage environment. For example, Fig. 5 is a schematic diagram of the configuration parameters of an optional feature detection model according to an optional embodiment of the present disclosure; the configuration parameters of each network layer included in the target feature detection model are shown in Fig. 5.
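Since Fig. 5 is not reproduced here, the concrete layer parameters are unknown; the following PyTorch sketch only mirrors the overall structure described above (an activation-free input convolution, a spectrum-analysis stage operating along the feature axis, a co-articulation stage operating along the time axis, and a 1x1 output convolution producing per-phoneme probabilities). All channel counts, kernel sizes and the pooling step are placeholder assumptions.

```python
import torch
import torch.nn as nn

class FeatureDetectionModel(nn.Module):
    """Sketch of the input layer -> spectrum-analysis network ->
    co-articulation network -> output layer pipeline."""

    def __init__(self, n_phonemes: int = 40):
        super().__init__()
        # Input layer: one convolution without an activation function,
        # performing the initial transform of the received audio features.
        self.input_layer = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        # Spectrum-analysis network: convolutions striding along the
        # spectral-feature axis (the last dimension).
        self.spectrum_net = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=(1, 3), stride=(1, 2), padding=(0, 1)),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=(1, 3), stride=(1, 2), padding=(0, 1)),
            nn.ReLU(),
        )
        # Co-articulation network: convolutions striding along the
        # time axis (the audio-frame dimension).
        self.coarticulation_net = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0)),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0)),
            nn.ReLU(),
        )
        # Output layer: a 1x1 convolution maps the features to one score
        # per phoneme; softmax turns the scores into probabilities.
        self.output_layer = nn.Conv2d(128, n_phonemes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 16 audio frames, 39 MFCC + difference features)
        h = self.input_layer(x)
        h = self.spectrum_net(h)
        h = self.coarticulation_net(h)
        h = self.output_layer(h)          # (batch, 40, t, f)
        h = h.mean(dim=(2, 3))            # pool the remaining positions
        return torch.softmax(h, dim=1)    # per-phoneme probabilities

probs = FeatureDetectionModel()(torch.randn(2, 1, 16, 39))
print(probs.shape)   # torch.Size([2, 40])
```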
  • As an optional embodiment, before the target audio feature is input into the target feature detection model, the method further includes:
  • S71 Extract the audio feature samples from audio data in a data set;
  • S72 Determine the mouth shape feature corresponding to the audio feature sample according to the text information corresponding to the audio feature sample, where the text information is used to indicate the text corresponding to the pronunciation of the audio feature sample;
  • S73 Train the initial feature detection model by using the audio feature samples annotated with the lip-shape feature to obtain the target feature detection model.
  • the foregoing data set may include, but is not limited to, two large corpus data sets, LibriSpeech and AISHELL, and so on.
  • The manner of extracting the audio feature samples from the audio data in the data set may be, but is not limited to, the same as the above manner of extracting the target audio feature corresponding to the target audio frame in the audio resource.
  • the manner of determining the lip feature corresponding to the audio feature sample according to the text information corresponding to the audio feature sample may be appropriately adjusted according to the language of the audio feature sample.
  • For English audio, the lip-shape features may be, but are not limited to being, generated using the third-party SDK Annosoft.
  • For Chinese audio, the lip-shape features may be, but are not limited to being, generated after Annosoft's configuration has been improved and adapted to Chinese audio.
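A minimal training-loop sketch for the initial feature detection model is given below, under the assumption that each annotated lip-shape feature is a per-phoneme probability target (for example, a one-hot target derived from the text alignment) and that a model such as the one sketched earlier is used; the optimizer, batch size and loss formulation are placeholder choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_feature_detector(model: nn.Module,
                           features: torch.Tensor,   # (N, 1, 16, 39) audio feature samples
                           targets: torch.Tensor,    # (N, 40) annotated per-phoneme probabilities
                           epochs: int = 10) -> nn.Module:
    loader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            probs = model(x)                         # predicted lip-shape feature
            # Cross-entropy between annotated and predicted distributions.
            loss = -(y * torch.log(probs.clamp_min(1e-8))).sum(dim=1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```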
  • As an optional embodiment, fusing the multiple lip-shape models corresponding to the multiple phonemes according to the target lip-shape feature to obtain the target lip-shape model corresponding to the target audio feature includes:
  • S81 Acquire the lip-shape model corresponding to each phoneme of the multiple phonemes;
  • S82 Using the probability corresponding to each phoneme indicated by the target lip-shape feature as a weight, fuse the lip-shape models corresponding to the phonemes into the target lip-shape model.
  • the way of making the lip model may include, but is not limited to, a way based on skeletal animation or a way based on BlendShape, etc., which is not limited in this embodiment.
  • obtaining the mouth shape model corresponding to each phoneme in the plurality of phonemes includes:
  • S91 Determine the target phoneme group into which each phoneme falls among multiple phoneme groups, wherein the multiple phonemes are divided into the multiple phoneme groups according to their pronunciation mouth shapes, and the multiple phoneme groups are in one-to-one correspondence with multiple lip-shape models;
  • S92 Determine the target mouth shape model corresponding to the target phoneme group in the plurality of mouth shape models as the mouth shape model corresponding to each phoneme.
  • Optionally, the multiple phonemes may be divided into multiple phoneme groups according to how similar their mouth shapes are when pronounced, and a corresponding lip-shape model may be made for each of the phoneme groups, yielding multiple lip-shape models in one-to-one correspondence with the phoneme groups.
  • For example, the 40 general phonemes above may be, but are not limited to being, divided into 11 phoneme groups: an AI group (IH, AE, AA, AO, EY, AY, AW), an L group (L), an E group (IY, EH), an MBP group (M, B, P), a U group (AH, UH), a WQ group (W), an O group (UW, OY, OW), a ZhChShZCS group (Z, S, CH, ZH, SH), a CDGKNRSThYZ group (ER, R, Y, N, NG, J, DH, D, G, T, K, TH, HH), an FV group (F, V) and a REST group (silent). Fig. 6 is a schematic diagram of an optional lip-shape model corresponding to a phoneme group according to an embodiment of the present disclosure. As shown in Fig. 6, phonemes belonging to the same phoneme group correspond to the same lip-shape model.
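The grouping in the example above can be captured in a simple lookup table; the code below only restates that example, and the helper dictionary names are illustrative.

```python
# Phoneme groups from the example above; every phoneme in a group shares
# one lip-shape model.
PHONEME_GROUPS = {
    "AI":   ["IH", "AE", "AA", "AO", "EY", "AY", "AW"],
    "L":    ["L"],
    "E":    ["IY", "EH"],
    "MBP":  ["M", "B", "P"],
    "U":    ["AH", "UH"],
    "WQ":   ["W"],
    "O":    ["UW", "OY", "OW"],
    "ZhChShZCS": ["Z", "S", "CH", "ZH", "SH"],
    "CDGKNRSThYZ": ["ER", "R", "Y", "N", "NG", "J", "DH",
                    "D", "G", "T", "K", "TH", "HH"],
    "FV":   ["F", "V"],
    "REST": ["silent"],
}

# Invert the table so each phoneme maps to the group (and hence to the
# lip-shape model) it falls into.
GROUP_OF_PHONEME = {p: group for group, members in PHONEME_GROUPS.items()
                    for p in members}

print(GROUP_OF_PHONEME["CH"])   # -> ZhChShZCS
```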
  • As an optional embodiment, using the target lip-shape model to generate the video resource corresponding to the audio resource includes:
  • S101 Acquire the time order of the target audio frames from the audio resource;
  • S102 Merge the target lip-shape models into the video resource according to that time order.
  • Optionally, the target lip-shape models in the video resource may be, but are not limited to being, arranged according to the time order of the target audio frames in the audio resource; the corresponding target lip-shape model is displayed in the video resource for the duration of its target audio frame in the audio resource.
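A sketch of this assembly step is shown below: the fused target lip-shape models, already sorted by the time order of their target audio frames, are each shown for the duration of one audio frame, here assumed to be 1/30 s at a 30 fps output. The `render_lip_model` helper is hypothetical and merely stands in for an actual rendering step.

```python
import numpy as np

def render_lip_model(model: np.ndarray) -> np.ndarray:
    # Placeholder stand-in for an actual 2D/3D rendering step.
    return model.copy()

def assemble_video(lip_models_in_order: list[np.ndarray],
                   audio_frame_duration_s: float = 1.0 / 30.0,
                   video_fps: float = 30.0) -> list[np.ndarray]:
    """Arrange target lip-shape models by the time order of their target
    audio frames; each model stays on screen for its frame's duration."""
    video_frames = []
    repeats = max(1, round(audio_frame_duration_s * video_fps))
    for model in lip_models_in_order:        # already sorted by time
        image = render_lip_model(model)      # hypothetical renderer
        video_frames.extend([image] * repeats)
    return video_frames
```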
  • The method according to the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the existing technology, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present disclosure.
  • According to another aspect of the embodiments of the present disclosure, there is also provided a video resource generating device for implementing the above-mentioned video resource generating method.
  • Fig. 7 is a schematic diagram of an optional device for generating video resources according to an embodiment of the present disclosure. As shown in Fig. 7, the device may include:
  • the first extraction module 72 is configured to extract the target audio feature corresponding to the target audio frame in the audio resource
  • the first determining module 74 is configured to determine a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes;
  • the fusion module 76 is configured to fuse multiple lip models corresponding to the multiple phonemes according to the target lip features to obtain a target lip model corresponding to the target audio features;
  • the generating module 78 is configured to generate a video resource corresponding to the audio resource using the target lip-shape model.
  • first extraction module 72 in this embodiment can be configured to perform step S202 in the embodiment of the present disclosure
  • first determination module 74 in this embodiment can be configured to perform step S204 in the embodiment of the present disclosure
  • the fusion module 76 in this embodiment can be configured to perform step S206 in the embodiment of the present disclosure
  • the generation module 78 in this embodiment can be configured to perform step S208 in the embodiment of the present disclosure.
  • Through the above modules, the probability that the target audio feature belongs to each of the multiple phonemes is obtained by determining the target lip-shape feature corresponding to the target audio feature extracted from the audio resource; the multiple phonemes correspond to multiple lip-shape models. According to those probabilities, the models are fused to obtain the target lip-shape model corresponding to the target audio feature, and the target lip-shape model is then used to generate the video resource corresponding to the audio resource. This achieves the purpose of generating video resources directly from audio resources, thereby reducing the complexity of the video-resource generation process and solving the technical problem in the related art that this process is relatively complex.
  • the first extraction module includes:
  • the framing unit is configured to framing the audio resource to obtain multiple audio frames
  • An extraction unit configured to extract an audio feature corresponding to each audio frame in the plurality of audio frames
  • the generating unit is configured to generate the target audio feature according to the audio feature corresponding to each audio frame.
  • the extraction unit is configured to: calculate the Mel-spectrum cepstral coefficients of each audio frame; obtain the first-order difference coefficients and second-order difference coefficients of the Mel-spectrum cepstral coefficients; and determine the Mel-spectrum cepstral coefficients, the first-order difference coefficients and the second-order difference coefficients as the audio feature corresponding to each audio frame.
  • the generating unit is configured to: determine the target audio frame from the multiple audio frames; and merge the audio features corresponding to a first number of audio frames before the target audio frame, the audio feature corresponding to the target audio frame, and the audio features corresponding to a second number of audio frames after the target audio frame, to obtain the target audio feature corresponding to the target audio frame.
  • As an optional embodiment, the generating unit is configured to: extract phoneme frames on the audio resource; and determine the audio frame corresponding to each phoneme frame on the audio resource as the target audio frame.
  • the first determining module includes:
  • An input unit configured to input the target audio feature into a target feature detection model, wherein the target feature detection model is obtained by training an initial feature detection model using audio feature samples marked with lip features;
  • the first acquiring unit is configured to acquire the target lip shape feature output by the target feature detection model.
  • the input unit is configured to: input the target audio feature into the input layer of the target feature detection model;
  • the first obtaining unit is configured to: obtain the target lip-shape feature output by the output layer of the target feature detection model;
  • wherein the target feature detection model includes the input layer, a spectrum analysis network, a co-articulation network and the output layer, which are connected in sequence; the input layer is used to receive audio features and to perform an initial transformation on the received audio features through a convolutional layer;
  • the spectrum analysis network is used to process the features output by the input layer in the spectral-feature dimension;
  • the co-articulation network is used to process the features output by the spectrum analysis network in the time domain;
  • the output layer is used to map the features output by the co-articulation network into lip-shape features through a convolution kernel.
  • the device further includes:
  • the second extraction module is configured to extract the audio feature samples from the audio data in the data set before inputting the target audio features into the target feature detection model;
  • the second determining module is configured to determine the mouth shape feature corresponding to the audio feature sample according to the text information corresponding to the audio feature sample, wherein the text information is used to indicate the text corresponding to the pronunciation of the audio feature sample;
  • the training module is configured to train the initial feature detection model using the audio feature samples marked with the lip feature to obtain the target feature detection model.
  • the fusion module includes:
  • the second acquiring unit is configured to acquire the mouth shape model corresponding to each phoneme in the plurality of phonemes
  • the fusion unit is set to use the probability corresponding to each phoneme indicated by the target mouth shape feature as a weight, and merge the mouth shape model corresponding to each phoneme into the target mouth shape model.
  • the second acquiring unit is configured to: determine the target phoneme group into which each phoneme falls among multiple phoneme groups, wherein the multiple phonemes are divided into the multiple phoneme groups according to their pronunciation mouth shapes and the multiple phoneme groups are in one-to-one correspondence with multiple lip-shape models; and determine the target lip-shape model corresponding to the target phoneme group among the multiple lip-shape models as the lip-shape model corresponding to each phoneme.
  • the generating module includes:
  • a third acquiring unit configured to acquire the time sequence of the target audio frame from the audio resource
  • the merging unit is configured to merge the target lip-shape model into the video resource according to the time sequence.
  • the above-mentioned modules can run in the hardware environment as shown in FIG. 1, and can be implemented by software or hardware, where the hardware environment includes a network environment.
  • According to another aspect of the embodiments of the present disclosure, there is also provided a server or terminal for implementing the foregoing method for generating video resources.
  • FIG. 8 is a structural block diagram of a terminal according to an embodiment of the present disclosure.
  • As shown in Fig. 8, the terminal may include: one or more processors 801 (only one is shown in the figure), a memory 803, and a transmission device 805; the terminal may also include an input/output device 807.
  • the memory 803 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the method and device for generating video resources in the embodiments of the present disclosure.
  • By running the software programs and modules stored in the memory 803, the processor 801 executes various functional applications and data processing, that is, implements the above-mentioned method for generating video resources.
  • the memory 803 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 803 may include a memory remotely provided with respect to the processor 801, and these remote memories may be connected to the terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the aforementioned transmission device 805 is configured to receive or send data via a network, and may also be configured to transmit data between the processor and the memory.
  • the aforementioned network examples may include wired networks and wireless networks.
  • the transmission device 805 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and routers via a network cable so as to communicate with the Internet or a local area network.
  • the transmission device 805 is a radio frequency (RF) module, which is configured to communicate with the Internet in a wireless manner.
  • the memory 803 is configured to store an application program.
  • the processor 801 may call the application program stored in the memory 803 through the transmission device 805 to perform the following steps:
  • extracting a target audio feature corresponding to a target audio frame in an audio resource; determining a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes; fusing a plurality of lip-shape models corresponding to the plurality of phonemes according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature; and using the target lip-shape model to generate a video resource corresponding to the audio resource.
  • a solution for generating video resources is provided.
  • The probability that the target audio feature belongs to each of the multiple phonemes is obtained by determining the target lip-shape feature corresponding to the target audio feature extracted from the audio resource; the multiple phonemes correspond to multiple lip-shape models. According to those probabilities, the models are fused to obtain the target lip-shape model corresponding to the target audio feature, and the target lip-shape model is then used to generate the video resource corresponding to the audio resource. This achieves the purpose of generating video resources directly from audio resources, thereby reducing the complexity of the video-resource generation process and solving the technical problem in the related art that this process is relatively complex.
  • The structure shown in Fig. 8 is only for illustration; the terminal can be a terminal device such as a smart phone (for example an Android phone or an iOS phone), a tablet computer, a handheld computer, a mobile Internet device (MID) or a PAD.
  • FIG. 8 does not limit the structure of the above-mentioned electronic device.
  • the terminal may also include more or fewer components (such as a network interface, a display device, etc.) than shown in FIG. 8, or have a different configuration from that shown in FIG. 8.
  • All or part of the steps in the methods of the above embodiments can be completed by instructing hardware related to the terminal device through a program, and the program can be stored in a computer-readable storage medium, which may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
  • the embodiment of the present disclosure also provides a storage medium.
  • Optionally, in this embodiment, the above-mentioned storage medium may be configured to store program code for executing the method for generating video resources.
  • the foregoing storage medium may be located on at least one of the multiple network devices in the network shown in the foregoing embodiment.
  • the storage medium is configured to store program code for executing the following steps:
  • extracting a target audio feature corresponding to a target audio frame in an audio resource; determining a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes; fusing a plurality of lip-shape models corresponding to the plurality of phonemes according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature; and using the target lip-shape model to generate a video resource corresponding to the audio resource.
  • The foregoing storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store program code.
  • the integrated unit in the foregoing embodiment is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the foregoing computer-readable storage medium.
  • the technical solution of the present disclosure essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, A number of instructions are included to enable one or more computer devices (which may be personal computers, servers, or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the disclosed client can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • In actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

Abstract

The present disclosure relates to a method and device for generating video resources. The method includes: extracting a target audio feature corresponding to a target audio frame in an audio resource; determining a target lip-shape feature corresponding to the target audio feature, wherein the target lip-shape feature is used to indicate the probability that the target audio feature belongs to each phoneme of a plurality of phonemes; fusing a plurality of lip-shape models corresponding to the plurality of phonemes according to the target lip-shape feature to obtain a target lip-shape model corresponding to the target audio feature; and using the target lip-shape model to generate a video resource corresponding to the audio resource. The present disclosure solves the technical problem in the related art that the process of generating video resources is relatively complex.

Description

一种视频资源的生成方法和装置
本公开要求于2020年05月15日提交中国专利局、优先权号为202010415045.0、发明名称为“一种视频资源的生成方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及计算机领域,尤其涉及一种视频资源的生成方法和装置。
背景技术
基于音频资源生成视频资源(比如口型动画)的方式需要首先得到音频资源对应的字幕文件,再利用音频资源和字幕文件来生成口型动画,并且对于不同语种的音频需要提供不同的配置文件。这种生成视频资源的方式过程比较复杂,操作比较繁琐。
针对上述的问题,目前尚未提出有效的解决方案。
发明内容
本公开提供了一种视频资源的生成方法和装置,以至少解决相关技术中生成视频资源的过程复杂度较高的技术问题。
一方面,本公开实施例提供了一种视频资源的生成方法,包括:
提取音频资源中的目标音频帧对应的目标音频特征;
确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
使用所述目标口型模型生成所述音频资源对应的视频资源。
另一方面,本公开实施例还提供了一种视频资源的生成装置,包括:
第一提取模块,设置为提取音频资源中的目标音频帧对应的目标音频特征;
第一确定模块,设置为确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
融合模块,设置为根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
生成模块,设置为使用所述目标口型模型生成所述音频资源对应的视频资源。
另一方面,本公开实施例还提供了一种存储介质,该存储介质包括存储的程序,程序运行时执行上述的方法。
另一方面,本公开实施例还提供了一种电子装置,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器通过计算机程序执行上述的方法。
本公开的有益效果至少包括:采用提取音频资源中的目标音频帧对应的目标音频特征;确定目标音频特征对应的目标口型特征,其中,目标口型特征用于指示目标音频特征属于多个音素中的每个音素的概率;根据目标口型特征对多个音素所对应的多个口型模型进行融合,得到目标音频特征所对应的目标口型模型;使用目标口型模型生成音频资源对应的视频资源的方式,通过确定从音频资源中提取的目标音频特征对应的目标口型特征来得到目标音频特征属于多个音素中的每 个音素的概率,这多个音素对应了多个口型模型,根据目标音频特征属于多个音素中的每个音素的概率对多个模型进行融合从而得到目标音频特征所对应的目标口型模型,再使用目标口型模型生成音频资源对应的视频资源,达到了通过音频资源直接生成视频资源的目的,从而实现了提高生成视频资源的过程复杂度的技术效果,进而解决了相关技术中生成视频资源的过程复杂度较高的技术问题。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本发明的实施例,并与说明书一起用于解释本发明的原理。
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,对于本领域普通技术人员而言,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是根据本公开实施例的视频资源的生成方法的硬件环境的示意图;
图2是根据本公开实施例的一种可选的视频资源的生成方法的流程图;
图3是根据本公开实施例的一种可选的分帧过程的示意图;
图4是根据本公开可选的实施方式的一种可选的特征检测模型的示意图;
图5是根据本公开可选的实施方式的一种可选的特征检测模型的配置参数的示意图;
图6是根据本公开实施例的一种可选的音素组对应口型模型的示意图;
图7是根据本公开实施例的一种可选的视频资源的生成装置的示 意图;
以及
图8是根据本公开实施例的一种终端的结构框图。
具体实施方式
为了使本技术领域的人员更好地理解本公开方案,下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分的实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本公开保护的范围。
需要说明的是,本公开的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本公开的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
根据本公开实施例的一方面,提供了一种视频资源的生成的方法实施例。
可选地,在本实施例中,上述视频资源的生成方法可以应用于如图1所示的由终端101和服务器103所构成的硬件环境中。如图1所示,服务器103通过网络与终端101进行连接,可用于为终端或终端上安装的客户端提供服务(如游戏服务、应用服务等),可在服务器上或独立于服务器设置数据库,用于为服务器103提供数据存储服务, 上述网络包括但不限于:广域网、城域网或局域网,终端101并不限定于PC、手机、平板电脑等。本公开实施例的视频资源的生成方法可以由服务器103来执行,也可以由终端101来执行,还可以是由服务器103和终端101共同执行。其中,终端101执行本公开实施例的视频资源的生成方法也可以是由安装在其上的客户端来执行。
图2是根据本公开实施例的一种可选的视频资源的生成方法的流程图,如图2所示,该方法可以包括以下步骤:
步骤S202,提取音频资源中的目标音频帧对应的目标音频特征;
步骤S204,确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
步骤S206,根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
步骤S208,使用所述目标口型模型生成所述音频资源对应的视频资源。
通过上述步骤S202至步骤S208,通过确定从音频资源中提取的目标音频特征对应的目标口型特征来得到目标音频特征属于多个音素中的每个音素的概率,这多个音素对应了多个口型模型,根据目标音频特征属于多个音素中的每个音素的概率对多个模型进行融合从而得到目标音频特征所对应的目标口型模型,再使用目标口型模型生成音频资源对应的视频资源,达到了通过音频资源直接生成视频资源的目的,从而实现了提高生成视频资源的过程复杂度的技术效果,进而解决了相关技术中生成视频资源的过程复杂度较高的技术问题。
在步骤S202提供的技术方案中,上述音频资源可以但不限于包括:录音文件、音乐文件、配音文件、直播音频流、通话语音等等。
可选地,在本实施例中,目标音频帧可以但不限于为音频资源中的每个音频帧或者音频资源中的关键帧。
可选地,在本实施例中,目标音频特征可以但不限于使用梅尔频率倒谱系数、梅尔频率倒谱系数的一阶差分系数、梅尔频率倒谱系数的二阶差分系数中的一种或者几种的组合。
在步骤S204提供的技术方案中,目标口型特征用于指示目标音频特征属于多个音素中的每个音素的概率。比如:目标口型特征的维度可以但不限于是多个音素的数量,目标口型特征可以但不限于是按照多个音素的顺序排列的多个概率值。
可选地,在本实施例中,多个音素可以但不限于是按照任何标准提取出的,比如:参考国际音标表提取出40个通用音素,包括:IH、AE、AA、AO、EY、AY、AW、L、IY、EH、M、B、P、AH、UH、W、UW、OY、OW、Z、S、CH、ZH、SH、ER、R、Y、N、NG、J、DH、D、G、T、K、TH、HH、F、V和silent。
可选地,在本实施例中,目标口型特征中的元素可以但不限于是按照上述顺序或者其他预先设定的顺序排列的。或者也可以不设定音素对应概率的排列顺序,目标口型特征通过以音素为键,以概率为值的键值对的形式进行表示。
在步骤S206提供的技术方案中,可以针对多个音素中的每个音素制作了其相应的口型模型用来展示该音素的发音嘴型。也可以按照发音时的嘴型将多个音素划分为N个音素组,每个音素组中包括一个音素或者多个发音嘴型相近的音素,为每个音素组制作一个口型模型,一个音素组中包括的音素均对应一个口型模型。
可选地,在本实施例中,可以但不限于将目标口型特征所指示的目标音频特征属于多个音素中的每个音素的概率作为融合权重对多个音素所对应的多个口型模型进行融合,从而得到目标音频特征所对应 的目标口型模型。
在步骤S208提供的技术方案中,将得到的目标口型模型按照一定的顺序拼接起来可以得到音频资源对应的视频资源。
作为一种可选的实施例,从音频资源中提取目标音频帧对应的目标音频特征包括:
S11,对所述音频资源进行分帧,得到多个音频帧;
S12,提取所述多个音频帧中每个音频帧对应的音频特征;
S13,根据所述每个音频帧对应的音频特征生成所述目标音频特征。
可选地,在本实施例中,对所述音频资源进行分帧的方式可以但不限于是按照预设帧移将音频资源划分为预设帧长的音频片段作为音频帧。
例如:图3是根据本公开实施例的一种可选的分帧过程的示意图,如图3所示,以20ms的帧长和10ms的帧移对音频进行分帧处理,得到第一帧为0ms到20ms,第二帧为10ms到30ms,以此类推,将音频资源划分为多个音频帧。
可选地,在本实施例中,每个音频帧对应的音频特征可以但不限于包括每个音频帧的梅尔频率倒谱系数,其中,梅尔频率倒谱系数的系数个数可以但不限于为M=13。
作为一种可选的实施例,提取所述多个音频帧中每个音频帧对应的音频特征包括:
S21,计算所述每个音频帧的梅尔频谱倒谱系数;
S22,获取所述梅尔频谱倒谱系数的一阶差分系数和二阶差分系数;
S23,将所述梅尔频谱倒谱系数,所述一阶差分系数和所述二阶差分系数确定为所述每个音频帧对应的音频特征。
可选地,在本实施例中,为了更好的体现音频资源中的动态特征,在计算每个音频帧的梅尔频谱倒谱系数之后,还可以获取梅尔频谱倒谱系数的一阶差分系数和二阶差分系数,使用该差分系数来描述音频资源中的动态特征,即声学特征在相邻特征间的变化情况。将梅尔频谱倒谱系数、一阶差分系数和二阶差分系数共同确定为每个音频帧对应的音频特征。
例如:在上述实施例中,以20ms的帧长和10ms的帧移对音频进行分帧处理,并计算每个音频帧的梅尔频率倒谱系数,系数个数为M=13。计算完音频帧的梅尔频率倒谱系数后,获取了梅尔频率倒谱系数的一阶差分系数和二阶差分系数,得到13x3维的梅尔频率倒谱特征作为每个音频帧对应的音频特征用于描述当前音频帧的包络和声学特征的变化信息。
作为一种可选的实施例,根据所述每个音频帧对应的音频特征生成所述目标音频特征包括:
S31,从所述多个音频帧中确定所述目标音频帧;
S32,对所述目标音频帧之前第一数量的音频帧对应的音频特征、所述目标音频帧对应的音频特征和所述目标音频帧之后第二数量的音频帧对应的音频特征进行合并,得到所述目标音频帧对应的所述目标音频特征。
可选地,在本实施例中,在生成目标音频特征的过程中,可以将多个连续的音频帧的特征组合成一个目标音频特征。比如:首先从多个音频帧中确定出目标音频帧,再将目标音频帧及其周围一定数量的音频帧的音频特征合并为目标音频特征。
可选地,在本实施例中,可以但不限于以目标音频帧为中心向前取第一数量的音频帧,向后取第二数量的音频帧。也就是说,第一数量加一可以等于第二数量,第一数量也可以等于第二数量加一。比如:以目标音频帧为中心向前取7个音频帧,向后取8个音频帧。也可以是以目标音频帧为中心向前取8个音频帧,向后取7个音频帧。即,以目标音频帧为中心,选取目标音频帧及其前后共16帧的音频帧的梅尔频率倒谱特征作为当前的目标音频特征。
作为一种可选的实施例,从所述多个音频帧中确定所述目标音频帧包括:
S41,在所述音频资源上提取音素帧;
S42,将所述音素帧在所述音频资源上对应的音频帧确定为所述目标音频帧。
可选地,在本实施例中,目标音频帧可以但不限于是根据音素帧在音频资源中的位置确定的。
可选地,在本实施例中,在音频资源上提取音素帧的方式可以但不限于为将1s的音频资源划分为30个音素帧,即每1/30s为一个音素帧。
作为一种可选的实施例,确定所述目标音频特征对应的目标口型特征包括:
S51,将所述目标音频特征输入目标特征检测模型,其中,所述目标特征检测模型是使用标注了口型特征的音频特征样本对初始特征检测模型进行训练得到的;
S52,获取所述目标特征检测模型输出的所述目标口型特征。
可选地,在本实施例中,可以但不限于通过构建并训练特征检测模型来确定目标音频特征对应的目标口型特征。
可选地,在本实施例中,用于确定目标音频特征对应的目标口型特征的目标特征检测模型可以但不限于是使用标注了口型特征的音频特征样本对初始特征检测模型进行训练得到的。
作为一种可选的实施例,将所述目标音频特征输入所述目标特征检测模型包括:
S61,将所述目标音频特征输入所述目标特征检测模型的输入层;
获取所述目标特征检测模型输出的所述目标口型特征包括:
S62,获取所述目标特征检测模型的输出层输出的所述目标口型特征;
其中,所述目标特征检测模型包括依次连接的所述输入层,谱分析网络,协同发音网络和所述输出层,所述输入层用于接收音频特征并通过卷积层对接收到的音频特征进行初始变换,所述谱分析网络用于在谱特征维度上对所述输入层输出的特征进行处理,所述协同发音网络用于在时域上对所述谱分析网络输出的特征进行处理,所述输出层用于通过卷积核将所述协同发音网络输出的特征映射成口型特征。
可选地,在本实施例中,目标特征检测模型可以但不限于包括依次连接的输入层,谱分析网络,协同发音网络和输出层,输入层用于接收音频特征并通过卷积层对接收到的音频特征进行初始变换,谱分析网络用于在谱特征维度上对输入层输出的特征进行处理,协同发音网络用于在时域上对谱分析网络输出的特征进行处理,输出层用于通过卷积核将协同发音网络输出的特征映射成口型特征。
可选地,在本实施例中,将目标音频特征输入目标特征检测模型的输入层,由目标特征检测模型对目标音频特征进行自动的分析,最终将目标特征检测模型的输出层的输出信息作为目标口型特征。
在一个可选的实施方式中,图4是根据本公开可选的实施方式的 一种可选的特征检测模型的示意图,如图4所示,目标特征检测模型包括用于接收音频特征的输入层,谱分析网络,协同发音网络和用于输出目标口型特征的输出层。输入层接收目标音频特征,并通过不含激活函数的卷积层对接收到的目标音频特征做初始变换。然后谱分析网络在谱特征维度上对目标音频特征进行分析。紧接着,协同发音网络在时域上对提取的目标音频特征做进一步分析。最后,输出层通过1x1的卷积核将目标音频特征映射成目标口型特征。
可选地,在本实施方式中,目标特征检测模型可以但不限于根据实际的使用环境进行构建,比如:图5是根据本公开可选的实施方式的一种可选的特征检测模型的配置参数的示意图,目标特征检测模型所包括的各个网络层的配置参数如图5所示。
作为一种可选的实施例,在将所述目标音频特征输入目标特征检测模型之前,还包括:
S71,从数据集中的音频数据中提取所述音频特征样本;
S72,根据所述音频特征样本对应的文本信息确定所述音频特征样本对应的口型特征,其中,所述文本信息用于指示音频特征样本的发音对应的文字;
S73,使用标注了所述口型特征的所述音频特征样本对所述初始特征检测模型进行训练,得到所述目标特征检测模型。
可选地,在本实施例中,上述数据集可以但不限于包括LibriSpeech和AISHELL两个大型语料数据集等等。
可选地,在本实施例中,从数据集中的音频数据中提取音频特征样本的方式可以但不限于采用上述提取音频资源中的目标音频帧对应的目标音频特征的方式。
可选地,在本实施例中,根据音频特征样本对应的文本信息确定 音频特征样本对应的口型特征的方式可以但不限于根据音频特征样本的语种来进行适当的调整。比如:针对英文音频,口型特征可以但不限于利用第三方SDK Annosoft生成。针对中文音频,口型特征可以但不限于基于Annosoft对其配置进行适应于中文音频的改进优化之后生成。
作为一种可选的实施例,根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型包括:
S81,获取所述多个音素中每个音素所对应的口型模型;
S82,以所述目标口型特征所指示的每个音素对应的概率为权重,将所述每个音素所对应的口型模型融合为所述目标口型模型。
可选地,在本实施例中,可以但不限于为每个音素制作对应的口型模型。
可选地,在本实施例中,口型模型的制作方式可以但不限于包括基于骨骼动画的方式或者基于BlendShape的方式等等,本实施例中对此不作限定。
作为一种可选的实施例,获取所述多个音素中每个音素所对应的口型模型包括:
S91,确定所述每个音素在多个音素组中所落入的目标音素组,其中,所述多个音素按照发音口型被划分为所述多个音素组,所述多个音素组与多个口型模型一一对应;
S92,将所述目标音素组在所述多个口型模型中所对应的目标口型模型确定为所述每个音素所对应的口型模型。
可选地,在本实施例中,也可以根据音素之间发音口型的相似程度将多个音素划分为多个音素组,并为多个音素组中每个音素组制作 对应的口型模型,得到与多个音素组一一对应的多个口型模型。
例如:对于参考国际音标表提取出的40个通用音素,包括:IH、AE、AA、AO、EY、AY、AW、L、IY、EH、M、B、P、AH、UH、W、UW、OY、OW、Z、S、CH、ZH、SH、ER、R、Y、N、NG、J、DH、D、G、T、K、TH、HH、F、V和silent。可以但不限于划分为11个音素组,分别为AI组,包括音素:IH、AE、AA、AO、EY、AY、AW;L组,包括音素:L;E组,包括音素:IY、EH;MBP组,包括音素:M、B、P;U组,包括音素:AH、UH;WQ组,包括音素:W;O组,包括音素:UW、OY、OW;ZHCHSHZCS组,包括音素:Z、S、CH、ZH、SH;CDGKNRSThYZ组,包括音素:ER、R、Y、N、NG、J、DH、D、G、T、K、TH、HH;FV组,包括音素:F、V;REST组,包括音素:silent。图6是根据本公开实施例的一种可选的音素组对应口型模型的示意图,如图6所示,属于同一音素组的音素对应同一个口型模型。
作为一种可选的实施例,使用所述目标口型模型生成所述音频资源对应的视频资源包括:
S101,从所述音频资源中获取所述目标音频帧的时间顺序;
S102,按照所述时间顺序将所述目标口型模型合并为所述视频资源。
可选地,在本实施例中,视频资源中的目标口型模型可以但不限于是按照音频资源中目标音频帧的时间顺序排列的。根据目标音频帧在音频资源中持续的时长在视频资源中展示相应的目标口型模型。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不 一定是本公开所必须的。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本公开各个实施例所述的方法。
根据本公开实施例的另一个方面,还提供了一种用于实施上述视频资源的生成方法的视频资源的生成装置。图7是根据本公开实施例的一种可选的视频资源的生成装置的示意图,如图7所示,该装置可以包括:
第一提取模块72,设置为提取音频资源中的目标音频帧对应的目标音频特征;
第一确定模块74,设置为确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
融合模块76,设置为根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
生成模块78,设置为使用所述目标口型模型生成所述音频资源对应的视频资源。
需要说明的是,该实施例中的第一提取模块72可以设置为执行本公开实施例中的步骤S202,该实施例中的第一确定模块74可以设置为 执行本公开实施例中的步骤S204,该实施例中的融合模块76可以设置为执行本公开实施例中的步骤S206,该实施例中的生成模块78可以设置为执行本公开实施例中的步骤S208。
此处需要说明的是,上述模块与对应的步骤所实现的示例和应用场景相同,但不限于上述实施例所公开的内容。需要说明的是,上述模块作为装置的一部分可以运行在如图1所示的硬件环境中,可以通过软件实现,也可以通过硬件实现。
通过上述模块,通过确定从音频资源中提取的目标音频特征对应的目标口型特征来得到目标音频特征属于多个音素中的每个音素的概率,这多个音素对应了多个口型模型,根据目标音频特征属于多个音素中的每个音素的概率对多个模型进行融合从而得到目标音频特征所对应的目标口型模型,再使用目标口型模型生成音频资源对应的视频资源,达到了通过音频资源直接生成视频资源的目的,从而实现了提高生成视频资源的过程复杂度的技术效果,进而解决了相关技术中生成视频资源的过程复杂度较高的技术问题。
作为一种可选的实施例,所述第一提取模块包括:
分帧单元,设置为对所述音频资源进行分帧,得到多个音频帧;
提取单元,设置为提取所述多个音频帧中每个音频帧对应的音频特征;
生成单元,设置为根据所述每个音频帧对应的音频特征生成所述目标音频特征。
作为一种可选的实施例,所述提取单元设置为:
计算所述每个音频帧的梅尔频谱倒谱系数;
获取所述梅尔频谱倒谱系数的一阶差分系数和二阶差分系数;
将所述梅尔频谱倒谱系数,所述一阶差分系数和所述二阶差分系数确定为所述每个音频帧对应的音频特征。
作为一种可选的实施例,所述生成单元设置为:
从所述多个音频帧中确定所述目标音频帧;
对所述目标音频帧之前第一数量的音频帧对应的音频特征、所述目标音频帧对应的音频特征和所述目标音频帧之后第二数量的音频帧对应的音频特征进行合并,得到所述目标音频帧对应的所述目标音频特征。
作为一种可选的实施例,所述生成单元设置为:
在所述音频资源上提取音素帧;
将所述音素帧在所述音频资源上对应的音频帧确定为所述目标音频帧。
作为一种可选的实施例,所述第一确定模块包括:
输入单元,设置为将所述目标音频特征输入目标特征检测模型,其中,所述目标特征检测模型是使用标注了口型特征的音频特征样本对初始特征检测模型进行训练得到的;
第一获取单元,设置为获取所述目标特征检测模型输出的所述目标口型特征。
作为一种可选的实施例,所述输入单元设置为:将所述目标音频特征输入所述目标特征检测模型的输入层;
所述第一获取单元设置为:获取所述目标特征检测模型的输出层输出的所述目标口型特征;
其中,所述目标特征检测模型包括依次连接的所述输入层,谱分析网络,协同发音网络和所述输出层,所述输入层用于接收音频特征 并通过卷积层对接收到的音频特征进行初始变换,所述谱分析网络用于在谱特征维度上对所述输入层输出的特征进行处理,所述协同发音网络用于在时域上对所述谱分析网络输出的特征进行处理,所述输出层用于通过卷积核将所述协同发音网络输出的特征映射成口型特征。
作为一种可选的实施例,所述装置还包括:
第二提取模块,设置为在将所述目标音频特征输入目标特征检测模型之前,从数据集中的音频数据中提取所述音频特征样本;
第二确定模块,设置为根据所述音频特征样本对应的文本信息确定所述音频特征样本对应的口型特征,其中,所述文本信息用于指示音频特征样本的发音对应的文字;
训练模块,设置为使用标注了所述口型特征的所述音频特征样本对所述初始特征检测模型进行训练,得到所述目标特征检测模型。
作为一种可选的实施例,所述融合模块包括:
第二获取单元,设置为获取所述多个音素中每个音素所对应的口型模型;
融合单元,设置为以所述目标口型特征所指示的每个音素对应的概率为权重,将所述每个音素所对应的口型模型融合为所述目标口型模型。
作为一种可选的实施例,所述第二获取单元设置为:
确定所述每个音素在多个音素组中所落入的目标音素组,其中,所述多个音素按照发音口型被划分为所述多个音素组,所述多个音素组与多个口型模型一一对应;
将所述目标音素组在所述多个口型模型中所对应的目标口型模型确定为所述每个音素所对应的口型模型。
作为一种可选的实施例,所述生成模块包括:
第三获取单元,设置为从所述音频资源中获取所述目标音频帧的时间顺序;
合并单元,设置为按照所述时间顺序将所述目标口型模型合并为所述视频资源。
此处需要说明的是,上述模块与对应的步骤所实现的示例和应用场景相同,但不限于上述实施例所公开的内容。需要说明的是,上述模块作为装置的一部分可以运行在如图1所示的硬件环境中,可以通过软件实现,也可以通过硬件实现,其中,硬件环境包括网络环境。
根据本公开实施例的另一个方面,还提供了一种用于实施上述视频资源的生成方法的服务器或终端。
图8是根据本公开实施例的一种终端的结构框图,如图8所示,该终端可以包括:一个或多个(图中仅示出一个)处理器801、存储器803、以及传输装置805,如图8所示,该终端还可以包括输入输出设备807。
其中,存储器803可设置为存储软件程序以及模块,如本公开实施例中的视频资源的生成方法和装置对应的程序指令/模块,处理器801通过运行存储在存储器803内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的视频资源的生成方法。存储器803可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器803可包括相对于处理器801远程设置的存储器,这些远程存储器可以通过网络连接至终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
上述的传输装置805设置为经由一个网络接收或者发送数据,还可以设置为处理器与存储器之间的数据传输。上述的网络实例可包括有线网络及无线网络。在一个实例中,传输装置805包括一个网络适配器(Network Interface Controller,NIC),其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置805为射频(Radio Frequency,RF)模块,其设置为通过无线方式与互联网进行通讯。
其中,可选地,存储器803设置为存储应用程序。
处理器801可以通过传输装置805调用存储器803存储的应用程序,以执行下述步骤:
提取音频资源中的目标音频帧对应的目标音频特征;
确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
使用所述目标口型模型生成所述音频资源对应的视频资源。
采用本公开实施例,提供了一种视频资源的生成的方案。通过确定从音频资源中提取的目标音频特征对应的目标口型特征来得到目标音频特征属于多个音素中的每个音素的概率,这多个音素对应了多个口型模型,根据目标音频特征属于多个音素中的每个音素的概率对多个模型进行融合从而得到目标音频特征所对应的目标口型模型,再使用目标口型模型生成音频资源对应的视频资源,达到了通过音频资源直接生成视频资源的目的,从而实现了提高生成视频资源的过程复杂度的技术效果,进而解决了相关技术中生成视频资源的过程复杂度较 高的技术问题。
可选地,本实施例中的可选示例可以参考上述实施例中所描述的示例,本实施例在此不再赘述。
本领域普通技术人员可以理解,图8所示的结构仅为示意,终端可以是智能手机(如Android手机、iOS手机等)、平板电脑、掌上电脑以及移动互联网设备(Mobile Internet Devices,MID)、PAD等终端设备。图8其并不对上述电子装置的结构造成限定。例如,终端还可包括比图8中所示更多或者更少的组件(如网络接口、显示装置等),或者具有与图8所示不同的配置。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:闪存盘、只读存储器(Read-Only Memory,ROM)、随机存取器(Random Access Memory,RAM)、磁盘或光盘等。
本公开的实施例还提供了一种存储介质。可选地,在本实施例中,上述存储介质可以设置为执行视频资源的生成方法的程序代码。
可选地,在本实施例中,上述存储介质可以位于上述实施例所示的网络中的多个网络设备中的至少一个网络设备上。
可选地,在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
提取音频资源中的目标音频帧对应的目标音频特征;
确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
使用所述目标口型模型生成所述音频资源对应的视频资源。
可选地,本实施例中的可选示例可以参考上述实施例中所描述的示例,本实施例在此不再赘述。
可选地,在本实施例中,上述存储介质可以包括但不限于:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
上述本公开实施例序号仅仅为了描述,不代表实施例的优劣。
上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。
在本公开的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本公开所提供的几个实施例中,应该理解到,所揭露的客户端,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口, 单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上所述仅是本公开的可选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本公开原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本公开的保护范围。

Claims (14)

  1. 一种视频资源的生成方法,包括:
    提取音频资源中的目标音频帧对应的目标音频特征;
    确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
    根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
    使用所述目标口型模型生成所述音频资源对应的视频资源。
  2. 根据权利要求1所述的方法,其中,从音频资源中提取目标音频帧对应的目标音频特征包括:
    对所述音频资源进行分帧,得到多个音频帧;
    提取所述多个音频帧中每个音频帧对应的音频特征;
    根据所述每个音频帧对应的音频特征生成所述目标音频特征。
  3. 根据权利要求2所述的方法,其中,提取所述多个音频帧中每个音频帧对应的音频特征包括:
    计算所述每个音频帧的梅尔频谱倒谱系数;
    获取所述梅尔频谱倒谱系数的一阶差分系数和二阶差分系数;
    将所述梅尔频谱倒谱系数,所述一阶差分系数和所述二阶差分系数确定为所述每个音频帧对应的音频特征。
  4. 根据权利要求2所述的方法,其中,根据所述每个音频帧对应的音频特征生成所述目标音频特征包括:
    从所述多个音频帧中确定所述目标音频帧;
    对所述目标音频帧之前第一数量的音频帧对应的音频特征、所述目标音频帧对应的音频特征和所述目标音频帧之后第二数量的音频帧对应的音频特征进行合并,得到所述目标音频帧对应的所述目标音频特征。
  5. 根据权利要求4所述的方法,其中,从所述多个音频帧中确定所述目标音频帧包括:
    在所述音频资源上提取音素帧;
    将所述音素帧在所述音频资源上对应的音频帧确定为所述目标音频帧。
  6. 根据权利要求1所述的方法,其中,确定所述目标音频特征对应的目标口型特征包括:
    将所述目标音频特征输入目标特征检测模型,其中,所述目标特征检测模型是使用标注了口型特征的音频特征样本对初始特征检测模型进行训练得到的;
    获取所述目标特征检测模型输出的所述目标口型特征。
  7. 根据权利要求6所述的方法,其中,
    将所述目标音频特征输入所述目标特征检测模型包括:将所述目标音频特征输入所述目标特征检测模型的输入层;
    获取所述目标特征检测模型输出的所述目标口型特征包括:获取所述目标特征检测模型的输出层输出的所述目标口型特征;
    其中,所述目标特征检测模型包括依次连接的所述输入层,谱分析网络,协同发音网络和所述输出层,所述输入层用于接收音频特征并通过卷积层对接收到的音频特征进行初始变换,所述谱分析网络用于在谱特征维度上对所述输入层输出的特征进行处理,所述协同发音 网络用于在时域上对所述谱分析网络输出的特征进行处理,所述输出层用于通过卷积核将所述协同发音网络输出的特征映射成口型特征。
  8. 根据权利要求6所述的方法,其中,在将所述目标音频特征输入目标特征检测模型之前,所述方法还包括:
    从数据集中的音频数据中提取所述音频特征样本;
    根据所述音频特征样本对应的文本信息确定所述音频特征样本对应的口型特征,其中,所述文本信息用于指示音频特征样本的发音对应的文字;
    使用标注了所述口型特征的所述音频特征样本对所述初始特征检测模型进行训练,得到所述目标特征检测模型。
  9. 根据权利要求1所述的方法,其中,根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型包括:
    获取所述多个音素中每个音素所对应的口型模型;
    以所述目标口型特征所指示的每个音素对应的概率为权重,将所述每个音素所对应的口型模型融合为所述目标口型模型。
  10. 根据权利要求9所述的方法,其中,获取所述多个音素中每个音素所对应的口型模型包括:
    确定所述每个音素在多个音素组中所落入的目标音素组,其中,所述多个音素按照发音口型被划分为所述多个音素组,所述多个音素组与多个口型模型一一对应;
    将所述目标音素组在所述多个口型模型中所对应的目标口型模型确定为所述每个音素所对应的口型模型。
  11. 根据权利要求1所述的方法,其中,使用所述目标口型模型 生成所述音频资源对应的视频资源包括:
    从所述音频资源中获取所述目标音频帧的时间顺序;
    按照所述时间顺序将所述目标口型模型合并为所述视频资源。
  12. 一种视频资源的生成装置,包括:
    第一提取模块,设置为提取音频资源中的目标音频帧对应的目标音频特征;
    第一确定模块,设置为确定所述目标音频特征对应的目标口型特征,其中,所述目标口型特征用于指示所述目标音频特征属于多个音素中的每个音素的概率;
    融合模块,设置为根据所述目标口型特征对所述多个音素所对应的多个口型模型进行融合,得到所述目标音频特征所对应的目标口型模型;
    生成模块,设置为使用所述目标口型模型生成所述音频资源对应的视频资源。
  13. 一种存储介质,所述存储介质包括存储的程序,其中,所述程序运行时执行上述权利要求1至11任一项中所述的方法。
  14. 一种电子装置,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器通过所述计算机程序执行上述权利要求1至11任一项中所述的方法。
PCT/CN2020/112683 2020-05-15 2020-08-31 一种视频资源的生成方法和装置 WO2021227308A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010415045.0 2020-05-15
CN202010415045.0A CN111698552A (zh) 2020-05-15 2020-05-15 一种视频资源的生成方法和装置

Publications (1)

Publication Number Publication Date
WO2021227308A1 true WO2021227308A1 (zh) 2021-11-18

Family

ID=72477799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112683 WO2021227308A1 (zh) 2020-05-15 2020-08-31 一种视频资源的生成方法和装置

Country Status (2)

Country Link
CN (1) CN111698552A (zh)
WO (1) WO2021227308A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278382A (zh) * 2022-06-29 2022-11-01 北京捷通华声科技股份有限公司 基于音频片段的视频片段确定方法及装置
CN116912376A (zh) * 2023-09-14 2023-10-20 腾讯科技(深圳)有限公司 口型动画生成方法、装置、计算机设备和存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359450A (zh) * 2022-01-17 2022-04-15 小哆智能科技(北京)有限公司 一种模拟虚拟人物说话的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004897A (zh) * 2009-08-31 2011-04-06 索尼公司 用于处理图像的装置、方法和程序
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
CN106297792A (zh) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 一种语音口型动画的识别方法及装置
CN108763190A (zh) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 基于语音的口型动画合成装置、方法及可读存储介质
CN109377539A (zh) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 用于生成动画的方法和装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839825B2 (en) * 2017-03-03 2020-11-17 The Governing Council Of The University Of Toronto System and method for animated lip synchronization
CN107331384B (zh) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 语音识别方法、装置、计算机设备及存储介质
CN110288682B (zh) * 2019-06-28 2023-09-26 北京百度网讯科技有限公司 用于控制三维虚拟人像口型变化的方法和装置
CN110503942A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 一种基于人工智能的语音驱动动画方法和装置
CN111081270B (zh) * 2019-12-19 2021-06-01 大连即时智能科技有限公司 一种实时音频驱动的虚拟人物口型同步控制方法
CN111145777A (zh) * 2019-12-31 2020-05-12 苏州思必驰信息科技有限公司 一种虚拟形象展示方法、装置、电子设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004897A (zh) * 2009-08-31 2011-04-06 索尼公司 用于处理图像的装置、方法和程序
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
CN106297792A (zh) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 一种语音口型动画的识别方法及装置
CN108763190A (zh) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 基于语音的口型动画合成装置、方法及可读存储介质
CN109377539A (zh) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 用于生成动画的方法和装置

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278382A (zh) * 2022-06-29 2022-11-01 北京捷通华声科技股份有限公司 基于音频片段的视频片段确定方法及装置
CN116912376A (zh) * 2023-09-14 2023-10-20 腾讯科技(深圳)有限公司 口型动画生成方法、装置、计算机设备和存储介质
CN116912376B (zh) * 2023-09-14 2023-12-22 腾讯科技(深圳)有限公司 口型动画生成方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN111698552A (zh) 2020-09-22

Similar Documents

Publication Publication Date Title
CN108305641B (zh) 情感信息的确定方法和装置
WO2021227308A1 (zh) 一种视频资源的生成方法和装置
CN105976812B (zh) 一种语音识别方法及其设备
JP6786751B2 (ja) 音声接続合成の処理方法及び装置、コンピュータ設備及びコンピュータプログラム
WO2021083071A1 (zh) 语音转换、文件生成、播音、语音处理方法、设备及介质
CN109754783B (zh) 用于确定音频语句的边界的方法和装置
JP6936298B2 (ja) 三次元仮想ポートレートの口形の変化を制御する方法および装置
CN104239394A (zh) 包括显示装置和服务器的翻译系统及其控制方法
CN111883107B (zh) 语音合成、特征提取模型训练方法、装置、介质及设备
CN109582825B (zh) 用于生成信息的方法和装置
WO2017059694A1 (zh) 一种语音模仿方法和装置
CN113205793B (zh) 音频生成方法、装置、存储介质及电子设备
CN103514882A (zh) 一种语音识别方法及系统
CN113299312A (zh) 一种图像生成方法、装置、设备以及存储介质
CN112116903A (zh) 语音合成模型的生成方法、装置、存储介质及电子设备
JP2023059937A (ja) データインタラクション方法、装置、電子機器、記憶媒体、および、プログラム
CN107680584B (zh) 用于切分音频的方法和装置
CN113436609A (zh) 语音转换模型及其训练方法、语音转换方法及系统
CN104882146A (zh) 音频推广信息的处理方法及装置
JP7372402B2 (ja) 音声合成方法、装置、電子機器及び記憶媒体
JP7375089B2 (ja) 音声応答速度確定方法、装置、コンピュータ読み取り可能な記憶媒体及びコンピュータプログラム
CN109213466B (zh) 庭审信息的显示方法及装置
JP2024507734A (ja) 音声類似度決定方法及び装置、プログラム製品
CN115967833A (zh) 视频生成方法、装置、设备计存储介质
CN112562733A (zh) 媒体数据处理方法及装置、存储介质、计算机设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20935460

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20935460

Country of ref document: EP

Kind code of ref document: A1