WO2020154916A1 - Video subtitle synthesis method and apparatus, storage medium, and electronic device


Info

Publication number: WO2020154916A1
Authority: WO, WIPO (PCT)
Prior art keywords: voice, vector, recognized, video, voiceprint
Application number: PCT/CN2019/073770
Other languages: French (fr), Chinese (zh)
Inventor: 叶青
Original Assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to CN201980076343.7A (CN113056908B)
Priority to PCT/CN2019/073770
Publication of WO2020154916A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/18 - Artificial neural networks; Connectionist approaches
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 - Details of television systems
    • H04N 5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/278 - Subtitling

Definitions

  • This application belongs to the technical field of video production, and in particular relates to a video caption synthesis method, device, storage medium and electronic equipment.
  • Video has also become a main medium of information transmission. Video can convey sound and picture information, but when the language in a video differs from that of the audience, video subtitles are used to convey the information.
  • This embodiment of the application provides a video caption synthesis method, device, storage medium, and electronic equipment, which can add speaker information and voice content to the video.
  • this embodiment of the present application provides a subtitle synthesis method applied to an electronic device, and the method includes:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • an embodiment of the present application provides a video caption synthesis device applied to an electronic device, and the device includes:
  • a voice acquisition device for acquiring voice information in a video, and obtaining a voice to be recognized according to the characteristics of the voice information
  • a voiceprint recognition device for inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
  • a voice recognition device for performing voice recognition on the voice to be recognized to obtain corresponding text information
  • the caption synthesis device is used to synthesize the voiceprint identification and text information to generate the caption of the voice to be recognized.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein when the computer program is executed on a computer, the computer is caused to execute the video subtitle synthesis method provided in the first aspect of the embodiments.
  • an embodiment of the present application provides an electronic device for video caption synthesis, including a processor and a memory, wherein the processor is configured to execute: by calling a computer program in the memory:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the embodiment of this application can use the d-vector voiceprint recognition model to recognize the speaker in the video, obtain the speaker's voiceprint identification, and finally synthesize the voiceprint identification and text information to generate subtitles with speaker information.
  • FIG. 1 is a schematic diagram of the first flow of a method for synthesizing video captions according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a second flow of a method for synthesizing video captions provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided by an embodiment of the present application.
  • Fig. 4 is a training flowchart of a d-vector voiceprint recognition model provided by an embodiment of the present application.
  • Fig. 5 is a flowchart of establishing a voiceprint database of a method for synthesizing video captions provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a second structure of a video caption synthesis device provided by an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
  • the computer execution referred to herein includes operations by a computer processing unit on electronic signals representing data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art.
  • the data structures in which the data is maintained are physical locations in memory that have particular properties defined by the data format.
  • those skilled in the art will understand that the various steps and operations described below can also be implemented in hardware.
  • a d-vector voiceprint recognition model is added on the basis of speech synthesis.
  • during subtitle synthesis, the speaker's voiceprint can be recognized and the speaker's information added to each segment of that speaker's subtitles, so that the audience can know the identity of the speaker when they see the subtitles.
  • a method for subtitle synthesis is applied to an electronic device, and the method includes:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further includes:
  • acquiring voice information in a video, and obtaining the voice to be recognized according to the characteristics of the voice information includes:
  • before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further includes:
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the voice to be recognized is input into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including d-vector features; the method further includes:
  • a voiceprint database is established according to the standard d-vector features.
  • the method further includes:
  • the WCCN method is used to perform channel compensation on the d-vector feature of the target object's voice.
  • before obtaining the voiceprint identifier corresponding to the voice, the method further includes:
  • the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  • FIG. 1 is a schematic diagram of a first flow of a method for synthesizing video captions according to an embodiment of the present application. This method is suitable for electronic devices such as computers, mobile phones, and tablets.
  • the video caption synthesis method may include:
  • step 101 the voice information in the video is obtained, and the voice to be recognized is obtained according to the characteristics of the voice information.
  • the sound information and picture information are included in the video, and the background music and the voice of the character dialogue in the video are included in the sound information.
  • the voice information of the character dialogue in the video is obtained.
  • the voice information can include the number of words spoken, the frequency of speaking, and the gender and age corresponding to the voice.
  • the voice information in the video is obtained, and the voice information contains the voice information of the target person in the video, and the voice of the target person is extracted through filtering, so as to obtain the voice to be recognized.
  • step 102 the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.
  • a voiceprint is the spectrum of a sound wave carrying speech information, as displayed by an electro-acoustic instrument. A voiceprint is not only specific but also relatively stable. Each person's voiceprint is unique, and no two speakers have the same voiceprint.
  • the speaker's voiceprint can therefore be identified through the voiceprint recognition model, and the speaker's voiceprint corresponds to the identity of the speaker.
  • the voice to be recognized is input into the d-vector voiceprint recognition model, and the d-vector feature of the voice to be recognized is compared with the standard d-vector features in the database.
  • if the d-vector feature is the same as a standard d-vector feature in the database, the speaker identity of the voice to be recognized is determined, and a voiceprint identifier is then generated according to the d-vector feature of the voice to be recognized, to be used for video subtitle synthesis in the subsequent steps.
  • step 103 speech recognition is performed on the speech to be recognized to obtain corresponding text information.
  • the voice recognition referred to in this application recognizes the speaker's voice to obtain the text information of the speaker's speech.
  • the voice recognition function can recognize the text information of the voice, but cannot obtain the voiceprint information of the voice.
  • the voice to be recognized is input into the voice recognition model to obtain the text information of the voice to be recognized.
  • the text information only corresponds to each segment of the voice; it is impossible to determine which speaker each segment belongs to.
  • step 104 the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the text of the voice to be recognized and the voiceprint identifier of the voice to be recognized are synthesized to generate a subtitle containing the identity information of the speaker.
  • the audience can better understand the video content based on the speaker's identity information and the speaker's speech content.
  • the embodiment of the application obtains the voice information in the video, obtains the voice to be recognized according to the voice information, and inputs the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identification of the voice to be recognized. Voice recognition is then performed on the voice to be recognized to obtain its text information, and finally the voiceprint identification and text information of the voice to be recognized are synthesized to generate subtitles with speaker identity information.
  • the embodiment of the application can add speaker's identity information to a video with a large number of speakers to help the audience understand the content in the video.
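  • The four steps above can be expressed as a single processing pipeline. The following is a minimal sketch of that flow; the Segment type and the two callables standing in for the voiceprint recognition and speech recognition components are illustrative assumptions, not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float   # start time of one utterance, in seconds
    end: float     # end time, in seconds
    audio: bytes   # raw audio of the voice to be recognized (step 101)

def synthesize_subtitles(segments: List[Segment],
                         recognize_speaker: Callable[[bytes], str],
                         transcribe: Callable[[bytes], str]) -> List[str]:
    """Steps 102-104: voiceprint identifier + text -> subtitle lines."""
    lines = []
    for seg in segments:
        speaker = recognize_speaker(seg.audio)   # step 102: d-vector voiceprint identifier
        text = transcribe(seg.audio)             # step 103: speech recognition -> text
        # step 104: synthesize the voiceprint identifier and text into one subtitle line
        lines.append(f"[{seg.start:.2f}-{seg.end:.2f}] {speaker}: {text}")
    return lines
```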
  • FIG. 2 is a schematic diagram of a second process of a method for synthesizing video captions according to an embodiment of the present application.
  • This method is suitable for electronic devices such as computers, mobile phones, and tablets.
  • the video caption synthesis method may include:
  • step 201 the audio track information of the video is obtained, and the voice information in the audio track that needs subtitle synthesis is extracted.
  • there are multiple audio tracks in audio and video files, and the content of each audio track is different.
  • the songs we hear in daily life are recorded using multi-track technology: the singer records a separate track when singing, the band records a separate track when playing, and other background sounds can also be recorded on separate tracks. Likewise, a video file may contain multiple audio tracks.
  • Audio track A represents the sound track of background music
  • audio track B represents the sound track of human voice
  • audio track C represents the sound track of animal sounds.
  • the only audio track that really needs subtitle annotation is audio track B; the remaining audio tracks do not need subtitle annotation. In this case, audio track B, which requires subtitles, can be extracted.
  • this embodiment can delete audio tracks for videos with multiple audio tracks to reduce the workload of video subtitle synthesis and improve the efficiency of video subtitle synthesis.
  • the speaker’s voice can be recognized through subsequent steps, and the subtitles with speaker information are finally synthesized.
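  • The patent does not name a tool for isolating the dialogue track; purely as an illustration, the sketch below assumes ffmpeg is available and uses its stream-mapping option to keep one audio track (track B in the example above) while discarding the video stream and the other tracks.

```python
import subprocess

def extract_voice_track(video_path: str, track_index: int, out_wav: str) -> str:
    """Pull one audio track out of a multi-track video file and convert it to
    16 kHz mono WAV, ready for voiceprint and speech recognition."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-map", f"0:a:{track_index}",   # keep only the audio track that needs subtitles
         "-vn",                          # drop the video stream
         "-ac", "1", "-ar", "16000",     # mono, 16 kHz
         out_wav],
        check=True)
    return out_wav

# e.g. keep audio track 1 (the human-voice track) and ignore music/effect tracks:
# extract_voice_track("movie.mp4", 1, "dialogue.wav")
```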
  • step 202 it is determined whether the voice information meets a preset voice information condition.
  • the preset voice information condition can be the speaker's speaking frequency; it can also be judged by the speaker's gender in the voice information, performing subtitle synthesis only for male or only for female voices; it can also be judged by age, since children's voices differ from adults' voices, and subtitle synthesis can be performed for one of them; in addition, information such as timbre and speech speed can also serve as the judgment basis. The preset voice information conditions can be adjusted according to actual needs.
  • if the voice information in the audio track meets the preset voice information condition, step 203 is entered; if it does not, step 204 is entered.
  • in step 203, it is confirmed that the voice information corresponds to the target object, and the to-be-recognized voice of the target object is extracted.
  • if in step 202 it is determined that the voice information meets the preset voice information conditions, it means that the voice information needs to be used for video caption synthesis.
  • the voice information corresponds to the target object, i.e. the speaker for whom subtitle synthesis is needed; the to-be-recognized voice of the target object is then extracted, and the method proceeds to the next step.
  • the target object may include multiple objects, and the voice information may include multiple voice subjects.
  • the voice subjects correspond to the target objects.
  • in step 204, no subtitle synthesis is performed on the voice.
  • if the voice information does not meet the preset voice information conditions, it means that it is not voice information that needs to be processed, and there is no need to perform subtitle synthesis on the voice.
  • step 205 the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be recognized.
  • the d-vector voiceprint recognition model used at this time has been trained.
  • the d-vector voiceprint recognition model adopts a DNN neural network structure. Using the different hidden layers of the DNN neural network, it can effectively abstract from low-level features to high-level features and extract the voiceprint information in speech.
  • FIG. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided in an embodiment of the present application.
  • the d-vector voiceprint model shown in the figure is a 4-layer DNN neural network model.
  • the model shown in the figure is only one of many situations and does not limit the application.
  • the DNN neural network model in the embodiment of the application includes an input layer, four hidden layers, and an output layer.
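  • The network of Fig. 3 only fixes the overall shape: an input layer, four hidden layers, and an output layer used during training. A minimal PyTorch sketch is given below; the layer widths, the ReLU activations, and the number of training speakers are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Input layer, four hidden layers, and an output layer over the training speakers."""
    def __init__(self, input_dim: int = 40, hidden_dim: int = 256, num_speakers: int = 100):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(4):                       # four hidden layers, as in Fig. 3
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(hidden_dim, num_speakers)   # only used during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(x))       # speaker scores for training

    def d_vector(self, x: torch.Tensor) -> torch.Tensor:
        # at enrollment/recognition time the output layer is removed and the
        # activation of the last hidden layer (hidden layer 4) is the d-vector
        return self.hidden(x)
```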
  • Fig. 4 is the training flowchart of the d-vector voiceprint recognition model provided by the embodiment of the application.
  • the training steps of the d-vector voiceprint recognition model specifically include:
  • step 301 the Mel frequency cepstrum coefficients of the speech in the video are extracted.
  • the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear Mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum.
  • the Mel frequency cepstral coefficients of the voice in the video are extracted and used as the input for training the d-vector voiceprint recognition model.
  • the voice of the target object can be the voice in the video.
  • the voice can also be the voice of the target object from other sources.
  • a large amount of voice data is used as the input data for training the d-vector voiceprint recognition model; after the Mel frequency cepstral coefficients of the voice data are extracted, the method proceeds to the next step.
  • the d-vector voiceprint recognition model that needs to be trained in this embodiment is the untrained DNN neural network model.
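  • The MFCC front end is not tied to a particular library by the patent. A minimal sketch assuming librosa, with the 16 kHz sampling rate and the number of coefficients chosen as illustrative defaults:

```python
import librosa
import numpy as np

def mfcc_frames(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of Mel-frequency cepstral coefficients
    that can be fed into the d-vector voiceprint recognition model."""
    y, sr = librosa.load(wav_path, sr=16000)                 # load and resample to 16 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, num_frames)
    return mfcc.T                                            # one row per frame
```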
  • step 302 the Mel cepstrum coefficients are input to the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model at this time is a model that has not been trained. In this step, it is only used as a training model, and the Mel frequency cepstral coefficient is input to the d-vector voiceprint recognition model for training.
  • in step 303, the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model is trained by using labels in the form of one-hot encoding as the reference target for training.
  • one-hot encoding solves the problem that classifiers cannot handle discrete data well, and to a certain extent it also plays a role in expanding the features.
  • other forms of tags can also be used as the reference target for model training, which is not limited here.
  • step 304 the d-vector voiceprint recognition model is trained by the gradient descent method until the model training is completed.
  • Gradient descent is a first-order optimization algorithm, usually called the steepest descent method.
  • the gradient descent training method can be used to train the model.
  • other methods can also be used to train the DNN neural network.
  • the d-vector voiceprint recognition model is trained in this way until training converges.
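  • As a concrete, purely illustrative reading of steps 302 to 304, the loop below trains the DNN with stochastic gradient descent; the integer speaker index passed to CrossEntropyLoss plays the role of the one-hot encoded label, and the data loader, epoch count, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

def train_dvector_model(model: nn.Module, loader, num_epochs: int = 10, lr: float = 0.01):
    """Steps 302-304: feed MFCC frames in, use the speaker label as the reference
    target, and optimize with gradient descent until training converges."""
    criterion = nn.CrossEntropyLoss()              # equivalent to a one-hot target per frame
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for mfcc_batch, speaker_idx in loader:     # speaker_idx encodes the one-hot label
            optimizer.zero_grad()
            loss = criterion(model(mfcc_batch), speaker_idx)
            loss.backward()
            optimizer.step()                       # one gradient descent step
    return model
```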
  • a voiceprint database needs to be established according to the trained model; it is used for the verification of the voice to be recognized in the subsequent steps.
  • Figure 5 is a flowchart of establishing a voiceprint database.
  • the specific process of establishing a voiceprint database includes:
  • step 401 the Mel frequency cepstrum coefficient of the voice of the target object is input into the d-vector voiceprint recognition model.
  • the purpose of establishing the voiceprint database is to use the data in the voiceprint database as the reference target in the subsequent verification steps.
  • the voice of the target object must be included, since the voiceprint database will contain the d-vector features of the target object.
  • the Mel frequency cepstral coefficients of the target object's voice are input as the input values to the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model here is the trained model. Taking the d-vector voiceprint model diagram as an example, after the output layer is removed, the output of the last hidden layer, i.e. hidden layer 4 shown in Figure 3, is the required d-vector feature.
  • step 402 the d-vector feature of the target object's voice is obtained through the d-vector voiceprint recognition model.
  • the WCCN (within-class covariance normalization) method is used in this embodiment to perform channel compensation.
  • the WCCN method scales the subspace to attenuate the dimensions with high within-class variance, so it can be used as a channel compensation technique.
  • the compensated d-vector feature, i.e. d-vector V_WCCN, is obtained.
  • S represents the target object
  • the Mel frequency cepstral coefficients of the target speech are input into the d-vector voiceprint recognition model to obtain the d-vector features, and Cholesky decomposition is then used to compute the WCCN matrix B1, where Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower triangular matrix L and its transpose; a standard form of this computation is sketched below.
  • the WCCN method is only one of the channel compensation methods; channel compensation can also be performed with methods such as LDA, PLDA, or NAP to reduce the impact of channel differences.
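  • The referenced formula is not reproduced in this extract, so the sketch below follows the common WCCN convention: average the within-class (per target object) covariance W of the enrollment d-vectors over the S target objects, take the Cholesky factor B of W^-1, and project each d-vector by B^T to obtain V_WCCN. Treat this as an assumption-labelled illustration rather than the patent's exact matrix B1.

```python
import numpy as np

def wccn_matrix(dvectors_by_speaker):
    """dvectors_by_speaker: list of (n_s, dim) arrays, one per target object.
    Returns B, the lower-triangular Cholesky factor of W^-1 (so W^-1 = B @ B.T)."""
    dim = dvectors_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    for vecs in dvectors_by_speaker:
        centered = vecs - vecs.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vecs)     # within-class covariance of one object
    W /= len(dvectors_by_speaker)                  # average over the S target objects
    return np.linalg.cholesky(np.linalg.inv(W))

def apply_wccn(B: np.ndarray, d_vector: np.ndarray) -> np.ndarray:
    return B.T @ d_vector                          # channel-compensated d-vector V_WCCN
```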
  • step 403 the d-vector features of the voice of the target object are averaged to obtain standard d-vector features.
  • the input voice of the target object can consist of multiple segments of speech. After the Mel frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is performed, a corresponding d-vector feature is generated for each segment of speech.
  • all the d-vector features obtained from the target object's speech segments are averaged to obtain a standard d-vector feature. Since a large amount of data is used to generate it, it can serve as the reference value in the database.
  • step 404 a voiceprint database is established according to standard d-vector features.
  • the standard d-vector features corresponding to each target object are generated, and these d-vector features are combined into a database used for verification of the voice to be recognized in the subsequent steps.
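  • Steps 403 and 404 amount to averaging the compensated d-vectors of each target object and storing the result under that object's identifier. A minimal sketch (the dictionary used as the database and the example labels are illustrative assumptions):

```python
import numpy as np

def build_voiceprint_database(dvectors_by_speaker: dict) -> dict:
    """dvectors_by_speaker maps a speaker identifier to a (n_segments, dim) array of
    channel-compensated d-vectors; the standard d-vector is their mean."""
    return {speaker: vecs.mean(axis=0) for speaker, vecs in dvectors_by_speaker.items()}

# e.g. database = build_voiceprint_database({"speaker_A": vecs_a, "speaker_B": vecs_b})
```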
  • step 206 the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated.
  • the d-vector feature of the voice to be recognized is generated.
  • the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated to verify whether the d-vector feature of the voice to be recognized is a d-vector feature of a target object in the voiceprint database.
  • PLP (Perceptual Linear Predictive)
  • GMM (Gaussian Mixture Model)
  • step 207 it is determined whether the cosine distance is less than a threshold.
  • the threshold value of the cosine distance is set as a standard to determine whether the d-vector feature of the voice to be identified matches the standard d-vector feature.
  • the threshold value of the cosine distance is set to 0.5.
  • if the cosine distance between the d-vector feature and the standard d-vector feature is 0.2, it means that the d-vector feature of the voice to be recognized matches the standard d-vector feature.
  • in step 208, it is determined that the d-vector feature of the voice to be recognized is a d-vector feature in the voiceprint database.
  • in step 207, when the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is less than the threshold, it is determined that the d-vector feature of the voice to be recognized matches the database, and the next step is performed.
  • in step 209, it is determined that the d-vector feature of the voice to be recognized is not a d-vector feature in the voiceprint database.
  • in step 207, if the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is not less than the threshold, it is determined that the d-vector feature of the voice to be recognized does not match the database, and there is no need to perform video caption synthesis for this voice.
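  • Steps 206 to 209 compare the d-vector of the voice to be recognized against every standard d-vector using cosine distance and the 0.5 threshold mentioned above. A minimal NumPy sketch, returning None when nothing in the voiceprint database matches:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_voiceprint(d_vec: np.ndarray, database: dict, threshold: float = 0.5):
    """Return the identifier of the best-matching standard d-vector (step 208),
    or None if the smallest cosine distance is not below the threshold (step 209)."""
    best_speaker, best_dist = None, float("inf")
    for speaker, standard in database.items():
        dist = cosine_distance(d_vec, standard)
        if dist < best_dist:
            best_speaker, best_dist = speaker, dist
    return best_speaker if best_dist < threshold else None
```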
  • step 210 the voiceprint identifier of the voice to be recognized is obtained according to the d-vector feature, and the voiceprint identifier and the text information are synthesized into subtitles.
  • the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database; the voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying the speaker's identity information.
  • the audience can know exactly which sentence was said by which person, so as to help understand the content in the video.
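  • Finally, step 210 attaches the voiceprint identifier to the recognized text. The patent does not prescribe a subtitle file format; the sketch below emits SubRip (.srt) entries as one plausible choice, with the timestamps assumed to come from the earlier segmentation of the voice to be recognized.

```python
def to_srt(entries) -> str:
    """entries: iterable of (start_sec, end_sec, speaker_id, text) tuples."""
    def stamp(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, speaker, text) in enumerate(entries, 1):
        # each subtitle carries both the speaker identity and the speech content
        blocks.append(f"{i}\n{stamp(start)} --> {stamp(end)}\n{speaker}: {text}\n")
    return "\n".join(blocks)

# print(to_srt([(1.0, 3.2, "Speaker A", "Hello, everyone.")]))
```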
  • the embodiment of the present application determines the audio track information in the video, deletes useless audio tracks, extracts the voice information from the audio tracks that require subtitle synthesis, and then determines whether the voice information meets the preset voice information conditions.
  • the preset voice information conditions are met, it is confirmed that the voice information corresponds to the target object, and the voice to be recognized of the target object is extracted.
  • the voice to be recognized is input into the trained d-vector voiceprint recognition model; it is therefore first necessary to train the d-vector voiceprint recognition model, and after the trained model is obtained, a large number of the target object's voice samples are input to obtain the standard d-vector features of the target object.
  • a database is generated from the standard d-vector features; the voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match.
  • in that case, the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person, and finally the text information of the voice to be recognized is obtained,
  • and the text information of the voice to be recognized and the voiceprint identification generated according to the d-vector feature are synthesized to generate video subtitles with speaker information.
  • the efficiency of speech recognition can be improved through the screening of audio tracks and the screening of voice information.
  • the WCCN method is used for channel compensation, which reduces the impact of channel differences.
  • the video caption synthesis with speaker information is realized, which can help the audience understand the content of the video.
  • an apparatus for subtitle synthesis includes:
  • the voice acquisition module is used to acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information;
  • a voiceprint recognition module configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
  • a voice recognition module configured to perform voice recognition on the voice to be recognized to obtain corresponding text information
  • the subtitle synthesis module is used to synthesize the voiceprint identification and text information to generate the subtitle of the voice to be recognized.
  • a device for synthesizing subtitles wherein the voice acquisition module includes:
  • the extraction module is used to obtain audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information in audio tracks that require subtitle synthesis;
  • the first judging module is configured to judge whether the voice information meets the preset voice information condition, and if so, confirm that the voice information corresponds to the target object, and extract the voice to be recognized of the target object.
  • an apparatus for subtitle synthesis includes:
  • a training module for extracting the Mel frequency cepstrum coefficient of the voice in the video
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the voiceprint recognition module includes:
  • the database module is used to input the Mel frequency cepstrum coefficient of the target object's voice into the d-vector voiceprint recognition model;
  • a voiceprint database is established according to the standard d-vector features.
  • the voiceprint recognition module includes:
  • the second judgment module is configured to input the voice to be identified into a d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
  • the video caption synthesis device includes a speech acquisition module 510, a training module 520, a voiceprint recognition module 530, a speech recognition module 540, and a caption synthesis module 550.
  • the voice acquisition module 510 is configured to acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information.
  • the voice information acquired by the voice acquisition module 510 contains the voice of the character dialogue in the video.
  • the voice information may include the number of words spoken, the frequency of speaking, and the gender and age corresponding to the voice.
  • the voice information in the video is obtained, and the voice information contains the voice information of the target person in the video, and the voice of the target person is extracted through filtering, so as to obtain the voice to be recognized.
  • FIG. 7 is a schematic diagram of a second structure of a video caption synthesis device according to an embodiment of the present application.
  • the voice acquisition module 510 further includes an extraction module 511 and a first judgment module 512.
  • the extraction module 511 is used to obtain audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information in audio tracks that require subtitle synthesis.
  • the first determining module 512 is configured to determine whether the voice information meets the preset voice information condition, and if so, confirm that the voice information corresponds to the target object, and extract the voice to be recognized of the target object.
  • the preset voice information condition can be the speaker's speaking frequency; it can also be judged by the speaker's gender in the voice information, performing subtitle synthesis only for male or only for female voices; it can also be judged by age, since children's voices differ from adults' voices, and subtitle synthesis can be performed for one of them; in addition, information such as timbre and speech speed can also serve as the judgment basis. The preset voice information conditions can be adjusted according to actual needs.
  • the training module 520 is used to extract the Mel frequency cepstrum coefficients of the target object's voice
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the training module 520 trains the model on a large number of voice samples, using labels in the form of one-hot encoding as the reference target for training the d-vector voiceprint recognition model.
  • one-hot encoding solves the problem that classifiers cannot handle discrete data well, and to a certain extent it also plays a role in expanding the features.
  • the voiceprint recognition module 530 is used to input the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.
  • the voiceprint recognition module 530 includes a database module 531 and a second judgment module 532.
  • the database module 531 is configured to input the Mel frequency cepstrum coefficient of the target object's voice into the d-vector voiceprint recognition model;
  • a voiceprint database is established according to the standard d-vector features.
  • the input voice of the target object can be multiple segments of speech.
  • a corresponding d-vector feature will be generated for each segment of speech.
  • all the d-vector features obtained from the target object's speech segments are averaged to obtain a standard d-vector feature. Since a large amount of data is used to generate it, it can serve as the reference value in the database.
  • the second judgment module 532 is configured to input the voice to be identified into a d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • the threshold value of the cosine distance is set as a standard to determine whether the d-vector feature of the voice to be identified matches the standard d-vector feature.
  • the threshold value of the cosine distance is set to 0.5.
  • if the cosine distance between the d-vector feature and the standard d-vector feature is 0.2, it means that the d-vector feature of the voice to be recognized matches the standard d-vector feature.
  • the voice recognition module 540 is configured to perform voice recognition on the to-be-recognized voice to obtain corresponding text information.
  • the subtitle synthesis module 550 is configured to synthesize the voiceprint identification and text information to generate subtitles of the voice to be recognized.
  • the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database; the voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying the speaker's identity information.
  • the audience can know exactly which sentence was said by which person, so as to help understand the content in the video.
  • the embodiment of the present application determines the audio track information in the video, deletes useless audio tracks, extracts the voice information from the audio tracks that require subtitle synthesis, and then determines whether the voice information meets the preset voice information conditions.
  • the preset voice information conditions are met, it is confirmed that the voice information corresponds to the target object, and the voice to be recognized of the target object is extracted.
  • the voice to be recognized is input into the trained d-vector voiceprint recognition model; it is therefore first necessary to train the d-vector voiceprint recognition model, and after the trained model is obtained, a large number of the target object's voice samples are input to obtain the standard d-vector features of the target object.
  • a database is generated from the standard d-vector features; the voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match.
  • in that case, the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person, and finally the text information of the voice to be recognized is obtained,
  • and the text information of the voice to be recognized and the voiceprint identification generated according to the d-vector feature are synthesized to generate video subtitles with speaker information.
  • the efficiency of speech recognition can be improved through the screening of audio tracks and the screening of voice information.
  • the WCCN method is used for channel compensation, which reduces the impact of channel differences.
  • the video caption synthesis with speaker information is realized, which can help the audience understand the content of the video.
  • the video caption synthesis device belongs to the same concept as the video caption synthesis method in the above embodiments. Any method provided in the video subtitle synthesis method embodiments can be run on the video subtitle synthesis device; for the details of its specific implementation process, refer to the embodiments of the method for synthesizing video captions, which will not be repeated here.
  • the term "module" used herein can be regarded as a software object executed on the operating system.
  • the different components, modules, engines, and services described in this article can be regarded as implementation objects on the computing system.
  • the devices and methods described herein can be implemented in the form of software, or of course, can also be implemented on hardware, and they are all within the protection scope of the present application.
  • an embodiment of the present application provides a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any video caption synthesis method provided in the embodiments of the present application.
  • the instruction can perform the following steps:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks or optical discs, etc.
  • the steps of any video caption synthesis method provided by the embodiments of the present application can thus be implemented.
  • the beneficial effects that can be achieved are detailed in the previous embodiments, and will not be repeated here.
  • the embodiment of the present application also provides an electronic device, such as a tablet computer, a mobile phone, and other electronic devices.
  • the processor in the electronic device will load the instructions corresponding to the process of one or more application programs into the memory according to the following steps, and the processor will run the application programs stored in the memory to implement various functions:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
  • when the voice information in the video is obtained and the voice to be recognized is obtained according to the characteristics of the voice information, the processor is configured to perform the following steps:
  • before the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • when the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
  • a voiceprint database is established according to the standard d-vector features.
  • after acquiring the d-vector feature of the target object's voice through the d-vector voiceprint recognition model, the processor is configured to perform the following steps:
  • the WCCN method is used to perform channel compensation on the d-vector feature of the target object's voice.
  • before obtaining the voiceprint identifier corresponding to the voice, the processor is configured to perform the following steps:
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • FIG. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
  • the electronic device 700 includes a processor 701, a memory 702, a display 703, a radio frequency circuit 704, an audio module 705, and a power supply 706.
  • the processor 701 is the control center of the electronic device 700. It uses various interfaces and lines to connect the various parts of the entire electronic device and, by running or loading a computer program stored in the memory 702 and calling data stored in the memory 702, executes the various functions of the electronic device 700 and processes data, thereby monitoring the electronic device 700 as a whole.
  • the memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the computer programs and modules stored in the memory 702.
  • the memory 702 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, a computer program required by at least one function (such as a sound playback function or an image playback function), and the like; the storage data area may store data created according to the use of the electronic device, and the like.
  • the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
  • the processor 701 in the electronic device 700 loads the instructions corresponding to the process of one or more computer programs into the memory 702 according to the following steps, and the instructions are run by the processor 701 and stored in the memory 702 In order to realize various functions in the computer program, as follows:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the display 703 may be used to display information input by the user or information provided to the user, and various graphical user interfaces. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display 703 may include a display panel.
  • the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 704 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network equipment or other electronic equipment and to transmit and receive signals with the network equipment or other electronic equipment.
  • the audio module 705 includes dual speakers and audio circuits.
  • on one hand, the audio circuit can convert received audio data into an electrical signal and transmit it to the dual speakers, which convert it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit and converted into audio data. The audio data is then processed by the audio data output processor 701 and sent, for example, to another terminal through the radio frequency circuit 704, or output to the memory 702 for further processing.
  • the audio circuit may also include an earplug jack to provide communication between the peripheral earphone and the terminal.
  • the power supply 706 can be used to power various components of the electronic device 700.
  • the power supply 706 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the electronic device 700 may also include a camera, a Bluetooth module, etc., which will not be described in detail here.
  • the storage medium may be a magnetic disk, an optical disc, a read only memory (Read Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • those of ordinary skill in the art can understand that all or part of the process of implementing the video caption synthesis method of the embodiments of the present application can be completed under the control of a computer program.
  • the computer program can be stored in a computer readable storage medium, such as stored in the memory of an electronic device, and executed by at least one processor in the electronic device.
  • the execution process may include the flow of the embodiments of the video caption synthesis method.
  • the storage medium can be magnetic disk, optical disk, read-only memory, random access memory, etc.
  • for the video caption synthesis device of the embodiments of the present application, its functional modules may be integrated in one processing chip, or each module may exist alone physically, or two or more modules may be integrated in one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Disclosed in the present application are a video subtitle synthesis method, comprising: obtaining voice information in a video, and obtaining a voice to be recognized according to a feature of the voice information; inputting the voice to be recognized to a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature; performing voice recognition on the voice to be recognized to obtain corresponding text information; and synthesizing the voiceprint identifier and the text information to generate subtitles of the voice to be recognized.

Description

Video caption synthesis method, device, storage medium and electronic equipment
Technical field
This application belongs to the technical field of video production, and in particular relates to a video caption synthesis method, device, storage medium and electronic equipment.
Background art
With the rapid development of smart terminals, the way humans receive and store data is no longer limited to pictures or text; video has also become a main medium of information transmission. Video can convey sound and picture information, but when the language in a video is different, video subtitles are used to convey the information.
Application content
This embodiment of the application provides a video caption synthesis method, device, storage medium, and electronic equipment, which can add speaker information and voice content to the video.
In the first aspect, this embodiment of the present application provides a subtitle synthesis method applied to an electronic device, and the method includes:
acquiring voice information in the video, and obtaining the voice to be recognized according to the characteristics of the voice information;
inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
performing voice recognition on the voice to be recognized to obtain corresponding text information;
synthesizing the voiceprint identification and text information to generate subtitles of the voice to be recognized.
In a second aspect, an embodiment of the present application provides a video caption synthesis device applied to an electronic device, and the device includes:
a voice acquisition device, configured to acquire voice information in a video and obtain a voice to be recognized according to the characteristics of the voice information;
a voiceprint recognition device, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
a voice recognition device, configured to perform voice recognition on the voice to be recognized to obtain corresponding text information;
a caption synthesis device, configured to synthesize the voiceprint identification and text information to generate the caption of the voice to be recognized.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein when the computer program is executed on a computer, the computer is caused to execute the video caption synthesis method provided in the first aspect of the embodiments.
In a fourth aspect, an embodiment of the present application provides an electronic device for video caption synthesis, including a processor and a memory, wherein the processor is configured, by calling a computer program in the memory, to execute:
acquiring voice information in the video, and obtaining the voice to be recognized according to the characteristics of the voice information;
inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
performing voice recognition on the voice to be recognized to obtain corresponding text information;
synthesizing the voiceprint identification and text information to generate subtitles of the voice to be recognized.
The embodiments of this application can use the d-vector voiceprint recognition model to recognize the speaker in the video, obtain the speaker's voiceprint identification, and finally synthesize the voiceprint identification and text information to generate subtitles with speaker information.
Description of the drawings
The following detailed description of specific implementations of the application, with reference to the accompanying drawings, will make the technical solutions of the application and their beneficial effects obvious.
FIG. 1 is a schematic diagram of the first flow of a method for synthesizing video captions according to an embodiment of the present application.
FIG. 2 is a schematic diagram of a second flow of a method for synthesizing video captions provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided by an embodiment of the present application.
Fig. 4 is a training flowchart of a d-vector voiceprint recognition model provided by an embodiment of the present application.
Fig. 5 is a flowchart of establishing a voiceprint database of a method for synthesizing video captions provided by an embodiment of the present application.
FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
Fig. 7 is a schematic diagram of a second structure of a video caption synthesis device provided by an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
Detailed Description

Please refer to the drawings, in which identical reference numerals represent identical components. The principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments that are not described in detail herein.

In the following description, specific embodiments of the present application are described with reference to steps and symbols executed by one or more computers, unless otherwise stated. These steps and operations will therefore be referred to several times as being computer-executed. Computer execution as used herein includes operations by a computer processing unit on electronic signals representing data in a structured form. Such operations transform the data or maintain it at locations in the computer's memory system, which may reconfigure or otherwise alter the operation of the computer in a manner well known to those skilled in the art. The data structures maintained by the data are physical locations in the memory that have particular properties defined by the data format. However, although the principles of the application are described in the above terms, this is not meant as a limitation; those skilled in the art will appreciate that the various steps and operations described below may also be implemented in hardware.

The terms "first", "second", and "third" in this application are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not limited to the listed steps or modules; some embodiments further include steps or modules that are not listed, or further include other steps or modules inherent to the process, method, product, or device.

Detailed descriptions are given below.
With the rapid development of big data technology, the data stored by humans is no longer limited to text and pictures; video has also become a primary medium of information transmission. Subtitles help different audiences better understand video content and speed up the sharing of videos across languages. However, in some programs it is difficult to determine the specific speaker from the text content alone, and users may have difficulty understanding the video content. In the embodiments of this application, a d-vector voiceprint recognition model is added on the basis of speech recognition and subtitle synthesis, so that during subtitle synthesis the speaker's voiceprint can be recognized and speaker information can be added to each segment of subtitles; viewers can thus know the speaker's identity when reading the subtitles.
In an embodiment, a subtitle synthesis method is applied to an electronic device, and the method includes:

acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;

inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;

performing speech recognition on the voice to be recognized to obtain corresponding text information; and

synthesizing the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

In an embodiment, before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further includes:

acquiring audio track information in the video; and

deleting audio tracks that do not require subtitle synthesis, and extracting voice information from audio tracks that require subtitle synthesis.

In an embodiment, acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information includes:

determining whether the voice information meets a preset voice information condition; and

if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.

In an embodiment, before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further includes:

extracting Mel-frequency cepstral coefficients of the voice in the video;

inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

using one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.

In an embodiment, inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature, further includes:

inputting the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtaining d-vector features of the target object's voice through the d-vector voiceprint recognition model;

averaging the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establishing a voiceprint database according to the standard d-vector feature.

In an embodiment, after obtaining the d-vector features of the target object's voice through the d-vector voiceprint recognition model, the method further includes:

performing channel compensation on the d-vector features of the target object's voice using the WCCN method.

In an embodiment, before obtaining the voiceprint identifier corresponding to the voice, the method further includes:

inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;

calculating the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

determining whether the cosine distance is less than a threshold; and

if so, determining that the d-vector feature of the voice to be identified matches a d-vector feature in the voiceprint database.
Please refer to FIG. 1, which is a first schematic flowchart of a video subtitle synthesis method provided by an embodiment of the present application. The method is applicable to electronic devices such as computers, mobile phones, and tablets. The video subtitle synthesis method may include the following steps.

In step 101, voice information in a video is acquired, and a voice to be recognized is obtained according to characteristics of the voice information.

It can be understood that a video contains sound information and picture information, and the sound information contains background music and the speech of characters talking in the video. The voice information of the character dialogue in the video is acquired; the voice information may include the number of words spoken, the frequency of speaking, and information such as the gender and age corresponding to the voice.

For a video requiring subtitle synthesis, the voice information in the video is acquired. The voice information contains the voice of the target person in the video; through filtering, the target person's voice is extracted, thereby obtaining the voice to be recognized.
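As a rough illustration of how the speech extraction in step 101 could be approximated, the sketch below performs a crude energy-based voice activity detection on an already-decoded mono waveform; the function name, frame length, and energy threshold are assumptions made only for this example and are not part of this application.

```python
import numpy as np

def extract_speech_segments(waveform, sample_rate, frame_ms=30, energy_ratio=0.1):
    """Return (start_sec, end_sec) spans whose short-time energy suggests speech.

    A simple stand-in for the filtering described in step 101.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    threshold = energy_ratio * energy.max()

    segments, start = [], None
    for i, e in enumerate(energy):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```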
In step 102, the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.

It can be understood that a voiceprint is the spectrum of sound waves carrying speech information, as displayed by electro-acoustic instruments. A voiceprint is not only specific but also relatively stable; every person's voiceprint is unique, and no two speakers have the same voiceprint.

In the process of video subtitle synthesis, in addition to using a speech recognition model to recognize the content of the speaker's speech and obtain the text information of the voice, the speaker's voiceprint can also be recognized through a voiceprint recognition model, and the speaker's voiceprint corresponds to the speaker's identity information.

In the embodiment of this application, the voice to be recognized is input into the d-vector voiceprint recognition model, and the d-vector feature of the voice to be recognized is then compared with the standard d-vector features in the database. If the d-vector feature of the voice to be recognized matches a standard d-vector feature in the database, the speaker identity of the voice to be recognized is determined; a voiceprint identifier is then generated according to the d-vector feature of the voice to be recognized, for use in the subsequent video subtitle synthesis steps.

In step 103, speech recognition is performed on the voice to be recognized to obtain corresponding text information.

It can be understood that the speech recognition referred to in this application recognizes the speaker's speech and obtains its text information; the speech recognition function can recognize the text of the speech but cannot obtain the voiceprint information of the speech.

The voice to be recognized is input into a speech recognition model to obtain its text information. At this point, the text information corresponds only to each segment of speech, and the speaker of each segment cannot yet be determined.

In step 104, the voiceprint identifier and the text information are synthesized to generate subtitles for the voice to be recognized.

It can be understood that, after the voiceprint identifier of the voice to be recognized and the text information of the voice to be recognized are obtained, the text and the voiceprint identifier are synthesized to generate subtitles containing the speaker's identity information. When watching the video, viewers can better understand the content by combining the speaker's identity with what the speaker says.

In summary, the embodiment of this application acquires the voice information in the video, obtains the voice to be recognized from the voice information, inputs the voice to be recognized into the d-vector voiceprint recognition model to obtain its voiceprint identifier, performs speech recognition on the voice to be recognized to obtain its text information, and finally synthesizes the voiceprint identifier and the text information to generate subtitles carrying speaker identity information. The embodiment of this application can add speaker identity information to videos with multiple speakers, helping viewers understand the content of the video.
Please refer to FIG. 2, which is a second schematic flowchart of a video subtitle synthesis method provided by an embodiment of the present application. The method is applicable to electronic devices such as computers, mobile phones, and tablets. The video subtitle synthesis method may include the following steps.

In step 201, the audio track information of the video is acquired, and voice information is extracted from the audio tracks that require subtitle synthesis.

It can be understood that an audio-video file contains multiple audio tracks, and each track carries different content. For example, the songs we hear in daily life are recorded using multi-track technology: the singer's vocals are recorded on a separate track, the band's performance on another, and other background sounds can also be recorded on their own track. Accordingly, a video file may also contain multiple audio tracks.

For example, the audio track information of the acquired video is obtained: track A carries the background music, track B carries the human voices, and track C carries animal sounds. Among these three tracks, only track B actually requires subtitle annotation; the remaining tracks do not, so track B, which requires subtitle synthesis, can be extracted.

It should be noted that for multi-track videos this embodiment may delete audio tracks to reduce the workload of video subtitle synthesis and improve its efficiency. For a single track, or for other complex tracks that contain both speaker voices and other noise, there is no need to identify and delete tracks; the speaker's voice can be recognized in the subsequent steps, and subtitles carrying speaker information are ultimately synthesized.
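As a rough illustration of the track selection in step 201, the desired audio stream could be demultiplexed with a command-line tool such as ffmpeg; the stream index, file names, and output format below are assumptions made only for this sketch.

```python
import subprocess

def extract_voice_track(video_path: str, track_index: int, out_wav: str) -> None:
    """Demux a single audio stream (e.g. the human-voice track) to a WAV file.

    `-map 0:a:<n>` selects the n-th audio stream of the first input and
    `-vn` drops the video stream; 16 kHz mono PCM is a common front-end
    format for later MFCC extraction.
    """
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-map", f"0:a:{track_index}",
        "-vn", "-ac", "1", "-ar", "16000",
        out_wav,
    ], check=True)

# e.g. track B (audio stream index 1) of a hypothetical input file:
# extract_voice_track("program.mp4", 1, "voice_track.wav")
```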
In step 202, it is determined whether the voice information meets a preset voice information condition.

After the voice information contained in the audio track is obtained, the voice information needs to be judged to further filter out the voice information that requires subtitle synthesis.

For example, the judgment can be made based on how frequently a speaker speaks in the voice information; in that case the preset voice information condition may be the speaker's speaking frequency. The judgment can also be made based on the speaker's gender, so that subtitle synthesis targets only male or only female voices, or based on age, since children's voices differ from adults' voices and subtitle synthesis can target one of them. In addition, information such as timbre and speaking rate can also serve as the basis for judgment. The preset voice information condition can be adjusted according to actual needs.

If the voice information in the audio track meets the preset voice condition, the process proceeds to step 203; if not, the process proceeds to step 204.

In step 203, the voice information corresponds to a target object, and the voice to be recognized of the target object is extracted.

If it is determined in step 202 that the voice information meets the preset voice information condition, the voice information needs to be used for video subtitle synthesis. In this case the voice information corresponds to a target object, and the target object is a speaker who requires subtitle synthesis; the voice to be recognized of the target object is then extracted, and the process proceeds to the next step.

It should be noted that the target object may include multiple objects, and the voice information may contain multiple voice subjects; the voice subjects correspond to the target objects. In a video requiring subtitle synthesis, the voice to be recognized of the target object is extracted.

In step 204, no subtitle synthesis is performed on the voice.

If the voice information does not meet the preset voice information condition, the voice information does not need to be processed, and no subtitle synthesis is performed on the voice.
In step 205, the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be recognized.

When the voice to be recognized is input into the d-vector voiceprint recognition model, the model used at this point has already been trained. The d-vector voiceprint recognition model adopts a DNN neural network structure; by exploiting the different hidden layers of the DNN, it can effectively extract features ranging from low-level to high-level and extract the voiceprint information in the speech.

Please refer to FIG. 3, which is a schematic diagram of the DNN neural network of the d-vector voiceprint recognition model provided by an embodiment of the present application.

The d-vector voiceprint model shown in the figure is a 4-layer DNN neural network model; the model shown is only one of many possibilities and does not limit the application. The DNN neural network model in this embodiment of the application includes an input layer, four hidden layers, and an output layer.
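A minimal sketch of such a network is given below, assuming stacked MFCC frames as input and a softmax over training speakers as output; the layer width, input dimension, and the use of PyTorch are illustrative assumptions rather than part of this application.

```python
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Input layer -> 4 fully connected hidden layers -> softmax output.

    At enrollment/recognition time the output layer is discarded and the
    activations of the last hidden layer are used as the d-vector.
    """
    def __init__(self, input_dim=13 * 21, hidden_dim=256, num_speakers=500):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.output = nn.Linear(hidden_dim, num_speakers)  # used only during training

    def forward(self, x, return_dvector=False):
        h = self.hidden(x)
        return h if return_dvector else self.output(h)
```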
When training the d-vector voiceprint recognition model, a large amount of speech data needs to be fed into the DNN neural network before a d-vector voiceprint recognition model fit for normal use can be obtained. For the specific training steps, please refer to FIG. 4, which is a training flowchart of the d-vector voiceprint recognition model provided by an embodiment of the present application. The training steps of the d-vector voiceprint recognition model specifically include the following.

In step 301, Mel-frequency cepstral coefficients of the speech in the video are extracted.

In the field of sound processing, the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum.

In this embodiment of the application, the Mel-frequency cepstral coefficients of the speech in the video are extracted and used as input to the d-vector voiceprint recognition model being trained. The speech of the target object may come from the video or from other sources; a large amount of speech data serves as the input data for training the d-vector voiceprint recognition model. After the Mel-frequency cepstral coefficients of the speech data are extracted, the next step is performed.
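As an illustration of step 301, MFCCs could be computed with an off-the-shelf library such as librosa; the sample rate, coefficient count, and file name are assumptions used only for this sketch.

```python
import librosa
import numpy as np

def compute_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.T
```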
It should be noted that the d-vector voiceprint recognition model to be trained in this embodiment is the not-yet-trained DNN neural network model.

In step 302, the Mel-frequency cepstral coefficients are input into the d-vector voiceprint recognition model.

It can be understood that at this point the d-vector voiceprint recognition model is a model whose training has not been completed; in this step it serves only as the model being trained, and the Mel-frequency cepstral coefficients are input into the d-vector voiceprint recognition model for training.

In step 303, labels in one-hot encoded form are used as the training reference target of the d-vector voiceprint recognition model.

In this embodiment of the application, the d-vector voiceprint recognition model is trained by using labels in one-hot encoded form as the reference target for training. One-hot encoding addresses the difficulty classifiers have with discrete data and, to a certain extent, also expands the feature space. Of course, other forms of labels can also be used as the reference target for model training, which is not limited here.
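A small numpy illustration of the one-hot speaker labels assumed in step 303; the speaker indices and counts are made up for the example.

```python
import numpy as np

def one_hot(speaker_ids: np.ndarray, num_speakers: int) -> np.ndarray:
    """Map integer speaker ids, e.g. [0, 2, 1], to one-hot rows."""
    labels = np.zeros((len(speaker_ids), num_speakers), dtype=np.float32)
    labels[np.arange(len(speaker_ids)), speaker_ids] = 1.0
    return labels

# one_hot(np.array([0, 2, 1]), num_speakers=3) ->
# [[1,0,0], [0,0,1], [0,1,0]]
```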
In step 304, the d-vector voiceprint recognition model is trained using the gradient descent method until training is complete.

Gradient descent is a first-order optimization algorithm, commonly also called the method of steepest descent. During training, the gradient descent method can be used to train the model; of course, other methods can also be used to train the DNN neural network model, that is, the d-vector voiceprint recognition model is trained until training converges.
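Steps 301 to 304 could be combined into a training loop along the following lines; this is a hedged sketch assuming the PyTorch model sketched earlier, mini-batches of stacked MFCC frames, and plain SGD with cross-entropy against the speaker labels.

```python
import torch
import torch.nn as nn

def train_dvector_model(model, data_loader, epochs=10, lr=1e-3):
    """data_loader yields (features, speaker_id) mini-batches.

    Cross-entropy against the integer speaker label is equivalent to training
    toward the one-hot targets of step 303; SGD realizes step 304.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, speaker_id in data_loader:
            optimizer.zero_grad()
            logits = model(features)
            loss = criterion(logits, speaker_id)
            loss.backward()
            optimizer.step()
    return model
```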
After training of the d-vector voiceprint recognition model is completed, a voiceprint database also needs to be established based on the trained model, for use in verifying the voice to be recognized in subsequent steps.

Please refer to FIG. 5, which is a flowchart of establishing the voiceprint database. The specific process of establishing the voiceprint database includes the following.

In step 401, the Mel-frequency cepstral coefficients of the target object's voice are input into the d-vector voiceprint recognition model.

It can be understood that the purpose of establishing the voiceprint database is to use its data as the reference target in the subsequent verification steps. When the database is established, it must contain the voice of the target object, so that the completed database contains the d-vector features of the target object.

The Mel-frequency cepstral coefficients of the target object's voice are input into the d-vector voiceprint recognition model, which at this point is a fully trained model. Taking the d-vector voiceprint model diagram in FIG. 3 as an example, after the output layer is removed, the output of the last layer, i.e., hidden layer 4 shown in FIG. 3, is the required d-vector feature.

In step 402, the d-vector features of the target object's voice are obtained through the d-vector voiceprint recognition model.

The Mel-frequency cepstral coefficients of the target object's voice are input, and the d-vector features of the target object's voice are obtained. To reduce the influence of channel differences on the obtained d-vector features, this embodiment uses the WCCN method for channel compensation. The WCCN method scales the subspace to attenuate the dimensions with high within-class variance, and can therefore serve as a channel compensation technique. After WCCN channel compensation, the compensated d-vector feature, i.e., d-vector V_WCCN, is obtained.
First, the within-class variance matrix W needs to be calculated:

$$W = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{n_s}\sum_{i=1}^{n_s}\left(v_i^{s}-\bar{v}_s\right)\left(v_i^{s}-\bar{v}_s\right)^{T}$$

where S represents the target objects, $n_s$ is the number of utterances of target object s, $v_i^{s}$ is the d-vector feature obtained by inputting the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model, and $\bar{v}_s$ is the mean d-vector of target object s. Next, the WCCN matrix $B_1$ is computed using the Cholesky decomposition, where the Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower-triangular matrix L and its transpose. The formula is as follows:

$$W^{-1} = B_1 B_1^{T}$$

Then the d-vector after WCCN channel compensation, $V_{WCCN}$, is:

$$V_{WCCN} = B_1^{T}\, v$$
It should be noted that the WCCN method used for channel compensation in this embodiment is only one channel compensation method; specifically, channel compensation can also be performed according to methods such as LDA, PLDA, and NAP to reduce the influence of channel differences.
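A hedged numpy sketch of the WCCN compensation described above; the grouping of enrollment d-vectors by speaker and the function names are assumptions of the example.

```python
import numpy as np

def wccn_matrix(dvectors_by_speaker):
    """dvectors_by_speaker: list of (n_s, dim) arrays, one per target object.

    Returns B_1 such that W^{-1} = B_1 B_1^T, where W is the within-class
    covariance averaged over speakers.
    """
    dim = dvectors_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    for vs in dvectors_by_speaker:
        centered = vs - vs.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vs)
    W /= len(dvectors_by_speaker)
    return np.linalg.cholesky(np.linalg.inv(W))  # lower-triangular B_1

def apply_wccn(B1, dvector):
    """Project a raw d-vector into the channel-compensated space."""
    return B1.T @ dvector
```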
In step 403, the d-vector features of the target object's voice are averaged to obtain the standard d-vector feature.

The input voice of the target object may consist of multiple speech segments. After the Mel-frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is applied, a d-vector feature corresponding to each segment is generated. Averaging all d-vector features of the target object obtained from the individual segments yields a standard d-vector feature; since a large amount of data is used to generate this d-vector feature, it can be used as the enrolled value in the database.

In step 404, a voiceprint database is established according to the standard d-vector features.

It can be understood that, after each target object's voice passes through the d-vector voiceprint recognition model, a standard d-vector feature corresponding to that target object is generated; these d-vector features are combined into a database for verifying the voice to be recognized in subsequent steps.
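A minimal sketch of steps 403 and 404, assuming the compensated per-utterance d-vectors are already available; the dictionary-as-database structure and speaker names are assumptions for illustration.

```python
import numpy as np

def build_voiceprint_database(compensated_dvectors):
    """compensated_dvectors: {speaker_name: (n_utterances, dim) array}.

    Each speaker's utterance-level d-vectors are averaged into one standard
    d-vector (step 403) and stored as that speaker's database entry (step 404).
    """
    return {name: vecs.mean(axis=0) for name, vecs in compensated_dvectors.items()}

# e.g. database = build_voiceprint_database({"host": host_vecs, "guest": guest_vecs})
```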
Please continue to refer to FIG. 2. In step 206, the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated.

After the voice to be recognized is input into the d-vector voiceprint recognition model, the d-vector feature of the voice to be recognized is generated. In this embodiment, the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated to verify whether the d-vector feature of the voice to be recognized corresponds to a target object's d-vector feature in the database.

It should be noted that, in other embodiments, the voice to be recognized can also be verified in other ways, for example, using Perceptual Linear Predictive (PLP) features together with a Gaussian Mixture Model (GMM) for voiceprint authentication.

In step 207, it is determined whether the cosine distance is less than a threshold.

In this embodiment, a threshold on the cosine distance is used as the criterion for judging whether the d-vector feature of the voice to be identified matches a standard d-vector feature. For example, if the cosine distance threshold is set to 0.5 and the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is 0.2, the d-vector feature of the voice to be recognized matches that standard d-vector feature.
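A small numpy illustration of the scoring in steps 206 and 207; the distance definition (1 minus cosine similarity) and the 0.5 threshold follow the example above, while the database format follows the earlier sketch.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; smaller means more similar."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(dvector, database, threshold=0.5):
    """Return the matching speaker name, or None if no entry is close enough."""
    best_name, best_dist = None, float("inf")
    for name, standard_dvector in database.items():
        dist = cosine_distance(dvector, standard_dvector)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```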
In step 208, the d-vector feature of the voice to be recognized matches a d-vector feature in the voiceprint database.

According to the result of step 207, if the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is less than the threshold, it is determined that the d-vector feature of the voice to be recognized matches the database, and the next step is performed.

In step 209, the d-vector feature of the voice to be recognized does not match any d-vector feature in the voiceprint database.

According to the result of step 207, if the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is not less than the threshold, it is determined that the d-vector feature of the voice to be recognized does not match the database, and no video subtitle synthesis needs to be performed on the voice to be recognized.

In step 210, the voiceprint identifier of the voice to be recognized is obtained according to the d-vector feature, and the voiceprint identifier and the text information are synthesized into subtitles.

After the judgment on the voice to be recognized, the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database. The voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying speaker identity information. When watching the video, viewers can know exactly which sentence was said by which person, which helps them understand the content of the video.
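Step 210 could ultimately emit a standard subtitle file; the sketch below writes SRT entries whose text is prefixed with the speaker name, assuming each recognized segment already carries its time span, matched speaker, and transcribed text (all field names and the SRT choice are assumptions of the example).

```python
def format_timestamp(seconds: float) -> str:
    """SRT timestamps look like 00:01:02,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path):
    """segments: iterable of (start_sec, end_sec, speaker_name, text)."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, speaker, text) in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(f"[{speaker}] {text}\n\n")

# write_srt([(0.0, 2.5, "host", "Welcome to the show.")], "output.srt")
```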
In summary, the embodiment of the present application judges the audio track information in the video, deletes useless audio tracks, and extracts voice information from the audio tracks that require subtitle synthesis. It then judges whether the voice information meets the preset voice information condition; if so, it confirms that the voice information corresponds to a target object and extracts the voice to be recognized of the target object. Before the voice to be recognized is input into the trained d-vector voiceprint recognition model, the d-vector voiceprint recognition model also needs to be trained. After the trained model is obtained, a large amount of the target object's speech is input to obtain the target object's standard d-vector features, and a database is generated from these standard d-vector features. The voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match, meaning the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person. Finally, the text information of the voice to be recognized is obtained, and the text information and the voiceprint identifier generated from the d-vector feature are synthesized to generate video subtitles carrying speaker information. In the embodiment of the present application, filtering the audio tracks and the voice information improves the efficiency of speech recognition; when obtaining the d-vector features, the WCCN method is used for channel compensation, reducing the influence of channel differences. Finally, video subtitle synthesis with speaker information is achieved, which helps viewers understand the content of the video.
In an embodiment, a subtitle synthesis apparatus includes:

a voice acquisition module, configured to acquire voice information in a video and obtain a voice to be recognized according to characteristics of the voice information;

a voiceprint recognition module, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;

a speech recognition module, configured to perform speech recognition on the voice to be recognized to obtain corresponding text information; and

a subtitle synthesis module, configured to synthesize the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

In an embodiment of the subtitle synthesis apparatus, the voice acquisition module includes:

an extraction module, configured to acquire audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information from audio tracks that require subtitle synthesis; and

a first judgment module, configured to judge whether the voice information meets a preset voice information condition and, if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.

In an embodiment, the subtitle synthesis apparatus includes:

a training module, configured to extract the Mel-frequency cepstral coefficients of the voice in the video;

input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

use one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.

In an embodiment of the subtitle synthesis apparatus, the voiceprint recognition module includes:

a database module, configured to input the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtain the d-vector features of the target object's voice through the d-vector voiceprint recognition model;

average the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establish a voiceprint database according to the standard d-vector feature.

In an embodiment of the subtitle synthesis apparatus, the voiceprint recognition module includes:

a second judgment module, configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;

calculate the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

judge whether the cosine distance is less than the threshold; and

if so, determine that the d-vector feature of the voice to be identified matches a d-vector feature value in the voiceprint database.
Please refer to FIG. 6, which is a first schematic structural diagram of a video subtitle synthesis apparatus provided by an embodiment of the present application. The video subtitle synthesis apparatus includes a voice acquisition module 510, a training module 520, a voiceprint recognition module 530, a speech recognition module 540, and a subtitle synthesis module 550.

The voice acquisition module 510 is configured to acquire voice information in a video and obtain a voice to be recognized according to characteristics of the voice information.

Specifically, the voice information acquired by the voice acquisition module 510 contains the speech of character dialogue in the video; the voice information may include the number of words spoken, the frequency of speaking, and information such as the gender and age corresponding to the voice.

For a video requiring subtitle synthesis, the voice information in the video is acquired. The voice information contains the voice information of the target person in the video; through filtering, the target person's voice is extracted, thereby obtaining the voice to be recognized.

Please refer to FIG. 7, which is a second schematic structural diagram of the video subtitle synthesis apparatus according to an embodiment of the present application. The voice acquisition module 510 further includes an extraction module 511 and a first judgment module 512.

The extraction module 511 is configured to acquire audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information from audio tracks that require subtitle synthesis.

An audio-video file contains multiple audio tracks, and each track carries different content; the background music track in the video can be deleted, keeping only the track containing the voice information.

The first judgment module 512 is configured to judge whether the voice information meets the preset voice information condition and, if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.

Specifically, the judgment can be made based on how frequently a speaker speaks in the voice information; in that case the preset voice information condition may be the speaker's speaking frequency. The judgment can also be made based on the speaker's gender, so that subtitle synthesis targets only male or only female voices, or based on age, since children's voices differ from adults' voices and subtitle synthesis can target one of them. In addition, information such as timbre and speaking rate can also serve as the basis for judgment. The preset voice information condition can be adjusted according to actual needs.

The training module 520 is configured to extract the Mel-frequency cepstral coefficients of the target object's voice;

input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

use one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.

Specifically, the training module 520 trains the model with a large amount of speech and uses labels in one-hot encoded form as the reference target for training the d-vector voiceprint recognition model. One-hot encoding addresses the difficulty classifiers have with discrete data and, to a certain extent, also expands the feature space.
The voiceprint recognition module 530 is configured to input the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.

Please refer to FIG. 7, in which the voiceprint recognition module 530 includes a database module 531 and a second judgment module 532.

The database module 531 is configured to input the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtain the d-vector features of the target object's voice through the d-vector voiceprint recognition model;

average the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establish a voiceprint database according to the standard d-vector feature.

Specifically, the input voice of the target object may consist of multiple speech segments. After the Mel-frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is applied, a d-vector feature corresponding to each segment is generated. Averaging all d-vector features of the target object obtained from the individual segments yields a standard d-vector feature; since a large amount of data is used to generate this d-vector feature, it can be used as the enrolled value in the database.

The second judgment module 532 is configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;

calculate the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

judge whether the cosine distance is less than the threshold; and

if so, determine that the d-vector feature of the voice to be identified matches a d-vector feature value in the voiceprint database.

In this embodiment, a threshold on the cosine distance is used as the criterion for judging whether the d-vector feature of the voice to be identified matches a standard d-vector feature. For example, if the cosine distance threshold is set to 0.5 and the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is 0.2, the d-vector feature of the voice to be recognized matches that standard d-vector feature.

The speech recognition module 540 is configured to perform speech recognition on the voice to be recognized to obtain corresponding text information.

The voice to be recognized is input into the speech recognition module, and the text information of the voice to be recognized is obtained through the speech recognition module.

The subtitle synthesis module 550 is configured to synthesize the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

After the judgment on the voice to be recognized, the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database. The voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying speaker identity information. When watching the video, viewers can know exactly which sentence was said by which person, which helps them understand the content of the video.
In summary, the embodiment of the present application judges the audio track information in the video, deletes useless audio tracks, and extracts voice information from the audio tracks that require subtitle synthesis. It then judges whether the voice information meets the preset voice information condition; if so, it confirms that the voice information corresponds to a target object and extracts the voice to be recognized of the target object. Before the voice to be recognized is input into the trained d-vector voiceprint recognition model, the d-vector voiceprint recognition model also needs to be trained. After the trained model is obtained, a large amount of the target object's speech is input to obtain the target object's standard d-vector features, and a database is generated from these standard d-vector features. The voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match, meaning the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person. Finally, the text information of the voice to be recognized is obtained, and the text information and the voiceprint identifier generated from the d-vector feature are synthesized to generate video subtitles carrying speaker information. In the embodiment of the present application, filtering the audio tracks and the voice information improves the efficiency of speech recognition; when obtaining the d-vector features, the WCCN method is used for channel compensation, reducing the influence of channel differences. Finally, video subtitle synthesis with speaker information is achieved, which helps viewers understand the content of the video.
In the embodiments of this application, the video subtitle synthesis apparatus and the video subtitle synthesis method of the above embodiments belong to the same concept. Any method provided in the embodiments of the video subtitle synthesis method can be run on the video subtitle synthesis apparatus; for the specific implementation process, refer to the embodiments of the video subtitle synthesis method, which will not be repeated here.

The term "module" used herein may be regarded as a software object executed on the computing system. The different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The apparatus and method described herein may be implemented in software, and may of course also be implemented in hardware, both of which fall within the protection scope of the present application.
To this end, an embodiment of the present application provides a storage medium in which a plurality of instructions are stored; the instructions can be loaded by a processor to perform the steps in any of the video subtitle synthesis methods provided by the embodiments of the present application. For example, the instructions can perform the following steps:

acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;

inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;

performing speech recognition on the voice to be recognized to obtain corresponding text information; and

synthesizing the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

For the specific implementation of each of the above operations, refer to the previous embodiments, which will not be repeated here.

The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or the like.

Since the instructions stored in the storage medium can perform the steps in any of the video subtitle synthesis methods provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any of those methods; for details, refer to the previous embodiments, which will not be repeated here.
本申请实施例还提供一种电子设备,如平板电脑、手机等电子设备。电子设备中的处理器会按照如下的步骤,将一个或一个以上的应用程序的进程对应的指令加载到存储器中,并由处理器来运行存储在存储器中的应用程序,从而实现各种功能:The embodiment of the present application also provides an electronic device, such as a tablet computer, a mobile phone, and other electronic devices. The processor in the electronic device will load the instructions corresponding to the process of one or more application programs into the memory according to the following steps, and the processor will run the application programs stored in the memory to implement various functions:
获取视频当中的语音信息,根据所述语音信息的特征得到待识别语音;Acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information;
将所述待识别语音输入至d-vector声纹识别模型,以得到所述待识别语音所对应的声纹标识,所述声纹标识包含d-vector特征;Inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including d-vector features;
对所述待识别语音进行语音识别以得到对应的文本信息;Performing voice recognition on the to-be-recognized voice to obtain corresponding text information;
将所述声纹标识和文本信息进行合成,以生成所述待识别语音的字幕。The voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
在一实施例中,在获取视频当中的语音信息,根据语音信息的特征得到待识别语音之前,处理器用于执行以下步骤:In one embodiment, before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
获取视频中的音轨信息;Obtain the audio track information in the video;
删除无需字幕合成的音轨,提取需要字幕合成的音轨中的语音信息。Delete the audio tracks that do not require subtitle synthesis, and extract the voice information in the audio tracks that require subtitle synthesis.
In one embodiment, when acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
Determining whether the voice information meets a preset voice information condition;
If so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
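The text does not fix what the preset voice information condition is. Purely as an illustration, the sketch below treats a segment as target-object speech when it is long enough and loud enough; both thresholds are assumptions of this sketch.

import numpy as np

def meets_preset_condition(waveform, sr, min_duration=0.5, energy_threshold=1e-3):
    # Hypothetical stand-in for the "preset voice information condition":
    # keep segments that are at least min_duration seconds long and whose
    # mean energy exceeds energy_threshold.
    duration = len(waveform) / sr
    energy = float(np.mean(np.square(waveform)))
    return duration >= min_duration and energy >= energy_threshold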
In one embodiment, before the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
Extracting Mel-frequency cepstral coefficients (MFCCs) of the voice in the video;
Inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model;
Using one-hot encoded labels as the training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
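A compressed sketch of this training setup is given below, assuming librosa for MFCC extraction and PyTorch for the frame-level network; the sampling rate, layer sizes, and number of speakers are illustrative choices rather than values taken from the text. The averaged activations of the last hidden layer later serve as the d-vector.

import librosa
import torch
import torch.nn as nn

def mfcc_frames(wav_path, n_mfcc=40):
    # One Mel-frequency cepstral coefficient vector per frame.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)           # (frames, n_mfcc)

class DVectorNet(nn.Module):
    # Frame-level DNN trained as a speaker classifier; the last hidden layer
    # provides the frame embeddings used to build d-vectors.
    def __init__(self, n_mfcc=40, hidden=256, n_speakers=100):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_mfcc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, frames):
        h = self.hidden(frames)
        return self.classifier(h), h           # speaker logits, frame embeddings

def train_step(model, optimizer, frames, speaker_id):
    # Cross-entropy against integer class indices is equivalent to using the
    # one-hot encoded speaker label as the training reference target.
    logits, _ = model(frames)
    target = torch.full((frames.shape[0],), speaker_id, dtype=torch.long)
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()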
In one embodiment, when the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
Inputting the Mel-frequency cepstral coefficients of the target object's voice into the trained d-vector voiceprint recognition model;
Acquiring the d-vector features of the target object's voice through the d-vector voiceprint recognition model;
Averaging the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice;
Establishing a voiceprint database according to the standard d-vector feature.
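One way to realize this enrollment step, reusing the mfcc_frames helper and the frame-level network from the training sketch above, is to average utterance-level embeddings per speaker and store the result as that speaker's standard d-vector:

import numpy as np
import torch

def enroll_speaker(model, enrollment_wavs, speaker_name, voiceprint_db):
    # Average per-utterance d-vectors into one standard d-vector for the speaker.
    dvectors = []
    for wav_path in enrollment_wavs:
        frames = mfcc_frames(wav_path)                 # helper from the training sketch
        with torch.no_grad():
            _, frame_emb = model(frames)
        # Utterance-level d-vector: mean of the frame-level embeddings.
        dvectors.append(frame_emb.mean(dim=0).numpy())
    standard_dvector = np.mean(dvectors, axis=0)
    standard_dvector /= np.linalg.norm(standard_dvector)   # length-normalize for cosine scoring
    voiceprint_db[speaker_name] = standard_dvector          # database kept as a plain dict here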
In one embodiment, after the d-vector features of the target object's voice are acquired through the d-vector voiceprint recognition model, the processor is configured to perform the following step:
Performing channel compensation on the d-vector features of the target object's voice by using the WCCN method.
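WCCN (within-class covariance normalization) whitens within-speaker variability so that cosine scoring is less sensitive to channel effects. A numpy sketch is given below; the ridge term and the dictionary layout of the enrollment data are implementation choices of this sketch, not requirements of the text.

import numpy as np

def wccn_projection(dvectors_by_speaker):
    # dvectors_by_speaker: {speaker: array of shape (n_utterances, dim)}.
    # Returns a matrix B such that compensated vectors are B.T @ x.
    dim = next(iter(dvectors_by_speaker.values())).shape[1]
    W = np.zeros((dim, dim))
    for vectors in dvectors_by_speaker.values():
        centered = vectors - vectors.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vectors)
    W /= len(dvectors_by_speaker)
    W += 1e-6 * np.eye(dim)            # small ridge for numerical stability (sketch choice)
    # B is the Cholesky factor of the inverse within-class covariance.
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B

# Usage: compensated_dvector = wccn_projection(enrollment_data).T @ dvector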
In one embodiment, before the voiceprint identifier corresponding to the voice is obtained, the processor is configured to perform the following steps:
Inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
Calculating the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
Determining whether the cosine distance is less than a threshold;
If so, determining that the d-vector feature of the voice to be identified matches the d-vector feature in the voiceprint database.
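Scoring against the voiceprint database then reduces to a nearest-neighbour search under cosine distance. The sketch below returns the matched speaker when the distance falls below the threshold and None otherwise; the 0.4 default threshold is an assumption of this sketch, not a value given in the text.

import numpy as np

def match_voiceprint(dvector, voiceprint_db, threshold=0.4):
    # Compare the d-vector of the voice to be identified against every
    # standard d-vector in the database; accept the closest one only if its
    # cosine distance is below the threshold.
    best_speaker, best_distance = None, float("inf")
    for speaker, standard in voiceprint_db.items():
        cosine_similarity = float(np.dot(dvector, standard) /
                                  (np.linalg.norm(dvector) * np.linalg.norm(standard)))
        distance = 1.0 - cosine_similarity
        if distance < best_distance:
            best_speaker, best_distance = speaker, distance
    return best_speaker if best_distance < threshold else None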
Please refer to FIG. 8, which is a schematic structural diagram of an electronic device for video subtitle synthesis according to an embodiment of the present application.
The electronic device 700 includes a processor 701, a memory 702, a display 703, a radio frequency circuit 704, an audio module 705, and a power supply 706.
The processor 701 is the control center of the electronic device 700. It connects the various parts of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 700 and processes data by running or loading computer programs stored in the memory 702 and invoking data stored in the memory 702, thereby monitoring the electronic device 700 as a whole.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the computer programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, a computer program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 702 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Correspondingly, the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
In the embodiments of the present application, the processor 701 in the electronic device 700 loads instructions corresponding to the processes of one or more computer programs into the memory 702 according to the following steps, and runs the computer programs stored in the memory 702 to implement various functions as follows:
Acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
Inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;
Performing speech recognition on the voice to be recognized to obtain corresponding text information;
Synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
The display 703 may be used to display information input by the user or information provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof. The display 703 may include a display panel. In some embodiments, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The radio frequency circuit 704 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with the network devices or other electronic devices.
The audio module 705 includes dual speakers and an audio circuit. The audio circuit may transmit an electrical signal converted from received audio data to the dual speakers, which convert it into a sound signal for output. Conversely, a microphone converts a collected sound signal into an electrical signal, which the audio circuit receives and converts into audio data; the audio data is then output to the processor 701 for processing and sent, for example, to another terminal through the radio frequency circuit 704, or the audio data is output to the memory 702 for further processing. The audio circuit may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
The power supply 706 may be used to supply power to the various components of the electronic device 700. In some embodiments, the power supply 706 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown in FIG. 8, the electronic device 700 may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
In the embodiments of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for the video subtitle synthesis method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the video subtitle synthesis method of the embodiments of the present application may be completed by controlling relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the video subtitle synthesis method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the video subtitle synthesis apparatus of the embodiments of the present application, its functional modules may be integrated into one processing chip, or each module may exist physically alone, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The video subtitle synthesis method, apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method and core ideas of the present application. At the same time, those skilled in the art may make changes to the specific implementations and the scope of application according to the ideas of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (20)

  1. A video subtitle synthesis method, applied to an electronic device, wherein the method comprises:
    acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
    inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    performing speech recognition on the voice to be recognized to obtain corresponding text information; and
    synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  2. The video subtitle synthesis method according to claim 1, wherein before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further comprises:
    acquiring audio track information in the video; and
    deleting the audio tracks that do not require video subtitle synthesis, to obtain the voice information in the audio tracks that require video subtitle synthesis.
  3. The video subtitle synthesis method according to claim 1, wherein acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information comprises:
    determining whether the voice information meets a preset voice information condition; and
    if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
  4. The video subtitle synthesis method according to claim 3, wherein before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further comprises:
    extracting Mel-frequency cepstral coefficients of the voice in the video;
    inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    using one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
  5. The video subtitle synthesis method according to claim 4, wherein inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature, comprises:
    inputting the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquiring the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    averaging the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establishing a voiceprint database according to the standard d-vector feature.
  6. The video subtitle synthesis method according to claim 5, wherein after the d-vector features of the voice of the target object are acquired through the d-vector voiceprint recognition model, the method further comprises:
    performing channel compensation on the d-vector features of the voice of the target object by using the WCCN method.
  7. The video subtitle synthesis method according to claim 5, wherein before the voiceprint identifier corresponding to the voice is obtained, the method further comprises:
    inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculating a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determining whether the cosine distance is less than a threshold; and
    if so, determining that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  8. A video subtitle synthesis apparatus, applied to an electronic device, comprising:
    a voice acquisition module, configured to acquire voice information in a video, and obtain a voice to be recognized according to characteristics of the voice information;
    a voiceprint recognition module, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    a speech recognition module, configured to perform speech recognition on the voice to be recognized to obtain corresponding text information; and
    a subtitle synthesis module, configured to synthesize the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  9. The video subtitle synthesis apparatus according to claim 8, wherein the apparatus further comprises:
    a training module, configured to extract Mel-frequency cepstral coefficients of the voice in the video;
    input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    use one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.
  10. The video subtitle synthesis apparatus according to claim 8, wherein the voice acquisition module comprises:
    an extraction module, configured to acquire audio track information in the video, delete the audio tracks that do not require video subtitle synthesis, and obtain the voice information in the audio tracks that require video subtitle synthesis; and
    a first determination module, configured to determine whether the voice information meets a preset voice information condition, and if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.
  11. The video subtitle synthesis apparatus according to claim 8, wherein the voiceprint recognition module comprises:
    a database module, configured to input the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquire the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    average the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establish a voiceprint database according to the standard d-vector feature.
  12. The video subtitle synthesis apparatus according to claim 8, wherein the voiceprint recognition module comprises:
    a second determination module, configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculate a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determine whether the cosine distance is less than a threshold; and
    if so, determine that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  13. A storage medium having a computer program stored thereon, wherein, when the computer program is executed on a computer, the computer is caused to perform the method according to any one of claims 1 to 7.
  14. An electronic device for video subtitle synthesis, comprising a processor and a memory, wherein the processor, by invoking a computer program in the memory, is configured to perform:
    acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
    inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    performing speech recognition on the voice to be recognized to obtain corresponding text information; and
    synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  15. The electronic device for video subtitle synthesis according to claim 14, wherein before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform:
    acquiring audio track information in the video; and
    deleting the audio tracks that do not require video subtitle synthesis, to obtain the voice information in the audio tracks that require video subtitle synthesis.
  16. The electronic device for video subtitle synthesis according to claim 14, wherein, when acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform:
    determining whether the voice information meets a preset voice information condition; and
    if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
  17. The electronic device for video subtitle synthesis according to claim 16, wherein before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform:
    extracting Mel-frequency cepstral coefficients of the voice in the video;
    inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    using one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
  18. The electronic device for video subtitle synthesis according to claim 17, wherein, when inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform:
    inputting the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquiring the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    averaging the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establishing a voiceprint database according to the standard d-vector feature.
  19. The electronic device for video subtitle synthesis according to claim 18, wherein after the d-vector features of the voice of the target object are acquired through the d-vector voiceprint recognition model, the processor is configured to perform:
    performing channel compensation on the d-vector features of the voice of the target object by using the WCCN method.
  20. The electronic device for video subtitle synthesis according to claim 18, wherein before the voiceprint identifier corresponding to the voice is obtained, the processor is configured to perform:
    inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculating a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determining whether the cosine distance is less than a threshold; and
    if so, determining that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
PCT/CN2019/073770 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device WO2020154916A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980076343.7A CN113056908B (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and device, storage medium and electronic equipment
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2020154916A1 true WO2020154916A1 (en) 2020-08-06

Family

ID=71840280

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113056908B (en)
WO (1) WO2020154916A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620310B (en) * 2022-11-30 2023-05-09 杭州网易云音乐科技有限公司 Image recognition method, model training method, medium, device and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123115A (en) * 2014-07-28 2014-10-29 联想(北京)有限公司 Audio information processing method and electronic device
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN107911646A (en) * 2016-09-30 2018-04-13 阿里巴巴集团控股有限公司 The method and device of minutes is shared, is generated in a kind of meeting
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3739883B1 (en) * 2010-05-04 2022-11-16 LG Electronics Inc. Method and apparatus for encoding and decoding a video signal
WO2017048008A1 (en) * 2015-09-17 2017-03-23 엘지전자 주식회사 Inter-prediction method and apparatus in video coding system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
CN104123115A (en) * 2014-07-28 2014-10-29 联想(北京)有限公司 Audio information processing method and electronic device
CN107911646A (en) * 2016-09-30 2018-04-13 阿里巴巴集团控股有限公司 The method and device of minutes is shared, is generated in a kind of meeting
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU, MINGHUI: "Master Thesis", RESEARCH ON TEXT-INDEPENDENT SPEAKER VERIFICATION BASED ON DEEP LEARNING, 31 May 2016 (2016-05-31), CN, pages 1 - 91, XP009522417 *

Also Published As

Publication number Publication date
CN113056908A (en) 2021-06-29
CN113056908B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
WO2020177190A1 (en) Processing method, apparatus and device
CN110853618B (en) Language identification method, model training method, device and equipment
CN107481720B (en) Explicit voiceprint recognition method and device
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
WO2021004481A1 (en) Media files recommending method and device
JP2019212308A (en) Video service providing method and service server using the same
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN103024530A (en) Intelligent television voice response system and method
US10277834B2 (en) Suggestion of visual effects based on detected sound patterns
KR20200027331A (en) Voice synthesis device
WO2019242402A1 (en) Speech recognition model generation method and apparatus, and storage medium and electronic device
KR20190005103A (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
CN109346057A (en) A kind of speech processing system of intelligence toy for children
CN110310642A (en) Method of speech processing, system, client, equipment and storage medium
US20230298628A1 (en) Video editing method and apparatus, computer device, and storage medium
Chakraborty et al. Literature Survey
CN114095782A (en) Video processing method and device, computer equipment and storage medium
CN112581965A (en) Transcription method, device, recording pen and storage medium
WO2020154916A1 (en) Video subtitle synthesis method and apparatus, storage medium, and electronic device
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
CN115798459A (en) Audio processing method and device, storage medium and electronic equipment
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
WO2023206928A1 (en) Speech processing method and apparatus, computer device, and computer-readable storage medium
WO2020154883A1 (en) Speech information processing method and apparatus, and storage medium and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19913085

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19913085

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.09.2021)