WO2020154916A1 - Video subtitle synthesis method and apparatus, storage medium, and electronic device


Info

Publication number: WO2020154916A1
Authority: WO, WIPO (PCT)
Prior art keywords: voice, vector, recognized, video, voiceprint
Application number: PCT/CN2019/073770
Other languages: French (fr), Chinese (zh)
Inventor: 叶青
Original Assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to CN201980076343.7A (CN113056908B)
Priority to PCT/CN2019/073770
Publication of WO2020154916A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/18 - Artificial neural networks; Connectionist approaches
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 - Details of television systems
    • H04N 5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/278 - Subtitling

Definitions

  • This application belongs to the technical field of video production, and in particular relates to a video caption synthesis method, device, storage medium and electronic equipment.
  • Video has also become a main medium of information transmission. Video can convey sound and picture information, but when the language in a video differs from that of the audience, video subtitles are used to convey the information.
  • This embodiment of the application provides a video caption synthesis method, device, storage medium, and electronic equipment, which can add speaker information and voice content to the video.
  • this embodiment of the present application provides a subtitle synthesis method applied to an electronic device, and the method includes:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • an embodiment of the present application provides a video caption synthesis device applied to an electronic device, and the device includes:
  • a voice acquisition device for acquiring voice information in a video, and obtaining a voice to be recognized according to the characteristics of the voice information
  • a voiceprint recognition device for inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
  • a voice recognition device for performing voice recognition on the voice to be recognized to obtain corresponding text information
  • the caption synthesis device is used to synthesize the voiceprint identification and text information to generate the caption of the voice to be recognized.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein when the computer program is executed on a computer, the computer is caused to execute the video subtitle synthesis method provided in the first aspect of the embodiments.
  • an embodiment of the present application provides an electronic device for video caption synthesis, including a processor and a memory, wherein the processor is configured to execute: by calling a computer program in the memory:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the embodiment of this application can use the d-vector voiceprint recognition model to recognize the speaker in the video, obtain the speaker's voiceprint identification, and finally synthesize the voiceprint identification and text information to generate subtitles with speaker information.
  • FIG. 1 is a schematic diagram of the first flow of a method for synthesizing video captions according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a second flow of a method for synthesizing video captions provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided by an embodiment of the present application.
  • Fig. 4 is a training flowchart of a d-vector voiceprint recognition model provided by an embodiment of the present application.
  • Fig. 5 is a flowchart of establishing a voiceprint database of a method for synthesizing video captions provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a second structure of a video caption synthesis device provided by an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
  • the computer execution referred to herein includes operations by a computer processing unit on electronic signals representing data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art.
  • the data structures in which the data is maintained are physical locations in memory that have particular properties defined by the data format.
  • those skilled in the art will understand that the various steps and operations described below can also be implemented in hardware.
  • a d-vector voiceprint recognition model is added on the basis of speech synthesis.
  • during subtitle synthesis, the speaker's voiceprint can be recognized and the speaker's information added to each segment of that speaker's subtitles, so that the audience can know the identity of the speaker when they see the subtitles.
  • a method for subtitle synthesis is applied to an electronic device, and the method includes:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further includes:
  • acquiring voice information in a video, and obtaining the voice to be recognized according to the characteristics of the voice information includes:
  • before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further includes:
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the voice to be recognized is input into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including d-vector features; the method further includes:
  • a voiceprint database is established according to the standard d-vector features.
  • the method further includes:
  • the WCCN method is used to perform channel compensation on the d-vector feature of the target object's voice.
  • before obtaining the voiceprint identifier corresponding to the voice, the method further includes:
  • the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  • FIG. 1 is a schematic diagram of a first flow of a method for synthesizing video captions according to an embodiment of the present application. This method is suitable for electronic devices such as computers, mobile phones, and tablets.
  • the video caption synthesis method may include:
  • step 101 the voice information in the video is obtained, and the voice to be recognized is obtained according to the characteristics of the voice information.
  • the sound information and picture information are included in the video, and the background music and the voice of the character dialogue in the video are included in the sound information.
  • the voice information of the character dialogue in the video is obtained.
  • the voice information can include the number of words spoken, the frequency of speaking, and the gender and age corresponding to the voice.
  • the voice information in the video is obtained, and the voice information contains the voice information of the target person in the video, and the voice of the target person is extracted through filtering, so as to obtain the voice to be recognized.
  • step 102 the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.
  • a voiceprint is the spectrum of a sound wave carrying speech information, as displayed by an electro-acoustic instrument. A voiceprint is not only specific but also relatively stable. Each person's voiceprint is unique, and no two speakers have the same voiceprint.
  • the speaker's voiceprint can therefore be identified through the voiceprint recognition model, and the speaker's voiceprint corresponds to the identity of the speaker.
  • the voice to be recognized is input into the d-vector voiceprint recognition model, and the d-vector feature of the voice to be recognized is compared with the standard d-vector features in the database.
  • if the d-vector feature is the same as a standard d-vector feature in the database, the speaker identity of the voice to be recognized is determined, and a voiceprint identifier is then generated according to the d-vector feature of the voice to be recognized, to be used for video subtitle synthesis in the subsequent steps.
  • step 103 speech recognition is performed on the speech to be recognized to obtain corresponding text information.
  • the voice recognition referred to in this application recognizes the speaker's voice to obtain the text information of the speaker's speech.
  • the voice recognition function can recognize the text information of the voice, but cannot obtain the voiceprint information of the voice.
  • the voice to be recognized is input into the voice recognition model to obtain the text information of the voice to be recognized.
  • the text information only corresponds to each segment of the voice; it is impossible to determine which speaker each segment belongs to.
  • step 104 the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the text of the voice to be recognized and the voiceprint identifier of the voice to be recognized are synthesized to generate a subtitle containing the identity information of the speaker.
  • the audience can better understand the video content based on the speaker's identity information and the speaker's speech content.
  • the embodiment of the application obtains the voice information in the video, obtains the voice to be recognized according to the voice information, and inputs the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identification of the voice to be recognized. Voice recognition is then performed on the voice to be recognized to obtain its text information, and finally the voiceprint identification and text information of the voice to be recognized are synthesized to generate subtitles with speaker identity information.
  • the embodiment of the application can add speaker's identity information to a video with a large number of speakers to help the audience understand the content in the video.
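  • The four steps above can be expressed as a single processing pipeline. The following is a minimal sketch of that flow; the Segment type and the two callables standing in for the voiceprint recognition and speech recognition components are illustrative assumptions, not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float   # start time of one utterance, in seconds
    end: float     # end time, in seconds
    audio: bytes   # raw audio of the voice to be recognized (step 101)

def synthesize_subtitles(segments: List[Segment],
                         recognize_speaker: Callable[[bytes], str],
                         transcribe: Callable[[bytes], str]) -> List[str]:
    """Steps 102-104: voiceprint identifier + text -> subtitle lines."""
    lines = []
    for seg in segments:
        speaker = recognize_speaker(seg.audio)   # step 102: d-vector voiceprint identifier
        text = transcribe(seg.audio)             # step 103: speech recognition -> text
        # step 104: synthesize the voiceprint identifier and text into one subtitle line
        lines.append(f"[{seg.start:.2f}-{seg.end:.2f}] {speaker}: {text}")
    return lines
```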
  • FIG. 2 is a schematic diagram of a second process of a method for synthesizing video captions according to an embodiment of the present application.
  • This method is suitable for electronic devices such as computers, mobile phones, and tablets.
  • the video caption synthesis method may include:
  • step 201 the audio track information of the video is obtained, and the voice information in the audio track that needs subtitle synthesis is extracted.
  • there are multiple audio tracks in audio and video files, and the content of each audio track is different.
  • the songs we hear in daily life are recorded using multi-track technology: the singer records a separate track when singing, the band records a separate track when playing, and other background sounds can also be recorded on separate tracks. Likewise, a video file may contain multiple audio tracks.
  • Audio track A represents the sound track of background music
  • audio track B represents the sound track of human voice
  • audio track C represents the sound track of animal sounds.
  • the only audio track that really needs subtitle annotation is audio track B; the remaining audio tracks do not need subtitle annotation. In this case, audio track B, which requires subtitles, can be extracted.
  • this embodiment can delete audio tracks for videos with multiple audio tracks to reduce the workload of video subtitle synthesis and improve the efficiency of video subtitle synthesis.
  • the speaker’s voice can be recognized through subsequent steps, and the subtitles with speaker information are finally synthesized.
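  • The patent does not name a tool for isolating the dialogue track; purely as an illustration, the sketch below assumes ffmpeg is available and uses its stream-mapping option to keep one audio track (track B in the example above) while discarding the video stream and the other tracks.

```python
import subprocess

def extract_voice_track(video_path: str, track_index: int, out_wav: str) -> str:
    """Pull one audio track out of a multi-track video file and convert it to
    16 kHz mono WAV, ready for voiceprint and speech recognition."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-map", f"0:a:{track_index}",   # keep only the audio track that needs subtitles
         "-vn",                          # drop the video stream
         "-ac", "1", "-ar", "16000",     # mono, 16 kHz
         out_wav],
        check=True)
    return out_wav

# e.g. keep audio track 1 (the human-voice track) and ignore music/effect tracks:
# extract_voice_track("movie.mp4", 1, "dialogue.wav")
```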
  • step 202 it is determined whether the voice information meets a preset voice information condition.
  • the preset voice information condition can be the speaker's speaking frequency; it can also be judged by the speaker's gender in the voice information, performing subtitle synthesis only for male or only for female voices; it can also be judged by age, since children's voices differ from adults' voices, and subtitle synthesis can be performed for one of them; in addition, information such as timbre and speech speed can also serve as the judgment basis. The preset voice information conditions can be adjusted according to actual needs.
  • if the voice information in the audio track meets the preset voice information condition, step 203 is entered; if it does not, step 204 is entered.
  • in step 203, it is confirmed that the voice information corresponds to the target object, and the to-be-recognized voice of the target object is extracted.
  • if in step 202 it is determined that the voice information meets the preset voice information conditions, it means that the voice information needs to be used for video caption synthesis.
  • the voice information corresponds to the target object, i.e. the speaker for whom subtitle synthesis is needed; the to-be-recognized voice of the target object is then extracted, and the method proceeds to the next step.
  • the target object may include multiple objects, and the voice information may include multiple voice subjects.
  • the voice subjects correspond to the target objects.
  • in step 204, no subtitle synthesis is performed on the voice.
  • if the voice information does not meet the preset voice information conditions, it means that it is not voice information that needs to be processed, and there is no need to perform subtitle synthesis on the voice.
  • step 205 the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be recognized.
  • the d-vector voiceprint recognition model used at this time has been trained.
  • the d-vector voiceprint recognition model adopts a DNN neural network structure. Using the different hidden layers of the DNN neural network, it can effectively abstract from low-level features to high-level features and extract the voiceprint information in speech.
  • FIG. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided in an embodiment of the present application.
  • the d-vector voiceprint model shown in the figure is a 4-layer DNN neural network model.
  • the model shown in the figure is only one of many situations and does not limit the application.
  • the DNN neural network model in the embodiment of the application includes an input layer, four hidden layers, and an output layer.
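  • The network of Fig. 3 only fixes the overall shape: an input layer, four hidden layers, and an output layer used during training. A minimal PyTorch sketch is given below; the layer widths, the ReLU activations, and the number of training speakers are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Input layer, four hidden layers, and an output layer over the training speakers."""
    def __init__(self, input_dim: int = 40, hidden_dim: int = 256, num_speakers: int = 100):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(4):                       # four hidden layers, as in Fig. 3
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(hidden_dim, num_speakers)   # only used during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(x))       # speaker scores for training

    def d_vector(self, x: torch.Tensor) -> torch.Tensor:
        # at enrollment/recognition time the output layer is removed and the
        # activation of the last hidden layer (hidden layer 4) is the d-vector
        return self.hidden(x)
```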
  • Fig. 4 is the training flowchart of the d-vector voiceprint recognition model provided by the embodiment of the application.
  • the training steps of the d-vector voiceprint recognition model specifically include:
  • step 301 the Mel frequency cepstrum coefficients of the speech in the video are extracted.
  • the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear Mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum.
  • the Mel frequency cepstral coefficients of the voice in the video are extracted and used as the input for training the d-vector voiceprint recognition model.
  • the voice of the target object can be the voice in the video.
  • the voice can also be the voice of the target object from other sources.
  • a large amount of voice data is used as the input data for training the d-vector voiceprint recognition model; after the Mel frequency cepstral coefficients of the voice data are extracted, the method proceeds to the next step.
  • the d-vector voiceprint recognition model that needs to be trained in this embodiment is the untrained DNN neural network model.
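  • The MFCC front end is not tied to a particular library by the patent. A minimal sketch assuming librosa, with the 16 kHz sampling rate and the number of coefficients chosen as illustrative defaults:

```python
import librosa
import numpy as np

def mfcc_frames(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of Mel-frequency cepstral coefficients
    that can be fed into the d-vector voiceprint recognition model."""
    y, sr = librosa.load(wav_path, sr=16000)                 # load and resample to 16 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, num_frames)
    return mfcc.T                                            # one row per frame
```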
  • step 302 the Mel cepstrum coefficients are input to the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model at this time is a model that has not been trained. In this step, it is only used as a training model, and the Mel frequency cepstral coefficient is input to the d-vector voiceprint recognition model for training.
  • in step 303, the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model is trained by using labels in the form of one-hot encoding as the reference target for training.
  • one-hot encoding solves the problem that classifiers cannot handle discrete data well, and to a certain extent it also plays a role in expanding the features.
  • other forms of tags can also be used as the reference target for model training, which is not limited here.
  • step 304 the d-vector voiceprint recognition model is trained by the gradient descent method until the model training is completed.
  • Gradient descent is a first-order optimization algorithm, usually called the steepest descent method.
  • the gradient descent training method can be used to train the model.
  • other methods can also be used to train the DNN neural network.
  • the d-vector voiceprint recognition model is trained in this way until training converges.
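  • As a concrete, purely illustrative reading of steps 302 to 304, the loop below trains the DNN with stochastic gradient descent; the integer speaker index passed to CrossEntropyLoss plays the role of the one-hot encoded label, and the data loader, epoch count, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

def train_dvector_model(model: nn.Module, loader, num_epochs: int = 10, lr: float = 0.01):
    """Steps 302-304: feed MFCC frames in, use the speaker label as the reference
    target, and optimize with gradient descent until training converges."""
    criterion = nn.CrossEntropyLoss()              # equivalent to a one-hot target per frame
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for mfcc_batch, speaker_idx in loader:     # speaker_idx encodes the one-hot label
            optimizer.zero_grad()
            loss = criterion(model(mfcc_batch), speaker_idx)
            loss.backward()
            optimizer.step()                       # one gradient descent step
    return model
```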
  • a voiceprint database needs to be established according to the trained model; it is used for the verification of the voice to be recognized in the subsequent steps.
  • Figure 5 is a flowchart of establishing a voiceprint database.
  • the specific process of establishing a voiceprint database includes:
  • step 401 the Mel frequency cepstrum coefficient of the voice of the target object is input into the d-vector voiceprint recognition model.
  • the purpose of establishing the voiceprint database is to use the data in the voiceprint database as the reference target in the subsequent verification steps.
  • the voice of the target object must be included, since the voiceprint database will contain the d-vector features of the target object.
  • the Mel frequency cepstral coefficients of the target object's voice are input as the input values to the d-vector voiceprint recognition model.
  • the d-vector voiceprint recognition model here is the trained model. Taking the d-vector voiceprint model diagram as an example, after the output layer is removed, the output of the last hidden layer, i.e. hidden layer 4 shown in Figure 3, is the required d-vector feature.
  • step 402 the d-vector feature of the target object's voice is obtained through the d-vector voiceprint recognition model.
  • the WCCN (within-class covariance normalization) method is used in this embodiment to perform channel compensation.
  • the WCCN method scales the subspace to attenuate the dimensions with high within-class variance, so it can be used as a channel compensation technique.
  • the compensated d-vector feature, i.e. d-vector V_WCCN, is obtained.
  • S represents the target object
  • the Mel frequency cepstral coefficients of the target speech are input into the d-vector voiceprint recognition model to obtain the d-vector features, and Cholesky decomposition is then used to compute the WCCN matrix B1, where Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower triangular matrix L and its transpose; a standard form of this computation is sketched below.
  • the WCCN method is only one of the channel compensation methods; channel compensation can also be performed with methods such as LDA, PLDA, or NAP to reduce the impact of channel differences.
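  • The referenced formula is not reproduced in this extract, so the sketch below follows the common WCCN convention: average the within-class (per target object) covariance W of the enrollment d-vectors over the S target objects, take the Cholesky factor B of W^-1, and project each d-vector by B^T to obtain V_WCCN. Treat this as an assumption-labelled illustration rather than the patent's exact matrix B1.

```python
import numpy as np

def wccn_matrix(dvectors_by_speaker):
    """dvectors_by_speaker: list of (n_s, dim) arrays, one per target object.
    Returns B, the lower-triangular Cholesky factor of W^-1 (so W^-1 = B @ B.T)."""
    dim = dvectors_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    for vecs in dvectors_by_speaker:
        centered = vecs - vecs.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vecs)     # within-class covariance of one object
    W /= len(dvectors_by_speaker)                  # average over the S target objects
    return np.linalg.cholesky(np.linalg.inv(W))

def apply_wccn(B: np.ndarray, d_vector: np.ndarray) -> np.ndarray:
    return B.T @ d_vector                          # channel-compensated d-vector V_WCCN
```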
  • step 403 the d-vector features of the voice of the target object are averaged to obtain standard d-vector features.
  • the input voice of the target object can consist of multiple segments of speech. After the Mel frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is performed, a corresponding d-vector feature is generated for each segment of speech.
  • all the d-vector features obtained from the target object's speech segments are averaged to obtain a standard d-vector feature. Since a large amount of data is used to generate it, it can serve as the reference value in the database.
  • step 404 a voiceprint database is established according to standard d-vector features.
  • the standard d-vector features corresponding to each target object are generated, and these d-vector features are combined into a database used for verification of the voice to be recognized in the subsequent steps.
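  • Steps 403 and 404 amount to averaging the compensated d-vectors of each target object and storing the result under that object's identifier. A minimal sketch (the dictionary used as the database and the example labels are illustrative assumptions):

```python
import numpy as np

def build_voiceprint_database(dvectors_by_speaker: dict) -> dict:
    """dvectors_by_speaker maps a speaker identifier to a (n_segments, dim) array of
    channel-compensated d-vectors; the standard d-vector is their mean."""
    return {speaker: vecs.mean(axis=0) for speaker, vecs in dvectors_by_speaker.items()}

# e.g. database = build_voiceprint_database({"speaker_A": vecs_a, "speaker_B": vecs_b})
```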
  • step 206 the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated.
  • the d-vector feature of the voice to be recognized is generated.
  • the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated to verify whether the d-vector feature of the voice to be recognized is a d-vector feature of a target object in the voiceprint database.
  • PLP (Perceptual Linear Predictive)
  • GMM (Gaussian Mixture Model)
  • step 207 it is determined whether the cosine distance is less than a threshold.
  • the threshold value of the cosine distance is set as a standard to determine whether the d-vector feature of the voice to be identified matches the standard d-vector feature.
  • the threshold value of the cosine distance is set to 0.5.
  • if the cosine distance between the d-vector feature and the standard d-vector feature is 0.2, it means that the d-vector feature of the voice to be recognized matches the standard d-vector feature.
  • in step 208, it is determined that the d-vector feature of the voice to be recognized is a d-vector feature in the voiceprint database.
  • in step 207, when the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is less than the threshold, it is determined that the d-vector feature of the voice to be recognized matches the database, and the next step is performed.
  • in step 209, it is determined that the d-vector feature of the voice to be recognized is not a d-vector feature in the voiceprint database.
  • in step 207, if the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is not less than the threshold, it is determined that the d-vector feature of the voice to be recognized does not match the database, and there is no need to perform video caption synthesis for this voice.
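  • Steps 206 to 209 compare the d-vector of the voice to be recognized against every standard d-vector using cosine distance and the 0.5 threshold mentioned above. A minimal NumPy sketch, returning None when nothing in the voiceprint database matches:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_voiceprint(d_vec: np.ndarray, database: dict, threshold: float = 0.5):
    """Return the identifier of the best-matching standard d-vector (step 208),
    or None if the smallest cosine distance is not below the threshold (step 209)."""
    best_speaker, best_dist = None, float("inf")
    for speaker, standard in database.items():
        dist = cosine_distance(d_vec, standard)
        if dist < best_dist:
            best_speaker, best_dist = speaker, dist
    return best_speaker if best_dist < threshold else None
```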
  • step 210 the voiceprint identifier of the voice to be recognized is obtained according to the d-vector feature, and the voiceprint identifier and the text information are synthesized into subtitles.
  • the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database; the voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying the speaker's identity information.
  • the audience can know exactly which sentence was said by which person, so as to help understand the content in the video.
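  • Finally, step 210 attaches the voiceprint identifier to the recognized text. The patent does not prescribe a subtitle file format; the sketch below emits SubRip (.srt) entries as one plausible choice, with the timestamps assumed to come from the earlier segmentation of the voice to be recognized.

```python
def to_srt(entries) -> str:
    """entries: iterable of (start_sec, end_sec, speaker_id, text) tuples."""
    def stamp(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, speaker, text) in enumerate(entries, 1):
        # each subtitle carries both the speaker identity and the speech content
        blocks.append(f"{i}\n{stamp(start)} --> {stamp(end)}\n{speaker}: {text}\n")
    return "\n".join(blocks)

# print(to_srt([(1.0, 3.2, "Speaker A", "Hello, everyone.")]))
```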
  • the embodiment of the present application determines the audio track information in the video, deletes useless audio tracks, extracts the voice information from the audio tracks that require subtitle synthesis, and then determines whether the voice information meets the preset voice information conditions.
  • the preset voice information conditions are met, it is confirmed that the voice information corresponds to the target object, and the voice to be recognized of the target object is extracted.
  • the voice to be recognized is input into the trained d-vector voiceprint recognition model; it is therefore first necessary to train the d-vector voiceprint recognition model, and after the trained model is obtained, a large number of the target object's voice samples are input to obtain the standard d-vector features of the target object.
  • a database is generated from the standard d-vector features; the voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match.
  • in that case, the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person, and finally the text information of the voice to be recognized is obtained,
  • and the text information of the voice to be recognized and the voiceprint identification generated according to the d-vector feature are synthesized to generate video subtitles with speaker information.
  • the efficiency of speech recognition can be improved through the screening of audio tracks and the screening of voice information.
  • the WCCN method is used for channel compensation, which reduces the impact of channel differences.
  • the video caption synthesis with speaker information is realized, which can help the audience understand the content of the video.
  • an apparatus for subtitle synthesis includes:
  • the voice acquisition module is used to acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information;
  • a voiceprint recognition module configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
  • a voice recognition module configured to perform voice recognition on the voice to be recognized to obtain corresponding text information
  • the subtitle synthesis module is used to synthesize the voiceprint identification and text information to generate the subtitle of the voice to be recognized.
  • a device for synthesizing subtitles wherein the voice acquisition module includes:
  • the extraction module is used to obtain audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information in audio tracks that require subtitle synthesis;
  • the first judging module is configured to judge whether the voice information meets the preset voice information condition, and if so, confirm that the voice information corresponds to the target object, and extract the voice to be recognized of the target object.
  • an apparatus for subtitle synthesis includes:
  • a training module for extracting the Mel frequency cepstrum coefficient of the voice in the video
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the voiceprint recognition module includes:
  • the database module is used to input the Mel frequency cepstrum coefficient of the target object's voice into the d-vector voiceprint recognition model;
  • a voiceprint database is established according to the standard d-vector features.
  • the voiceprint recognition module includes:
  • the second judgment module is configured to input the voice to be identified into a d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
  • the video caption synthesis device includes a speech acquisition module 510, a training module 520, a voiceprint recognition module 530, a speech recognition module 540, and a caption synthesis module 550.
  • the voice acquisition module 510 is configured to acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information.
  • the voice information acquired by the voice acquisition module 510 contains the voice of the character dialogue in the video.
  • the voice information may include the number of words spoken, the frequency of speaking, and the gender and age corresponding to the voice.
  • the voice information in the video is obtained, and the voice information contains the voice information of the target person in the video, and the voice of the target person is extracted through filtering, so as to obtain the voice to be recognized.
  • FIG. 7 is a schematic diagram of a second structure of a video caption synthesis device according to an embodiment of the present application.
  • the voice acquisition module 510 further includes an extraction module 511 and a first judgment module 512.
  • the extraction module 511 is used to obtain audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information in audio tracks that require subtitle synthesis.
  • the first determining module 512 is configured to determine whether the voice information meets the preset voice information condition, and if so, confirm that the voice information corresponds to the target object, and extract the voice to be recognized of the target object.
  • the preset voice information condition can be the speaker's speaking frequency; it can also be judged by the speaker's gender in the voice information, performing subtitle synthesis only for male or only for female voices; it can also be judged by age, since children's voices differ from adults' voices, and subtitle synthesis can be performed for one of them; in addition, information such as timbre and speech speed can also serve as the judgment basis. The preset voice information conditions can be adjusted according to actual needs.
  • the training module 520 is used to extract the Mel frequency cepstrum coefficients of the target object's voice
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • the training module 520 trains the model on a large number of voice samples, using labels in the form of one-hot encoding as the reference target for training the d-vector voiceprint recognition model.
  • one-hot encoding solves the problem that classifiers cannot handle discrete data well, and to a certain extent it also plays a role in expanding the features.
  • the voiceprint recognition module 530 is used to input the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.
  • the voiceprint recognition module 530 includes a database module 531 and a second judgment module 532.
  • the database module 531 is configured to input the Mel frequency cepstrum coefficient of the target object's voice into the d-vector voiceprint recognition model;
  • a voiceprint database is established according to the standard d-vector features.
  • the input voice of the target object can be multiple segments of speech.
  • a corresponding d-vector feature will be generated for each segment of speech.
  • all the d-vector features obtained from the target object's speech segments are averaged to obtain a standard d-vector feature. Since a large amount of data is used to generate it, it can serve as the reference value in the database.
  • the second judgment module 532 is configured to input the voice to be identified into a d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • the threshold value of the cosine distance is set as a standard to determine whether the d-vector feature of the voice to be identified matches the standard d-vector feature.
  • the threshold value of the cosine distance is set to 0.5.
  • if the cosine distance between the d-vector feature and the standard d-vector feature is 0.2, it means that the d-vector feature of the voice to be recognized matches the standard d-vector feature.
  • the voice recognition module 540 is configured to perform voice recognition on the to-be-recognized voice to obtain corresponding text information.
  • the subtitle synthesis module 550 is configured to synthesize the voiceprint identification and text information to generate subtitles of the voice to be recognized.
  • the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database; the voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying the speaker's identity information.
  • the audience can know exactly which sentence was said by which person, so as to help understand the content in the video.
  • the embodiment of the present application determines the audio track information in the video, deletes useless audio tracks, extracts the voice information from the audio tracks that require subtitle synthesis, and then determines whether the voice information meets the preset voice information conditions.
  • the preset voice information conditions are met, it is confirmed that the voice information corresponds to the target object, and the voice to be recognized of the target object is extracted.
  • the voice to be recognized is input into the trained d-vector voiceprint recognition model; it is therefore first necessary to train the d-vector voiceprint recognition model, and after the trained model is obtained, a large number of the target object's voice samples are input to obtain the standard d-vector features of the target object.
  • a database is generated from the standard d-vector features; the voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match.
  • in that case, the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person, and finally the text information of the voice to be recognized is obtained,
  • and the text information of the voice to be recognized and the voiceprint identification generated according to the d-vector feature are synthesized to generate video subtitles with speaker information.
  • the efficiency of speech recognition can be improved through the screening of audio tracks and the screening of voice information.
  • the WCCN method is used for channel compensation, which reduces the impact of channel differences.
  • the video caption synthesis with speaker information is realized, which can help the audience understand the content of the video.
  • the video caption synthesis device belongs to the same concept as the video caption synthesis method in the above embodiments. Any method provided in the video subtitle synthesis method embodiments can be run on the video subtitle synthesis device; for the details of its specific implementation process, refer to the embodiments of the method for synthesizing video captions, which will not be repeated here.
  • the term "module" used herein can be regarded as a software object executed on the operating system.
  • the different components, modules, engines, and services described in this article can be regarded as implementation objects on the computing system.
  • the devices and methods described herein can be implemented in the form of software, or of course, can also be implemented on hardware, and they are all within the protection scope of the present application.
  • an embodiment of the present application provides a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any video caption synthesis method provided in the embodiments of the present application.
  • the instruction can perform the following steps:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks or optical discs, etc.
  • the steps of any video caption synthesis method provided by the embodiments of the present application can thus be implemented.
  • the beneficial effects that can be achieved are detailed in the previous embodiments, and will not be repeated here.
  • the embodiment of the present application also provides an electronic device, such as a tablet computer, a mobile phone, and other electronic devices.
  • the processor in the electronic device will load the instructions corresponding to the process of one or more application programs into the memory according to the following steps, and the processor will run the application programs stored in the memory to implement various functions:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
  • when the voice information in the video is obtained and the voice to be recognized is obtained according to the characteristics of the voice information, the processor is configured to perform the following steps:
  • before the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
  • the one-hot encoded label is used as the training reference target of the d-vector voiceprint recognition model, and the d-vector voiceprint recognition model is trained.
  • when the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
  • a voiceprint database is established according to the standard d-vector features.
  • after acquiring the d-vector feature of the target object's voice through the d-vector voiceprint recognition model, the processor is configured to perform the following steps:
  • the WCCN method is used to perform channel compensation on the d-vector feature of the target object's voice.
  • before obtaining the voiceprint identifier corresponding to the voice, the processor is configured to perform the following steps:
  • the d-vector feature value of the voice to be identified is the d-vector feature value in the voiceprint database.
  • FIG. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
  • the electronic device 700 includes a processor 701, a memory 702, a display 703, a radio frequency circuit 704, an audio module 705, and a power supply 706.
  • the processor 701 is the control center of the electronic device 700. It uses various interfaces and lines to connect the various parts of the entire electronic device and, by running or loading a computer program stored in the memory 702 and calling data stored in the memory 702, executes the various functions of the electronic device 700 and processes data, thereby monitoring the electronic device 700 as a whole.
  • the memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the computer programs and modules stored in the memory 702.
  • the memory 702 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, a computer program required by at least one function (such as a sound playback function or an image playback function), and the like; the storage data area may store data created according to the use of the electronic device, and the like.
  • the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
  • the processor 701 in the electronic device 700 loads the instructions corresponding to the process of one or more computer programs into the memory 702 according to the following steps, and the instructions are run by the processor 701 and stored in the memory 702 In order to realize various functions in the computer program, as follows:
  • the voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
  • the display 703 may be used to display information input by the user or information provided to the user, and various graphical user interfaces. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display 703 may include a display panel.
  • the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 704 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network equipment or other electronic equipment and to transmit and receive signals with the network equipment or other electronic equipment.
  • the audio module 705 includes dual speakers and audio circuits.
  • on one hand, the audio circuit can convert received audio data into an electrical signal and transmit it to the dual speakers, which convert it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit and converted into audio data. The audio data is then processed by the audio data output processor 701 and sent, for example, to another terminal through the radio frequency circuit 704, or output to the memory 702 for further processing.
  • the audio circuit may also include an earplug jack to provide communication between the peripheral earphone and the terminal.
  • the power supply 706 can be used to power various components of the electronic device 700.
  • the power supply 706 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the electronic device 700 may also include a camera, a Bluetooth module, etc., which will not be described in detail here.
  • the storage medium may be a magnetic disk, an optical disc, a read only memory (Read Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • those of ordinary skill in the art can understand that all or part of the process of implementing the video caption synthesis method of the embodiments of the present application can be completed under the control of a computer program.
  • the computer program can be stored in a computer readable storage medium, such as stored in the memory of an electronic device, and executed by at least one processor in the electronic device.
  • the execution process may include the flow of the embodiments of the video caption synthesis method.
  • the storage medium can be magnetic disk, optical disk, read-only memory, random access memory, etc.
  • for the video caption synthesis device of the embodiments of the present application, its functional modules may be integrated in one processing chip, or each module may exist alone physically, or two or more modules may be integrated in one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Disclosed in the present application are a video subtitle synthesis method, comprising: obtaining voice information in a video, and obtaining a voice to be recognized according to a feature of the voice information; inputting the voice to be recognized to a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature; performing voice recognition on the voice to be recognized to obtain corresponding text information; and synthesizing the voiceprint identifier and the text information to generate subtitles of the voice to be recognized.

Description

Video caption synthesis method, device, storage medium and electronic equipment
Technical field
This application belongs to the technical field of video production, and in particular relates to a video caption synthesis method, device, storage medium and electronic equipment.
Background art
With the rapid development of smart terminals, the way humans receive and store data is no longer limited to pictures or text; video has also become a main medium of information transmission. Video can convey sound and picture information, but when the language in a video is different, video subtitles are used to convey the information.
Application content
This embodiment of the application provides a video caption synthesis method, device, storage medium, and electronic equipment, which can add speaker information and voice content to the video.
In the first aspect, this embodiment of the present application provides a subtitle synthesis method applied to an electronic device, and the method includes:
acquiring voice information in the video, and obtaining the voice to be recognized according to the characteristics of the voice information;
inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
performing voice recognition on the voice to be recognized to obtain corresponding text information;
synthesizing the voiceprint identification and text information to generate subtitles of the voice to be recognized.
In a second aspect, an embodiment of the present application provides a video caption synthesis device applied to an electronic device, and the device includes:
a voice acquisition device, configured to acquire voice information in a video and obtain a voice to be recognized according to the characteristics of the voice information;
a voiceprint recognition device, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
a voice recognition device, configured to perform voice recognition on the voice to be recognized to obtain corresponding text information;
a caption synthesis device, configured to synthesize the voiceprint identification and text information to generate the caption of the voice to be recognized.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein when the computer program is executed on a computer, the computer is caused to execute the video caption synthesis method provided in the first aspect of the embodiments.
In a fourth aspect, an embodiment of the present application provides an electronic device for video caption synthesis, including a processor and a memory, wherein the processor is configured, by calling a computer program in the memory, to execute:
acquiring voice information in the video, and obtaining the voice to be recognized according to the characteristics of the voice information;
inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;
performing voice recognition on the voice to be recognized to obtain corresponding text information;
synthesizing the voiceprint identification and text information to generate subtitles of the voice to be recognized.
The embodiments of this application can use the d-vector voiceprint recognition model to recognize the speaker in the video, obtain the speaker's voiceprint identification, and finally synthesize the voiceprint identification and text information to generate subtitles with speaker information.
Description of the drawings
The following detailed description of specific implementations of the application, with reference to the accompanying drawings, will make the technical solutions of the application and their beneficial effects obvious.
FIG. 1 is a schematic diagram of the first flow of a method for synthesizing video captions according to an embodiment of the present application.
FIG. 2 is a schematic diagram of a second flow of a method for synthesizing video captions provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of a DNN neural network of a d-vector voiceprint recognition model provided by an embodiment of the present application.
Fig. 4 is a training flowchart of a d-vector voiceprint recognition model provided by an embodiment of the present application.
Fig. 5 is a flowchart of establishing a voiceprint database of a method for synthesizing video captions provided by an embodiment of the present application.
FIG. 6 is a schematic diagram of a first structure of a video caption synthesis device provided by an embodiment of the present application.
Fig. 7 is a schematic diagram of a second structure of a video caption synthesis device provided by an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device for video caption synthesis provided by an embodiment of the present application.
Detailed Description

Please refer to the drawings, in which identical reference numerals represent identical components. The principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments that are not described in detail herein.

In the following description, specific embodiments of the present application are described with reference to steps and symbols executed by one or more computers, unless otherwise stated. These steps and operations will therefore be referred to several times as being computer-executed. Computer execution as used herein includes operations by a computer processing unit on electronic signals representing data in a structured form. Such operations transform the data or maintain it at locations in the computer's memory system, which may reconfigure or otherwise alter the operation of the computer in a manner well known to those skilled in the art. The data structures maintained by the data are physical locations in the memory that have particular properties defined by the data format. However, although the principles of the application are described in the above terms, this is not meant as a limitation; those skilled in the art will appreciate that the various steps and operations described below may also be implemented in hardware.

The terms "first", "second", and "third" in this application are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not limited to the listed steps or modules; some embodiments further include steps or modules that are not listed, or further include other steps or modules inherent to the process, method, product, or device.

Detailed descriptions are given below.
With the rapid development of big data technology, the data stored by humans is no longer limited to text and pictures; video has also become a primary medium of information transmission. Subtitles help different audiences better understand video content and speed up the sharing of videos across languages. However, in some programs it is difficult to determine the specific speaker from the text content alone, and users may have difficulty understanding the video content. In the embodiments of this application, a d-vector voiceprint recognition model is added on the basis of speech recognition and subtitle synthesis, so that during subtitle synthesis the speaker's voiceprint can be recognized and speaker information can be added to each segment of subtitles; viewers can thus know the speaker's identity when reading the subtitles.
In an embodiment, a subtitle synthesis method is applied to an electronic device, and the method includes:

acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;

inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;

performing speech recognition on the voice to be recognized to obtain corresponding text information; and

synthesizing the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

In an embodiment, before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further includes:

acquiring audio track information in the video; and

deleting audio tracks that do not require subtitle synthesis, and extracting voice information from audio tracks that require subtitle synthesis.

In an embodiment, acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information includes:

determining whether the voice information meets a preset voice information condition; and

if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.

In an embodiment, before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further includes:

extracting Mel-frequency cepstral coefficients of the voice in the video;

inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

using one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.

In an embodiment, inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature, further includes:

inputting the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtaining d-vector features of the target object's voice through the d-vector voiceprint recognition model;

averaging the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establishing a voiceprint database according to the standard d-vector feature.

In an embodiment, after obtaining the d-vector features of the target object's voice through the d-vector voiceprint recognition model, the method further includes:

performing channel compensation on the d-vector features of the target object's voice using the WCCN method.

In an embodiment, before obtaining the voiceprint identifier corresponding to the voice, the method further includes:

inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;

calculating the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

determining whether the cosine distance is less than a threshold; and

if so, determining that the d-vector feature of the voice to be identified matches a d-vector feature in the voiceprint database.
Please refer to FIG. 1, which is a first schematic flowchart of a video subtitle synthesis method provided by an embodiment of the present application. The method is applicable to electronic devices such as computers, mobile phones, and tablets. The video subtitle synthesis method may include the following steps.

In step 101, voice information in a video is acquired, and a voice to be recognized is obtained according to characteristics of the voice information.

It can be understood that a video contains sound information and picture information, and the sound information contains background music and the speech of characters talking in the video. The voice information of the character dialogue in the video is acquired; the voice information may include the number of words spoken, the frequency of speaking, and information such as the gender and age corresponding to the voice.

For a video requiring subtitle synthesis, the voice information in the video is acquired. The voice information contains the voice of the target person in the video; through filtering, the target person's voice is extracted, thereby obtaining the voice to be recognized.
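As a rough illustration of how the speech extraction in step 101 could be approximated, the sketch below performs a crude energy-based voice activity detection on an already-decoded mono waveform; the function name, frame length, and energy threshold are assumptions made only for this example and are not part of this application.

```python
import numpy as np

def extract_speech_segments(waveform, sample_rate, frame_ms=30, energy_ratio=0.1):
    """Return (start_sec, end_sec) spans whose short-time energy suggests speech.

    A simple stand-in for the filtering described in step 101.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    threshold = energy_ratio * energy.max()

    segments, start = [], None
    for i, e in enumerate(energy):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```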
In step 102, the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.

It can be understood that a voiceprint is the spectrum of sound waves carrying speech information, as displayed by electro-acoustic instruments. A voiceprint is not only specific but also relatively stable; every person's voiceprint is unique, and no two speakers have the same voiceprint.

In the process of video subtitle synthesis, in addition to using a speech recognition model to recognize the content of the speaker's speech and obtain the text information of the voice, the speaker's voiceprint can also be recognized through a voiceprint recognition model, and the speaker's voiceprint corresponds to the speaker's identity information.

In the embodiment of this application, the voice to be recognized is input into the d-vector voiceprint recognition model, and the d-vector feature of the voice to be recognized is then compared with the standard d-vector features in the database. If the d-vector feature of the voice to be recognized matches a standard d-vector feature in the database, the speaker identity of the voice to be recognized is determined; a voiceprint identifier is then generated according to the d-vector feature of the voice to be recognized, for use in the subsequent video subtitle synthesis steps.

In step 103, speech recognition is performed on the voice to be recognized to obtain corresponding text information.

It can be understood that the speech recognition referred to in this application recognizes the speaker's speech and obtains its text information; the speech recognition function can recognize the text of the speech but cannot obtain the voiceprint information of the speech.

The voice to be recognized is input into a speech recognition model to obtain its text information. At this point, the text information corresponds only to each segment of speech, and the speaker of each segment cannot yet be determined.

In step 104, the voiceprint identifier and the text information are synthesized to generate subtitles for the voice to be recognized.

It can be understood that, after the voiceprint identifier of the voice to be recognized and the text information of the voice to be recognized are obtained, the text and the voiceprint identifier are synthesized to generate subtitles containing the speaker's identity information. When watching the video, viewers can better understand the content by combining the speaker's identity with what the speaker says.

In summary, the embodiment of this application acquires the voice information in the video, obtains the voice to be recognized from the voice information, inputs the voice to be recognized into the d-vector voiceprint recognition model to obtain its voiceprint identifier, performs speech recognition on the voice to be recognized to obtain its text information, and finally synthesizes the voiceprint identifier and the text information to generate subtitles carrying speaker identity information. The embodiment of this application can add speaker identity information to videos with multiple speakers, helping viewers understand the content of the video.
Please refer to FIG. 2, which is a second schematic flowchart of a video subtitle synthesis method provided by an embodiment of the present application. The method is applicable to electronic devices such as computers, mobile phones, and tablets. The video subtitle synthesis method may include the following steps.

In step 201, the audio track information of the video is acquired, and voice information is extracted from the audio tracks that require subtitle synthesis.

It can be understood that an audio-video file contains multiple audio tracks, and each track carries different content. For example, the songs we hear in daily life are recorded using multi-track technology: the singer's vocals are recorded on a separate track, the band's performance on another, and other background sounds can also be recorded on their own track. Accordingly, a video file may also contain multiple audio tracks.

For example, the audio track information of the acquired video is obtained: track A carries the background music, track B carries the human voices, and track C carries animal sounds. Among these three tracks, only track B actually requires subtitle annotation; the remaining tracks do not, so track B, which requires subtitle synthesis, can be extracted.

It should be noted that for multi-track videos this embodiment may delete audio tracks to reduce the workload of video subtitle synthesis and improve its efficiency. For a single track, or for other complex tracks that contain both speaker voices and other noise, there is no need to identify and delete tracks; the speaker's voice can be recognized in the subsequent steps, and subtitles carrying speaker information are ultimately synthesized.
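As a rough illustration of the track selection in step 201, the desired audio stream could be demultiplexed with a command-line tool such as ffmpeg; the stream index, file names, and output format below are assumptions made only for this sketch.

```python
import subprocess

def extract_voice_track(video_path: str, track_index: int, out_wav: str) -> None:
    """Demux a single audio stream (e.g. the human-voice track) to a WAV file.

    `-map 0:a:<n>` selects the n-th audio stream of the first input and
    `-vn` drops the video stream; 16 kHz mono PCM is a common front-end
    format for later MFCC extraction.
    """
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-map", f"0:a:{track_index}",
        "-vn", "-ac", "1", "-ar", "16000",
        out_wav,
    ], check=True)

# e.g. track B (audio stream index 1) of a hypothetical input file:
# extract_voice_track("program.mp4", 1, "voice_track.wav")
```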
In step 202, it is determined whether the voice information meets a preset voice information condition.

After the voice information contained in the audio track is obtained, the voice information needs to be judged to further filter out the voice information that requires subtitle synthesis.

For example, the judgment can be made based on how frequently a speaker speaks in the voice information; in that case the preset voice information condition may be the speaker's speaking frequency. The judgment can also be made based on the speaker's gender, so that subtitle synthesis targets only male or only female voices, or based on age, since children's voices differ from adults' voices and subtitle synthesis can target one of them. In addition, information such as timbre and speaking rate can also serve as the basis for judgment. The preset voice information condition can be adjusted according to actual needs.

If the voice information in the audio track meets the preset voice condition, the process proceeds to step 203; if not, the process proceeds to step 204.

In step 203, the voice information corresponds to a target object, and the voice to be recognized of the target object is extracted.

If it is determined in step 202 that the voice information meets the preset voice information condition, the voice information needs to be used for video subtitle synthesis. In this case the voice information corresponds to a target object, and the target object is a speaker who requires subtitle synthesis; the voice to be recognized of the target object is then extracted, and the process proceeds to the next step.

It should be noted that the target object may include multiple objects, and the voice information may contain multiple voice subjects; the voice subjects correspond to the target objects. In a video requiring subtitle synthesis, the voice to be recognized of the target object is extracted.

In step 204, no subtitle synthesis is performed on the voice.

If the voice information does not meet the preset voice information condition, the voice information does not need to be processed, and no subtitle synthesis is performed on the voice.
In step 205, the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be recognized.

When the voice to be recognized is input into the d-vector voiceprint recognition model, the model used at this point has already been trained. The d-vector voiceprint recognition model adopts a DNN neural network structure; by exploiting the different hidden layers of the DNN, it can effectively extract features ranging from low-level to high-level and extract the voiceprint information in the speech.

Please refer to FIG. 3, which is a schematic diagram of the DNN neural network of the d-vector voiceprint recognition model provided by an embodiment of the present application.

The d-vector voiceprint model shown in the figure is a 4-layer DNN neural network model; the model shown is only one of many possibilities and does not limit the application. The DNN neural network model in this embodiment of the application includes an input layer, four hidden layers, and an output layer.
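A minimal sketch of such a network is given below, assuming stacked MFCC frames as input and a softmax over training speakers as output; the layer width, input dimension, and the use of PyTorch are illustrative assumptions rather than part of this application.

```python
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Input layer -> 4 fully connected hidden layers -> softmax output.

    At enrollment/recognition time the output layer is discarded and the
    activations of the last hidden layer are used as the d-vector.
    """
    def __init__(self, input_dim=13 * 21, hidden_dim=256, num_speakers=500):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.output = nn.Linear(hidden_dim, num_speakers)  # used only during training

    def forward(self, x, return_dvector=False):
        h = self.hidden(x)
        return h if return_dvector else self.output(h)
```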
When training the d-vector voiceprint recognition model, a large amount of speech data needs to be fed into the DNN neural network before a d-vector voiceprint recognition model fit for normal use can be obtained. For the specific training steps, please refer to FIG. 4, which is a training flowchart of the d-vector voiceprint recognition model provided by an embodiment of the present application. The training steps of the d-vector voiceprint recognition model specifically include the following.

In step 301, Mel-frequency cepstral coefficients of the speech in the video are extracted.

In the field of sound processing, the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum.

In this embodiment of the application, the Mel-frequency cepstral coefficients of the speech in the video are extracted and used as input to the d-vector voiceprint recognition model being trained. The speech of the target object may come from the video or from other sources; a large amount of speech data serves as the input data for training the d-vector voiceprint recognition model. After the Mel-frequency cepstral coefficients of the speech data are extracted, the next step is performed.
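As an illustration of step 301, MFCCs could be computed with an off-the-shelf library such as librosa; the sample rate, coefficient count, and file name are assumptions used only for this sketch.

```python
import librosa
import numpy as np

def compute_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.T
```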
It should be noted that the d-vector voiceprint recognition model to be trained in this embodiment is the not-yet-trained DNN neural network model.

In step 302, the Mel-frequency cepstral coefficients are input into the d-vector voiceprint recognition model.

It can be understood that at this point the d-vector voiceprint recognition model is a model whose training has not been completed; in this step it serves only as the model being trained, and the Mel-frequency cepstral coefficients are input into the d-vector voiceprint recognition model for training.

In step 303, labels in one-hot encoded form are used as the training reference target of the d-vector voiceprint recognition model.

In this embodiment of the application, the d-vector voiceprint recognition model is trained by using labels in one-hot encoded form as the reference target for training. One-hot encoding addresses the difficulty classifiers have with discrete data and, to a certain extent, also expands the feature space. Of course, other forms of labels can also be used as the reference target for model training, which is not limited here.
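A small numpy illustration of the one-hot speaker labels assumed in step 303; the speaker indices and counts are made up for the example.

```python
import numpy as np

def one_hot(speaker_ids: np.ndarray, num_speakers: int) -> np.ndarray:
    """Map integer speaker ids, e.g. [0, 2, 1], to one-hot rows."""
    labels = np.zeros((len(speaker_ids), num_speakers), dtype=np.float32)
    labels[np.arange(len(speaker_ids)), speaker_ids] = 1.0
    return labels

# one_hot(np.array([0, 2, 1]), num_speakers=3) ->
# [[1,0,0], [0,0,1], [0,1,0]]
```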
In step 304, the d-vector voiceprint recognition model is trained using the gradient descent method until training is complete.

Gradient descent is a first-order optimization algorithm, commonly also called the method of steepest descent. During training, the gradient descent method can be used to train the model; of course, other methods can also be used to train the DNN neural network model, that is, the d-vector voiceprint recognition model is trained until training converges.
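Steps 301 to 304 could be combined into a training loop along the following lines; this is a hedged sketch assuming the PyTorch model sketched earlier, mini-batches of stacked MFCC frames, and plain SGD with cross-entropy against the speaker labels.

```python
import torch
import torch.nn as nn

def train_dvector_model(model, data_loader, epochs=10, lr=1e-3):
    """data_loader yields (features, speaker_id) mini-batches.

    Cross-entropy against the integer speaker label is equivalent to training
    toward the one-hot targets of step 303; SGD realizes step 304.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, speaker_id in data_loader:
            optimizer.zero_grad()
            logits = model(features)
            loss = criterion(logits, speaker_id)
            loss.backward()
            optimizer.step()
    return model
```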
After training of the d-vector voiceprint recognition model is completed, a voiceprint database also needs to be established based on the trained model, for use in verifying the voice to be recognized in subsequent steps.

Please refer to FIG. 5, which is a flowchart of establishing the voiceprint database. The specific process of establishing the voiceprint database includes the following.

In step 401, the Mel-frequency cepstral coefficients of the target object's voice are input into the d-vector voiceprint recognition model.

It can be understood that the purpose of establishing the voiceprint database is to use its data as the reference target in the subsequent verification steps. When the database is established, it must contain the voice of the target object, so that the completed database contains the d-vector features of the target object.

The Mel-frequency cepstral coefficients of the target object's voice are input into the d-vector voiceprint recognition model, which at this point is a fully trained model. Taking the d-vector voiceprint model diagram in FIG. 3 as an example, after the output layer is removed, the output of the last layer, i.e., hidden layer 4 shown in FIG. 3, is the required d-vector feature.

In step 402, the d-vector features of the target object's voice are obtained through the d-vector voiceprint recognition model.

The Mel-frequency cepstral coefficients of the target object's voice are input, and the d-vector features of the target object's voice are obtained. To reduce the influence of channel differences on the obtained d-vector features, this embodiment uses the WCCN method for channel compensation. The WCCN method scales the subspace to attenuate the dimensions with high within-class variance, and can therefore serve as a channel compensation technique. After WCCN channel compensation, the compensated d-vector feature, i.e., d-vector V_WCCN, is obtained.
First, the within-class variance matrix W needs to be calculated:

$$W = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{n_s}\sum_{i=1}^{n_s}\left(v_i^{s}-\bar{v}_s\right)\left(v_i^{s}-\bar{v}_s\right)^{T}$$

where S represents the target objects, $n_s$ is the number of utterances of target object s, $v_i^{s}$ is the d-vector feature obtained by inputting the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model, and $\bar{v}_s$ is the mean d-vector of target object s. Next, the WCCN matrix $B_1$ is computed using the Cholesky decomposition, where the Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower-triangular matrix L and its transpose. The formula is as follows:

$$W^{-1} = B_1 B_1^{T}$$

Then the d-vector after WCCN channel compensation, $V_{WCCN}$, is:

$$V_{WCCN} = B_1^{T}\, v$$
It should be noted that the WCCN method used for channel compensation in this embodiment is only one channel compensation method; specifically, channel compensation can also be performed according to methods such as LDA, PLDA, and NAP to reduce the influence of channel differences.
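A hedged numpy sketch of the WCCN compensation described above; the grouping of enrollment d-vectors by speaker and the function names are assumptions of the example.

```python
import numpy as np

def wccn_matrix(dvectors_by_speaker):
    """dvectors_by_speaker: list of (n_s, dim) arrays, one per target object.

    Returns B_1 such that W^{-1} = B_1 B_1^T, where W is the within-class
    covariance averaged over speakers.
    """
    dim = dvectors_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    for vs in dvectors_by_speaker:
        centered = vs - vs.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vs)
    W /= len(dvectors_by_speaker)
    return np.linalg.cholesky(np.linalg.inv(W))  # lower-triangular B_1

def apply_wccn(B1, dvector):
    """Project a raw d-vector into the channel-compensated space."""
    return B1.T @ dvector
```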
In step 403, the d-vector features of the target object's voice are averaged to obtain the standard d-vector feature.

The input voice of the target object may consist of multiple speech segments. After the Mel-frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is applied, a d-vector feature corresponding to each segment is generated. Averaging all d-vector features of the target object obtained from the individual segments yields a standard d-vector feature; since a large amount of data is used to generate this d-vector feature, it can be used as the enrolled value in the database.

In step 404, a voiceprint database is established according to the standard d-vector features.

It can be understood that, after each target object's voice passes through the d-vector voiceprint recognition model, a standard d-vector feature corresponding to that target object is generated; these d-vector features are combined into a database for verifying the voice to be recognized in subsequent steps.
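A minimal sketch of steps 403 and 404, assuming the compensated per-utterance d-vectors are already available; the dictionary-as-database structure and speaker names are assumptions for illustration.

```python
import numpy as np

def build_voiceprint_database(compensated_dvectors):
    """compensated_dvectors: {speaker_name: (n_utterances, dim) array}.

    Each speaker's utterance-level d-vectors are averaged into one standard
    d-vector (step 403) and stored as that speaker's database entry (step 404).
    """
    return {name: vecs.mean(axis=0) for name, vecs in compensated_dvectors.items()}

# e.g. database = build_voiceprint_database({"host": host_vecs, "guest": guest_vecs})
```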
Please continue to refer to FIG. 2. In step 206, the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated.

After the voice to be recognized is input into the d-vector voiceprint recognition model, the d-vector feature of the voice to be recognized is generated. In this embodiment, the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated to verify whether the d-vector feature of the voice to be recognized corresponds to a target object's d-vector feature in the database.

It should be noted that, in other embodiments, the voice to be recognized can also be verified in other ways, for example, using Perceptual Linear Predictive (PLP) features together with a Gaussian Mixture Model (GMM) for voiceprint authentication.

In step 207, it is determined whether the cosine distance is less than a threshold.

In this embodiment, a threshold on the cosine distance is used as the criterion for judging whether the d-vector feature of the voice to be identified matches a standard d-vector feature. For example, if the cosine distance threshold is set to 0.5 and the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is 0.2, the d-vector feature of the voice to be recognized matches that standard d-vector feature.
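A small numpy illustration of the scoring in steps 206 and 207; the distance definition (1 minus cosine similarity) and the 0.5 threshold follow the example above, while the database format follows the earlier sketch.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; smaller means more similar."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(dvector, database, threshold=0.5):
    """Return the matching speaker name, or None if no entry is close enough."""
    best_name, best_dist = None, float("inf")
    for name, standard_dvector in database.items():
        dist = cosine_distance(dvector, standard_dvector)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```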
In step 208, the d-vector feature of the voice to be recognized matches a d-vector feature in the voiceprint database.

According to the result of step 207, if the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is less than the threshold, it is determined that the d-vector feature of the voice to be recognized matches the database, and the next step is performed.

In step 209, the d-vector feature of the voice to be recognized does not match any d-vector feature in the voiceprint database.

According to the result of step 207, if the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is not less than the threshold, it is determined that the d-vector feature of the voice to be recognized does not match the database, and no video subtitle synthesis needs to be performed on the voice to be recognized.

In step 210, the voiceprint identifier of the voice to be recognized is obtained according to the d-vector feature, and the voiceprint identifier and the text information are synthesized into subtitles.

After the judgment on the voice to be recognized, the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database. The voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying speaker identity information. When watching the video, viewers can know exactly which sentence was said by which person, which helps them understand the content of the video.
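Step 210 could ultimately emit a standard subtitle file; the sketch below writes SRT entries whose text is prefixed with the speaker name, assuming each recognized segment already carries its time span, matched speaker, and transcribed text (all field names and the SRT choice are assumptions of the example).

```python
def format_timestamp(seconds: float) -> str:
    """SRT timestamps look like 00:01:02,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path):
    """segments: iterable of (start_sec, end_sec, speaker_name, text)."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, speaker, text) in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(f"[{speaker}] {text}\n\n")

# write_srt([(0.0, 2.5, "host", "Welcome to the show.")], "output.srt")
```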
In summary, the embodiment of the present application judges the audio track information in the video, deletes useless audio tracks, and extracts voice information from the audio tracks that require subtitle synthesis. It then judges whether the voice information meets the preset voice information condition; if so, it confirms that the voice information corresponds to a target object and extracts the voice to be recognized of the target object. Before the voice to be recognized is input into the trained d-vector voiceprint recognition model, the d-vector voiceprint recognition model also needs to be trained. After the trained model is obtained, a large amount of the target object's speech is input to obtain the target object's standard d-vector features, and a database is generated from these standard d-vector features. The voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match, meaning the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person. Finally, the text information of the voice to be recognized is obtained, and the text information and the voiceprint identifier generated from the d-vector feature are synthesized to generate video subtitles carrying speaker information. In the embodiment of the present application, filtering the audio tracks and the voice information improves the efficiency of speech recognition; when obtaining the d-vector features, the WCCN method is used for channel compensation, reducing the influence of channel differences. Finally, video subtitle synthesis with speaker information is achieved, which helps viewers understand the content of the video.
In an embodiment, a subtitle synthesis apparatus includes:

a voice acquisition module, configured to acquire voice information in a video and obtain a voice to be recognized according to characteristics of the voice information;

a voiceprint recognition module, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized;

a speech recognition module, configured to perform speech recognition on the voice to be recognized to obtain corresponding text information; and

a subtitle synthesis module, configured to synthesize the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

In an embodiment of the subtitle synthesis apparatus, the voice acquisition module includes:

an extraction module, configured to acquire audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information from audio tracks that require subtitle synthesis; and

a first judgment module, configured to judge whether the voice information meets a preset voice information condition and, if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.

In an embodiment, the subtitle synthesis apparatus includes:

a training module, configured to extract the Mel-frequency cepstral coefficients of the voice in the video;

input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

use one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.

In an embodiment of the subtitle synthesis apparatus, the voiceprint recognition module includes:

a database module, configured to input the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtain the d-vector features of the target object's voice through the d-vector voiceprint recognition model;

average the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establish a voiceprint database according to the standard d-vector feature.

In an embodiment of the subtitle synthesis apparatus, the voiceprint recognition module includes:

a second judgment module, configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;

calculate the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

judge whether the cosine distance is less than the threshold; and

if so, determine that the d-vector feature of the voice to be identified matches a d-vector feature value in the voiceprint database.
Please refer to FIG. 6, which is a first schematic structural diagram of a video subtitle synthesis apparatus provided by an embodiment of the present application. The video subtitle synthesis apparatus includes a voice acquisition module 510, a training module 520, a voiceprint recognition module 530, a speech recognition module 540, and a subtitle synthesis module 550.

The voice acquisition module 510 is configured to acquire voice information in a video and obtain a voice to be recognized according to characteristics of the voice information.

Specifically, the voice information acquired by the voice acquisition module 510 contains the speech of character dialogue in the video; the voice information may include the number of words spoken, the frequency of speaking, and information such as the gender and age corresponding to the voice.

For a video requiring subtitle synthesis, the voice information in the video is acquired. The voice information contains the voice information of the target person in the video; through filtering, the target person's voice is extracted, thereby obtaining the voice to be recognized.

Please refer to FIG. 7, which is a second schematic structural diagram of the video subtitle synthesis apparatus according to an embodiment of the present application. The voice acquisition module 510 further includes an extraction module 511 and a first judgment module 512.

The extraction module 511 is configured to acquire audio track information in the video, delete audio tracks that do not require subtitle synthesis, and extract voice information from audio tracks that require subtitle synthesis.

An audio-video file contains multiple audio tracks, and each track carries different content; the background music track in the video can be deleted, keeping only the track containing the voice information.

The first judgment module 512 is configured to judge whether the voice information meets the preset voice information condition and, if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.

Specifically, the judgment can be made based on how frequently a speaker speaks in the voice information; in that case the preset voice information condition may be the speaker's speaking frequency. The judgment can also be made based on the speaker's gender, so that subtitle synthesis targets only male or only female voices, or based on age, since children's voices differ from adults' voices and subtitle synthesis can target one of them. In addition, information such as timbre and speaking rate can also serve as the basis for judgment. The preset voice information condition can be adjusted according to actual needs.

The training module 520 is configured to extract the Mel-frequency cepstral coefficients of the target object's voice;

input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and

use one-hot encoded labels as the training reference target of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.

Specifically, the training module 520 trains the model with a large amount of speech and uses labels in one-hot encoded form as the reference target for training the d-vector voiceprint recognition model. One-hot encoding addresses the difficulty classifiers have with discrete data and, to a certain extent, also expands the feature space.
The voiceprint recognition module 530 is configured to input the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized.

Please refer to FIG. 7, in which the voiceprint recognition module 530 includes a database module 531 and a second judgment module 532.

The database module 531 is configured to input the Mel-frequency cepstral coefficients of the target object's voice into the d-vector voiceprint recognition model;

obtain the d-vector features of the target object's voice through the d-vector voiceprint recognition model;

average the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice; and

establish a voiceprint database according to the standard d-vector feature.

Specifically, the input voice of the target object may consist of multiple speech segments. After the Mel-frequency cepstral coefficients of each segment are input into the d-vector voiceprint recognition model and channel compensation is applied, a d-vector feature corresponding to each segment is generated. Averaging all d-vector features of the target object obtained from the individual segments yields a standard d-vector feature; since a large amount of data is used to generate this d-vector feature, it can be used as the enrolled value in the database.

The second judgment module 532 is configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain the d-vector feature of the voice to be identified;

calculate the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;

judge whether the cosine distance is less than the threshold; and

if so, determine that the d-vector feature of the voice to be identified matches a d-vector feature value in the voiceprint database.

In this embodiment, a threshold on the cosine distance is used as the criterion for judging whether the d-vector feature of the voice to be identified matches a standard d-vector feature. For example, if the cosine distance threshold is set to 0.5 and the cosine distance between the d-vector feature of the voice to be recognized and a standard d-vector feature is 0.2, the d-vector feature of the voice to be recognized matches that standard d-vector feature.

The speech recognition module 540 is configured to perform speech recognition on the voice to be recognized to obtain corresponding text information.

The voice to be recognized is input into the speech recognition module, and the text information of the voice to be recognized is obtained through the speech recognition module.

The subtitle synthesis module 550 is configured to synthesize the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

After the judgment on the voice to be recognized, the voiceprint identifier of the voice to be recognized is obtained, that is, the identifier generated from the corresponding standard d-vector feature in the database. The voiceprint identifier of the voice to be recognized and its text information are then synthesized to generate video subtitles carrying speaker identity information. When watching the video, viewers can know exactly which sentence was said by which person, which helps them understand the content of the video.
In summary, the embodiment of the present application judges the audio track information in the video, deletes useless audio tracks, and extracts voice information from the audio tracks that require subtitle synthesis. It then judges whether the voice information meets the preset voice information condition; if so, it confirms that the voice information corresponds to a target object and extracts the voice to be recognized of the target object. Before the voice to be recognized is input into the trained d-vector voiceprint recognition model, the d-vector voiceprint recognition model also needs to be trained. After the trained model is obtained, a large amount of the target object's speech is input to obtain the target object's standard d-vector features, and a database is generated from these standard d-vector features. The voice to be recognized is then input into the trained d-vector voiceprint recognition model to obtain its d-vector feature, and the cosine distance between the d-vector feature of the voice to be recognized and the standard d-vector feature is calculated. If the cosine distance is less than the threshold, the two are judged to match, meaning the speaker of the voice to be recognized and the speaker of the standard d-vector feature in the database are the same person. Finally, the text information of the voice to be recognized is obtained, and the text information and the voiceprint identifier generated from the d-vector feature are synthesized to generate video subtitles carrying speaker information. In the embodiment of the present application, filtering the audio tracks and the voice information improves the efficiency of speech recognition; when obtaining the d-vector features, the WCCN method is used for channel compensation, reducing the influence of channel differences. Finally, video subtitle synthesis with speaker information is achieved, which helps viewers understand the content of the video.
In the embodiments of this application, the video subtitle synthesis apparatus and the video subtitle synthesis method of the above embodiments belong to the same concept. Any method provided in the embodiments of the video subtitle synthesis method can be run on the video subtitle synthesis apparatus; for the specific implementation process, refer to the embodiments of the video subtitle synthesis method, which will not be repeated here.

The term "module" used herein may be regarded as a software object executed on the computing system. The different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The apparatus and method described herein may be implemented in software, and may of course also be implemented in hardware, both of which fall within the protection scope of the present application.
To this end, an embodiment of the present application provides a storage medium in which a plurality of instructions are stored; the instructions can be loaded by a processor to perform the steps in any of the video subtitle synthesis methods provided by the embodiments of the present application. For example, the instructions can perform the following steps:

acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;

inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;

performing speech recognition on the voice to be recognized to obtain corresponding text information; and

synthesizing the voiceprint identifier and the text information to generate subtitles for the voice to be recognized.

For the specific implementation of each of the above operations, refer to the previous embodiments, which will not be repeated here.

The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or the like.

Since the instructions stored in the storage medium can perform the steps in any of the video subtitle synthesis methods provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any of those methods; for details, refer to the previous embodiments, which will not be repeated here.
本申请实施例还提供一种电子设备,如平板电脑、手机等电子设备。电子设备中的处理器会按照如下的步骤,将一个或一个以上的应用程序的进程对应的指令加载到存储器中,并由处理器来运行存储在存储器中的应用程序,从而实现各种功能:The embodiment of the present application also provides an electronic device, such as a tablet computer, a mobile phone, and other electronic devices. The processor in the electronic device will load the instructions corresponding to the process of one or more application programs into the memory according to the following steps, and the processor will run the application programs stored in the memory to implement various functions:
获取视频当中的语音信息,根据所述语音信息的特征得到待识别语音;Acquire voice information in the video, and obtain the voice to be recognized according to the characteristics of the voice information;
将所述待识别语音输入至d-vector声纹识别模型,以得到所述待识别语音所对应的声纹标识,所述声纹标识包含d-vector特征;Inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including d-vector features;
对所述待识别语音进行语音识别以得到对应的文本信息;Performing voice recognition on the to-be-recognized voice to obtain corresponding text information;
将所述声纹标识和文本信息进行合成,以生成所述待识别语音的字幕。The voiceprint identification and text information are synthesized to generate subtitles of the voice to be recognized.
在一实施例中,在获取视频当中的语音信息,根据语音信息的特征得到待识别语音之前,处理器用于执行以下步骤:In one embodiment, before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
获取视频中的音轨信息;Obtain the audio track information in the video;
删除无需字幕合成的音轨,提取需要字幕合成的音轨中的语音信息。Delete the audio tracks that do not require subtitle synthesis, and extract the voice information in the audio tracks that require subtitle synthesis.
In one embodiment, when acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform the following steps:
Determining whether the voice information meets a preset voice information condition;
If so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
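The text does not fix what the preset voice information condition is. Purely as an illustration, the sketch below treats a segment as target-object speech when it is long enough and loud enough; both thresholds are assumptions of this sketch.

import numpy as np

def meets_preset_condition(waveform, sr, min_duration=0.5, energy_threshold=1e-3):
    # Hypothetical stand-in for the "preset voice information condition":
    # keep segments that are at least min_duration seconds long and whose
    # mean energy exceeds energy_threshold.
    duration = len(waveform) / sr
    energy = float(np.mean(np.square(waveform)))
    return duration >= min_duration and energy >= energy_threshold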
In one embodiment, before the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
Extracting Mel-frequency cepstral coefficients (MFCCs) of the voice in the video;
Inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model;
Using one-hot encoded labels as the training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
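A compressed sketch of this training setup is given below, assuming librosa for MFCC extraction and PyTorch for the frame-level network; the sampling rate, layer sizes, and number of speakers are illustrative choices rather than values taken from the text. The averaged activations of the last hidden layer later serve as the d-vector.

import librosa
import torch
import torch.nn as nn

def mfcc_frames(wav_path, n_mfcc=40):
    # One Mel-frequency cepstral coefficient vector per frame.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)           # (frames, n_mfcc)

class DVectorNet(nn.Module):
    # Frame-level DNN trained as a speaker classifier; the last hidden layer
    # provides the frame embeddings used to build d-vectors.
    def __init__(self, n_mfcc=40, hidden=256, n_speakers=100):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_mfcc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, frames):
        h = self.hidden(frames)
        return self.classifier(h), h           # speaker logits, frame embeddings

def train_step(model, optimizer, frames, speaker_id):
    # Cross-entropy against integer class indices is equivalent to using the
    # one-hot encoded speaker label as the training reference target.
    logits, _ = model(frames)
    target = torch.full((frames.shape[0],), speaker_id, dtype=torch.long)
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()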
In one embodiment, when the voice to be recognized is input into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform the following steps:
Inputting the Mel-frequency cepstral coefficients of the target object's voice into the trained d-vector voiceprint recognition model;
Acquiring the d-vector features of the target object's voice through the d-vector voiceprint recognition model;
Averaging the d-vector features of the target object's voice to obtain a standard d-vector feature of the target object's voice;
Establishing a voiceprint database according to the standard d-vector feature.
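One way to realize this enrollment step, reusing the mfcc_frames helper and the frame-level network from the training sketch above, is to average utterance-level embeddings per speaker and store the result as that speaker's standard d-vector:

import numpy as np
import torch

def enroll_speaker(model, enrollment_wavs, speaker_name, voiceprint_db):
    # Average per-utterance d-vectors into one standard d-vector for the speaker.
    dvectors = []
    for wav_path in enrollment_wavs:
        frames = mfcc_frames(wav_path)                 # helper from the training sketch
        with torch.no_grad():
            _, frame_emb = model(frames)
        # Utterance-level d-vector: mean of the frame-level embeddings.
        dvectors.append(frame_emb.mean(dim=0).numpy())
    standard_dvector = np.mean(dvectors, axis=0)
    standard_dvector /= np.linalg.norm(standard_dvector)   # length-normalize for cosine scoring
    voiceprint_db[speaker_name] = standard_dvector          # database kept as a plain dict here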
In one embodiment, after the d-vector features of the target object's voice are acquired through the d-vector voiceprint recognition model, the processor is configured to perform the following step:
Performing channel compensation on the d-vector features of the target object's voice by using the WCCN method.
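WCCN (within-class covariance normalization) whitens within-speaker variability so that cosine scoring is less sensitive to channel effects. A numpy sketch is given below; the ridge term and the dictionary layout of the enrollment data are implementation choices of this sketch, not requirements of the text.

import numpy as np

def wccn_projection(dvectors_by_speaker):
    # dvectors_by_speaker: {speaker: array of shape (n_utterances, dim)}.
    # Returns a matrix B such that compensated vectors are B.T @ x.
    dim = next(iter(dvectors_by_speaker.values())).shape[1]
    W = np.zeros((dim, dim))
    for vectors in dvectors_by_speaker.values():
        centered = vectors - vectors.mean(axis=0, keepdims=True)
        W += centered.T @ centered / len(vectors)
    W /= len(dvectors_by_speaker)
    W += 1e-6 * np.eye(dim)            # small ridge for numerical stability (sketch choice)
    # B is the Cholesky factor of the inverse within-class covariance.
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B

# Usage: compensated_dvector = wccn_projection(enrollment_data).T @ dvector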
In one embodiment, before the voiceprint identifier corresponding to the voice is obtained, the processor is configured to perform the following steps:
Inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
Calculating the cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
Determining whether the cosine distance is less than a threshold;
If so, determining that the d-vector feature of the voice to be identified matches the d-vector feature in the voiceprint database.
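Scoring against the voiceprint database then reduces to a nearest-neighbour search under cosine distance. The sketch below returns the matched speaker when the distance falls below the threshold and None otherwise; the 0.4 default threshold is an assumption of this sketch, not a value given in the text.

import numpy as np

def match_voiceprint(dvector, voiceprint_db, threshold=0.4):
    # Compare the d-vector of the voice to be identified against every
    # standard d-vector in the database; accept the closest one only if its
    # cosine distance is below the threshold.
    best_speaker, best_distance = None, float("inf")
    for speaker, standard in voiceprint_db.items():
        cosine_similarity = float(np.dot(dvector, standard) /
                                  (np.linalg.norm(dvector) * np.linalg.norm(standard)))
        distance = 1.0 - cosine_similarity
        if distance < best_distance:
            best_speaker, best_distance = speaker, distance
    return best_speaker if best_distance < threshold else None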
Please refer to FIG. 8, which is a schematic structural diagram of an electronic device for video subtitle synthesis according to an embodiment of the present application.
The electronic device 700 includes a processor 701, a memory 702, a display 703, a radio frequency circuit 704, an audio module 705, and a power supply 706.
The processor 701 is the control center of the electronic device 700. It connects the various parts of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 700 and processes data by running or loading computer programs stored in the memory 702 and invoking data stored in the memory 702, thereby monitoring the electronic device 700 as a whole.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the computer programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, a computer program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 702 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Correspondingly, the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
In the embodiments of the present application, the processor 701 in the electronic device 700 loads instructions corresponding to the processes of one or more computer programs into the memory 702 according to the following steps, and runs the computer programs stored in the memory 702 to implement various functions as follows:
Acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
Inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier including a d-vector feature;
Performing speech recognition on the voice to be recognized to obtain corresponding text information;
Synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
The display 703 may be used to display information input by the user or information provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof. The display 703 may include a display panel. In some embodiments, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The radio frequency circuit 704 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with the network devices or other electronic devices.
The audio module 705 includes dual speakers and an audio circuit. The audio circuit may transmit an electrical signal converted from received audio data to the dual speakers, which convert it into a sound signal for output. Conversely, a microphone converts a collected sound signal into an electrical signal, which the audio circuit receives and converts into audio data; the audio data is then output to the processor 701 for processing and sent, for example, to another terminal through the radio frequency circuit 704, or the audio data is output to the memory 702 for further processing. The audio circuit may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
The power supply 706 may be used to supply power to the various components of the electronic device 700. In some embodiments, the power supply 706 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown in FIG. 8, the electronic device 700 may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
In the embodiments of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for the video subtitle synthesis method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the video subtitle synthesis method of the embodiments of the present application may be completed by controlling relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the video subtitle synthesis method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the video subtitle synthesis apparatus of the embodiments of the present application, its functional modules may be integrated into one processing chip, or each module may exist physically alone, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The video subtitle synthesis method, apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method and core ideas of the present application. At the same time, those skilled in the art may make changes to the specific implementations and the scope of application according to the ideas of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (20)

  1. A video subtitle synthesis method, applied to an electronic device, wherein the method comprises:
    acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
    inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    performing speech recognition on the voice to be recognized to obtain corresponding text information; and
    synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  2. The video subtitle synthesis method according to claim 1, wherein before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the method further comprises:
    acquiring audio track information in the video; and
    deleting the audio tracks that do not require video subtitle synthesis, to obtain the voice information in the audio tracks that require video subtitle synthesis.
  3. The video subtitle synthesis method according to claim 1, wherein acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information comprises:
    determining whether the voice information meets a preset voice information condition; and
    if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
  4. The video subtitle synthesis method according to claim 3, wherein before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the method further comprises:
    extracting Mel-frequency cepstral coefficients of the voice in the video;
    inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    using one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
  5. The video subtitle synthesis method according to claim 4, wherein inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature, comprises:
    inputting the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquiring the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    averaging the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establishing a voiceprint database according to the standard d-vector feature.
  6. The video subtitle synthesis method according to claim 5, wherein after the d-vector features of the voice of the target object are acquired through the d-vector voiceprint recognition model, the method further comprises:
    performing channel compensation on the d-vector features of the voice of the target object by using the WCCN method.
  7. The video subtitle synthesis method according to claim 5, wherein before the voiceprint identifier corresponding to the voice is obtained, the method further comprises:
    inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculating a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determining whether the cosine distance is less than a threshold; and
    if so, determining that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  8. A video subtitle synthesis apparatus, applied to an electronic device, comprising:
    a voice acquisition module, configured to acquire voice information in a video, and obtain a voice to be recognized according to characteristics of the voice information;
    a voiceprint recognition module, configured to input the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    a speech recognition module, configured to perform speech recognition on the voice to be recognized to obtain corresponding text information; and
    a subtitle synthesis module, configured to synthesize the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  9. The video subtitle synthesis apparatus according to claim 8, wherein the apparatus further comprises:
    a training module, configured to extract Mel-frequency cepstral coefficients of the voice in the video;
    input the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    use one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and train the d-vector voiceprint recognition model to completion.
  10. The video subtitle synthesis apparatus according to claim 8, wherein the voice acquisition module comprises:
    an extraction module, configured to acquire audio track information in the video, delete the audio tracks that do not require video subtitle synthesis, and obtain the voice information in the audio tracks that require video subtitle synthesis; and
    a first determination module, configured to determine whether the voice information meets a preset voice information condition, and if so, confirm that the voice information corresponds to a target object and extract the voice to be recognized of the target object.
  11. The video subtitle synthesis apparatus according to claim 8, wherein the voiceprint recognition module comprises:
    a database module, configured to input the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquire the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    average the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establish a voiceprint database according to the standard d-vector feature.
  12. The video subtitle synthesis apparatus according to claim 8, wherein the voiceprint recognition module comprises:
    a second determination module, configured to input the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculate a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determine whether the cosine distance is less than a threshold; and
    if so, determine that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
  13. A storage medium having a computer program stored thereon, wherein, when the computer program is executed on a computer, the computer is caused to perform the method according to any one of claims 1 to 7.
  14. An electronic device for video subtitle synthesis, comprising a processor and a memory, wherein the processor, by invoking a computer program in the memory, is configured to perform:
    acquiring voice information in a video, and obtaining a voice to be recognized according to characteristics of the voice information;
    inputting the voice to be recognized into a d-vector voiceprint recognition model to obtain a voiceprint identifier corresponding to the voice to be recognized, the voiceprint identifier comprising a d-vector feature;
    performing speech recognition on the voice to be recognized to obtain corresponding text information; and
    synthesizing the voiceprint identifier and the text information to generate a subtitle for the voice to be recognized.
  15. The electronic device for video subtitle synthesis according to claim 14, wherein before acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform:
    acquiring audio track information in the video; and
    deleting the audio tracks that do not require video subtitle synthesis, to obtain the voice information in the audio tracks that require video subtitle synthesis.
  16. The electronic device for video subtitle synthesis according to claim 14, wherein, when acquiring the voice information in the video and obtaining the voice to be recognized according to the characteristics of the voice information, the processor is configured to perform:
    determining whether the voice information meets a preset voice information condition; and
    if so, confirming that the voice information corresponds to a target object, and extracting the voice to be recognized of the target object.
  17. The electronic device for video subtitle synthesis according to claim 16, wherein before inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform:
    extracting Mel-frequency cepstral coefficients of the voice in the video;
    inputting the Mel-frequency cepstral coefficients into the d-vector voiceprint recognition model; and
    using one-hot encoded labels as training reference targets of the d-vector voiceprint recognition model, and training the d-vector voiceprint recognition model to completion.
  18. The electronic device for video subtitle synthesis according to claim 17, wherein, when inputting the voice to be recognized into the d-vector voiceprint recognition model to obtain the voiceprint identifier corresponding to the voice to be recognized, the processor is configured to perform:
    inputting the Mel-frequency cepstral coefficients of the voice of the target object into the trained d-vector voiceprint recognition model;
    acquiring the d-vector features of the voice of the target object through the d-vector voiceprint recognition model;
    averaging the d-vector features of the voice of the target object to obtain a standard d-vector feature of the voice of the target object; and
    establishing a voiceprint database according to the standard d-vector feature.
  19. The electronic device for video subtitle synthesis according to claim 18, wherein after the d-vector features of the voice of the target object are acquired through the d-vector voiceprint recognition model, the processor is configured to perform:
    performing channel compensation on the d-vector features of the voice of the target object by using the WCCN method.
  20. The electronic device for video subtitle synthesis according to claim 18, wherein before the voiceprint identifier corresponding to the voice is obtained, the processor is configured to perform:
    inputting the voice to be identified into the d-vector voiceprint recognition model to obtain a d-vector feature of the voice to be identified;
    calculating a cosine distance between the d-vector feature of the voice to be identified and the standard d-vector feature;
    determining whether the cosine distance is less than a threshold; and
    if so, determining that the d-vector feature of the voice to be identified is the d-vector feature in the voiceprint database.
PCT/CN2019/073770 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device WO2020154916A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980076343.7A CN113056908B (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and device, storage medium and electronic equipment
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2020154916A1 true WO2020154916A1 (en) 2020-08-06

Family

ID=71840280

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073770 WO2020154916A1 (en) 2019-01-29 2019-01-29 Video subtitle synthesis method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113056908B (en)
WO (1) WO2020154916A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620310B (en) * 2022-11-30 2023-05-09 杭州网易云音乐科技有限公司 Image recognition method, model training method, medium, device and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123115A (en) * 2014-07-28 2014-10-29 联想(北京)有限公司 Audio information processing method and electronic device
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN107911646A (en) * 2016-09-30 2018-04-13 阿里巴巴集团控股有限公司 The method and device of minutes is shared, is generated in a kind of meeting
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3739883B1 (en) * 2010-05-04 2022-11-16 LG Electronics Inc. Method and apparatus for encoding and decoding a video signal
WO2017048008A1 (en) * 2015-09-17 2017-03-23 엘지전자 주식회사 Inter-prediction method and apparatus in video coding system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
CN104123115A (en) * 2014-07-28 2014-10-29 联想(北京)有限公司 Audio information processing method and electronic device
CN107911646A (en) * 2016-09-30 2018-04-13 阿里巴巴集团控股有限公司 The method and device of minutes is shared, is generated in a kind of meeting
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU, MINGHUI: "Master Thesis", RESEARCH ON TEXT-INDEPENDENT SPEAKER VERIFICATION BASED ON DEEP LEARNING, 31 May 2016 (2016-05-31), CN, pages 1 - 91, XP009522417 *

Also Published As

Publication number Publication date
CN113056908A (en) 2021-06-29
CN113056908B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
WO2020177190A1 (en) Processing method, apparatus and device
CN110853618B (en) Language identification method, model training method, device and equipment
CN107481720B (en) Explicit voiceprint recognition method and device
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
WO2021004481A1 (en) Media files recommending method and device
JP2019212308A (en) Video service providing method and service server using the same
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN103024530A (en) Intelligent television voice response system and method
US10277834B2 (en) Suggestion of visual effects based on detected sound patterns
KR20200027331A (en) Voice synthesis device
WO2019242402A1 (en) Speech recognition model generation method and apparatus, and storage medium and electronic device
KR20190005103A (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
CN109346057A (en) A kind of speech processing system of intelligence toy for children
CN110310642A (en) Method of speech processing, system, client, equipment and storage medium
US20230298628A1 (en) Video editing method and apparatus, computer device, and storage medium
Chakraborty et al. Literature Survey
CN114095782A (en) Video processing method and device, computer equipment and storage medium
CN112581965A (en) Transcription method, device, recording pen and storage medium
WO2020154916A1 (en) Video subtitle synthesis method and apparatus, storage medium, and electronic device
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
CN115798459A (en) Audio processing method and device, storage medium and electronic equipment
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
WO2023206928A1 (en) Speech processing method and apparatus, computer device, and computer-readable storage medium
WO2020154883A1 (en) Speech information processing method and apparatus, and storage medium and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19913085

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19913085

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.09.2021)