WO2020098115A1 - Subtitle adding method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Subtitle adding method and apparatus, electronic device, and computer-readable storage medium (字幕添加方法、装置、电子设备及计算机可读存储介质)

Info

Publication number
WO2020098115A1
WO2020098115A1 (PCT/CN2018/125397)
Authority
WO
WIPO (PCT)
Prior art keywords
information
subtitle
audio information
text information
audio
Prior art date
Application number
PCT/CN2018/125397
Other languages
English (en)
French (fr)
Inventor
都之夏
Original Assignee
北京微播视界科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京微播视界科技有限公司 filed Critical 北京微播视界科技有限公司
Publication of WO2020098115A1 publication Critical patent/WO2020098115A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4314Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for fitting data in a restricted space on the screen, e.g. EPG data in a rectangular grid
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Definitions

  • the present disclosure relates to the technical field of video processing, and in particular to a method, apparatus, electronic device, and computer-readable storage medium for adding subtitles.
  • at present, video subtitle information is added manually: a subtitle editor watches the video, manually records the text corresponding to what is watched, and then adds the recorded text to the video.
  • with the existing manual approach, because the people in the video often speak faster than the subtitle editor can type, the editor has to replay and re-watch the video repeatedly and spends a long time obtaining the text corresponding to the video; moreover, manually added subtitles contain only text, so their form is relatively monotonous. The existing manual approach to adding video subtitle information therefore suffers from low efficiency, high labor cost, and a relatively single subtitle form.
  • a method for adding subtitles includes:
  • a device for adding subtitles includes:
  • the first extraction module is used to extract the audio information from the video file to which subtitles are to be added;
  • the first recognition module is used to perform speech recognition on the audio information extracted by the first extraction module, to obtain text information and voice environment features corresponding to the audio information;
  • the generation module is used to generate corresponding subtitle information according to the text information and voice environment features recognized by the first recognition module;
  • the adding module is used to add the subtitle information generated by the generation module to the video file, so that the video file carries the subtitle information when played.
  • in a third aspect, an electronic device is provided, which includes:
  • one or more processors and a memory,
  • wherein the memory is used to store one or more application programs, and the one or more processors are used to execute the subtitle adding method according to the first aspect by calling the one or more application programs.
  • in a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, enable the computer to execute the subtitle adding method according to the first aspect.
  • the embodiments of the present disclosure perform speech recognition on the audio information to obtain the corresponding text information and voice environment features, so that the text corresponding to the video is acquired automatically; this reduces the time needed to obtain that text and thus improves the efficiency of adding video subtitle information. In addition, corresponding subtitle information is generated from the text information and the voice environment features, i.e., the subtitle display mode can be set according to the voice environment features, which satisfies personalized requirements for subtitle information and makes the video more engaging for viewers.
  • FIG. 1 is a schematic flowchart of a method for adding captions according to an embodiment of the present disclosure
  • FIG. 2 is a schematic structural diagram of an apparatus for adding captions according to an embodiment of the present disclosure
  • FIG. 3 is a schematic structural diagram of another subtitle adding apparatus according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • An embodiment of the present disclosure provides a method for adding subtitles. As shown in FIG. 1, the method may include the following steps:
  • Step S101 Extract audio information in a video file to be added with subtitles
  • the audio information in the video file to which subtitles are to be added is extracted through a corresponding audio extraction technique, such as FFmpeg; the video to which subtitles are to be added may be a recorded TV program video, a teaching course video, a short video, and so on, which is not limited here.
  • the extracted audio information may also be converted into an uncompressed pure waveform file for processing, such as a Windows PCM file, commonly known as a WAV file.
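  • As a hedged illustration of this extraction step, the sketch below calls the ffmpeg command-line tool from Python to pull the audio track into a 16-bit PCM WAV file; the file names, sample rate, and mono down-mix are assumptions, not details given by the patent:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract the audio track of a video into an uncompressed 16-bit PCM WAV file."""
    subprocess.run(
        [
            "ffmpeg", "-y",          # overwrite the output file if it exists
            "-i", video_path,        # input video to which subtitles will be added
            "-vn",                   # drop the video stream, keep audio only
            "-acodec", "pcm_s16le",  # uncompressed 16-bit little-endian PCM
            "-ar", str(sample_rate), # resample to a rate typical for speech recognition
            "-ac", "1",              # mono channel
            wav_path,
        ],
        check=True,
    )

# Example (hypothetical file names):
# extract_audio("lecture.mp4", "lecture.wav")
```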
  • Step S102 Perform voice recognition on the audio information to obtain text information and voice environment features corresponding to the audio information;
  • before speech recognition, the audio information may be pre-processed, for example by enhancing the speech through removal of noise and channel distortion, and by trimming the silence at the beginning and end using VAD (Voice Activity Detection).
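  • One hedged illustration of such pre-processing is the sketch below, which trims leading and trailing silence with a simple per-frame energy threshold; a production system would more likely use a dedicated VAD (e.g. WebRTC VAD), and the frame length and threshold here are assumptions:

```python
import numpy as np

def trim_silence(samples: np.ndarray, sample_rate: int,
                 frame_ms: int = 30, energy_threshold: float = 1e-4) -> np.ndarray:
    """Remove silent frames from the beginning and end of a mono float signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Mean energy of each fixed-length frame.
    energies = [
        float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ]
    voiced = [i for i, e in enumerate(energies) if e > energy_threshold]
    if not voiced:
        return samples  # nothing detected as speech; leave the signal untouched
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end]
```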
  • Step S103 generating corresponding subtitle information according to the obtained text information and voice environment characteristics
  • different audio information corresponds to different voice environment features; based on the obtained voice environment features, the obtained text information is processed accordingly to generate subtitle information corresponding to those features.
  • Step S104 Add subtitle information to the video file, so that the video file carries subtitle information during playback.
  • the subtitle information is added to the video file so that the video file carries the subtitle information during playback; the subtitle information may be embedded in the video file or may exist in the form of external subtitles, and the format of the external file containing the subtitle information can be srt, smi, ssa, and so on.
  • the external subtitle file may be obtained by performing playback control processing based on the subtitle information and the time information of the corresponding video, and the corresponding playback control processing is used to enable the subtitle information and the video to be played synchronously.
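  • A minimal sketch of producing such an external srt file from recognized segments is shown below; the segment structure (start time, end time, and text per segment) is an assumption about how the recognition output might be organized, not something the patent prescribes:

```python
from datetime import timedelta

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by srt files."""
    total_ms = int(timedelta(seconds=seconds).total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, srt_path: str) -> None:
    """segments: iterable of (start_seconds, end_seconds, text) tuples."""
    with open(srt_path, "w", encoding="utf-8") as f:
        for idx, (start, end, text) in enumerate(segments, start=1):
            f.write(f"{idx}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n\n")

# write_srt([(0.0, 2.5, "Hello everyone"), (2.5, 5.0, "Welcome to the course")], "out.srt")
```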
  • the embodiments of the present disclosure perform speech recognition on the audio information to obtain the corresponding text information and voice environment features, so that the text corresponding to the video is acquired automatically; this reduces the time needed to obtain that text and thus improves the efficiency of adding video subtitle information. In addition, corresponding subtitle information is generated according to the text information and the voice environment features, i.e., the subtitle display mode can be set according to the voice environment features, which satisfies the personalized demand for subtitles and thereby enhances video viewers' interest.
  • An embodiment of the present disclosure provides a possible implementation manner, in which voice recognition of audio information in step S102 to obtain text information corresponding to the audio information includes:
  • Step S1021 (not shown in the figure), perform speech recognition on the audio information based on the pre-trained speech recognition model to obtain text information corresponding to the audio information.
  • the speech recognition model is trained in advance on multiple audio samples and their corresponding text information, and speech recognition is then performed on the audio information by the pre-trained speech recognition model to obtain the text information corresponding to the audio information.
  • the pre-trained speech recognition model may be a speech recognition model based on an RNN (Recurrent Neural Network) or on an LSTM (Long Short-Term Memory) network; an LSTM-based speech recognition model handles the long-range dependencies that arise in speech recognition well.
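  • The patent does not specify a concrete architecture; as one hedged illustration, the PyTorch sketch below shows a toy bidirectional-LSTM acoustic model that maps acoustic feature frames to per-frame character logits. The layer sizes, feature dimension, and the choice of CTC-style training are assumptions:

```python
import torch
import torch.nn as nn

class LSTMSpeechRecognizer(nn.Module):
    """Toy LSTM acoustic model: acoustic feature frames -> per-frame character logits."""
    def __init__(self, n_features: int = 13, n_hidden: int = 256, n_chars: int = 29):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=n_hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * n_hidden, n_chars)  # 2x for bidirectional

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, n_features), e.g. MFCC frames
        outputs, _ = self.lstm(features)
        return self.classifier(outputs)  # (batch, time, n_chars) frame-level logits

# A model like this is typically trained with a CTC loss (nn.CTCLoss) on
# (audio sample, transcript) pairs and decoded into text at inference time.
```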
  • the text information corresponding to the audio information is obtained through the pre-trained speech recognition model, which solves the problem of automatically acquiring that text and saves the labor cost and time cost of manually converting the audio information into text, providing the precondition for quickly adding subtitle information afterwards.
  • An embodiment of the present disclosure provides a possible implementation manner, wherein the voice recognition of the audio information in step S102 to obtain the voice environment characteristics corresponding to the audio information includes:
  • step S1022 acoustic feature extraction is performed on the audio information to obtain a voice environment feature corresponding to the audio information.
  • the acoustic features in the audio information are extracted through a corresponding acoustic feature extraction technique; the acoustic features may be any of PLP (Perceptual Linear Prediction) features, LPCC (Linear Prediction Cepstrum Coefficient) features, and MFCC (Mel-scale Frequency Cepstral Coefficients) features, and the extracted acoustic features are then analyzed to obtain the voice environment features corresponding to the audio information, for example by recognizing the extracted acoustic features with a pre-trained voice-environment-feature recognition model.
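  • For illustration, a minimal sketch of extracting MFCC features with the librosa library; the sample rate and number of coefficients are assumptions, and the downstream classifier is only hinted at:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames, n_mfcc) matrix of MFCC features for one audio file."""
    samples, sample_rate = librosa.load(wav_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so that rows are time frames

# These frame-level features could then be fed to a pre-trained
# voice-environment-feature classifier (e.g. for speech rate or intensity).
```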
  • the speech environment features corresponding to the audio information are obtained, thereby solving the problem of acquiring the speech environment features.
  • the voice environment features include but are not limited to at least one of the following: intonation (such as rising, falling, rising-falling, falling-rising, and level tones); speech rate (such as fast or slow); rhythm (such as gentle, high-pitched, deep, or solemn); and speech intensity (such as stressed or lightly read).
  • step S103 may include the following steps:
  • Step S1031 (not shown in the figure), according to the characteristics of the voice environment, determine the caption display configuration information matching the characteristics of the voice environment;
  • Step S1032 (not shown in the figure) generates caption information corresponding to the text information according to the caption display configuration information.
  • different voice environment features correspond to different subtitle display configuration information (for example, fast and slow speech rates may each be given their own subtitle display configuration); a correspondence list between voice environment features and subtitle display configuration information may be preset, the matching subtitle display configuration information is determined from this list according to the obtained voice environment features, and the obtained text information is then processed according to that configuration to obtain the subtitle information.
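  • A minimal sketch of such a preset correspondence list, assuming the speech-rate feature has already been bucketed into categories; the category names and style values are hypothetical:

```python
# Hypothetical preset mapping from a voice environment feature category to
# subtitle display configuration information.
SUBTITLE_CONFIG_BY_SPEECH_RATE = {
    "fast": {"font_size": 28, "color": "#FFD700", "effect": "none",    "position": "bottom"},
    "slow": {"font_size": 36, "color": "#FFFFFF", "effect": "fade_in", "position": "bottom"},
}

def build_subtitle(text: str, speech_rate: str) -> dict:
    """Attach the matching display configuration to a piece of recognized text."""
    config = SUBTITLE_CONFIG_BY_SPEECH_RATE.get(speech_rate, SUBTITLE_CONFIG_BY_SPEECH_RATE["slow"])
    return {"text": text, **config}

# build_subtitle("Welcome back", "fast")
# -> {'text': 'Welcome back', 'font_size': 28, 'color': '#FFD700', ...}
```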
  • the matching subtitle display configuration information is determined according to the obtained voice environment features, and the subtitle information corresponding to the text information is then generated according to the subtitle display configuration information, thereby solving the problem of how to determine the subtitle information for different voice environment features.
  • step S103 may include the following steps:
  • Step 1033 based on the text information and the voice environment features, determine the emotional feature type and / or tone type corresponding to the audio information;
  • the emotional feature type and/or tone type corresponding to the audio information is determined from the content of the text information and the voice environment features, where the emotional feature type may include but is not limited to at least one of happiness, sadness, anger, fury, and so on, and the tone type may include but is not limited to at least one of declarative, interrogative, imperative, and exclamatory.
  • for example, from the sentence "I am very angry about this" in the text information corresponding to the audio information and the corresponding speech intensity (a voice environment feature), the emotional feature type corresponding to the audio information is determined to be anger; from "I am really so happy today" in the text information and the corresponding voice environment features such as speech intensity and rhythm, the tone type corresponding to the audio information is determined to be exclamatory.
  • Step 1034 (not shown in the figure), according to the emotional feature type and / or tone type, determine subtitle display configuration information matching the emotional feature type and / or tone type;
  • for different emotional feature types and/or tone types, different subtitle display configuration information is set; a correspondence list between emotional feature types and/or tone types and subtitle display configuration information may be set in advance, and the matching subtitle display configuration information is determined from this list according to the obtained emotional feature type and/or tone type.
  • Step 1035 (not shown in the figure) generates caption information corresponding to the text information according to the caption display configuration information.
  • the text information can be processed correspondingly to obtain the corresponding subtitle information.
  • the emotional feature type and/or tone type corresponding to the audio information is determined based on the text information and the voice environment features, the matching subtitle display configuration information is then determined according to the obtained emotional feature type and/or tone type, and the subtitle information corresponding to the text information is generated based on that configuration information, thereby solving the problem of how to determine the subtitle information for different emotional feature types and/or tone types.
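  • As a very rough illustration of this step (not the patent's actual classifier), the sketch below stands in an emotion decision built from the text content plus a normalized speech-intensity feature, and then looks up a preset configuration; every keyword, threshold, and style value here is hypothetical:

```python
# Hypothetical illustration: derive an emotion type from the text and one
# voice environment feature, then look up a preset display configuration.
EMOTION_KEYWORDS = {"angry": "anger", "furious": "anger", "happy": "happiness", "sad": "sadness"}

SUBTITLE_CONFIG_BY_EMOTION = {
    "anger":     {"color": "#FF3B30", "font_weight": "bold",   "effect": "shake"},
    "happiness": {"color": "#FFCC00", "font_weight": "normal", "effect": "bounce"},
    "sadness":   {"color": "#8E8E93", "font_weight": "normal", "effect": "fade_in"},
    "neutral":   {"color": "#FFFFFF", "font_weight": "normal", "effect": "none"},
}

def classify_emotion(text: str, intensity: float) -> str:
    """Very rough stand-in for an emotion classifier over text + speech intensity (0..1)."""
    for keyword, emotion in EMOTION_KEYWORDS.items():
        if keyword in text.lower() and intensity > 0.6:
            return emotion
    return "neutral"

def style_for(text: str, intensity: float) -> dict:
    return {"text": text, **SUBTITLE_CONFIG_BY_EMOTION[classify_emotion(text, intensity)]}
```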
  • the subtitle display configuration information includes but is not limited to at least one of the following:
  • subtitle text attribute information; subtitle special effect information; subtitle display position.
  • the subtitle display configuration information includes but is not limited to at least one of the following: attribute information of the subtitle text (such as its font, color, size, and weight); subtitle special effect information (such as fade-in/fade-out or flashing effects); and the subtitle display position (such as near the bottom of the video, or centered).
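  • One way to group the three categories of configuration just listed is a small record type; the field names and default values below are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class SubtitleDisplayConfig:
    # Subtitle text attribute information
    font_family: str = "sans-serif"
    font_size: int = 32
    color: str = "#FFFFFF"
    bold: bool = False
    # Subtitle special effect information
    effect: str = "none"       # e.g. "fade_in_out", "blink"
    # Subtitle display position
    position: str = "bottom"   # e.g. "bottom", "center"
```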
  • setting different caption display configuration information improves the personalization of caption information display, thereby enhancing the interest of video viewers.
  • An embodiment of the present disclosure provides another possible implementation manner, and the method further includes,
  • Step S105 extracting image frames of the video file
  • Step S106 the image frame is recognized by image recognition technology, so as to obtain the human body part information of the corresponding person in the image frame;
  • step S107 the caption display position of the caption information is adjusted based on the body part information.
  • the image frames extracted from the video file can be recognized by image recognition technology to obtain the body part information of the corresponding person in the image frames, and the subtitle display position of the subtitle information can then be adjusted based on the obtained body part information; for example, the position of the corresponding person's head is determined through image recognition, and the subtitle display position of the subtitle information is adjusted according to the position of the head.
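  • The patent does not name a specific recognition technique; as one hedged sketch, the code below uses OpenCV's bundled Haar-cascade face detector as a stand-in for head detection and moves the subtitle baseline when the face would overlap it. The margins and the "largest face" heuristic are assumptions:

```python
import cv2

# Haar cascade face detector shipped with OpenCV; the detected face is used here
# as a stand-in for the "head position" mentioned above.
_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def subtitle_position_for_frame(frame):
    """Return an (x, y) baseline for the subtitle that avoids the detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    height, width = frame.shape[:2]
    default_y = int(height * 0.9)  # near the bottom of the frame
    if len(faces) == 0:
        return width // 2, default_y
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detected face
    # If the face sits low in the frame, lift the subtitle above it instead.
    if y + h > default_y - 40:
        return width // 2, max(y - 20, 40)
    return width // 2, default_y
```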
  • the body part information of the corresponding person in the video is identified through image recognition, and the subtitle display position of the subtitle information is then adjusted, realizing the associated display of the subtitle information with the body parts of the person in the video and making the subtitle display more personalized.
  • the device 20 includes: a first extraction module 201, a first recognition module 202, a generation module 203, and an addition module 204, wherein,
  • the first extraction module 201 is used to extract audio information in a video file to be added with subtitles
  • the first recognition module 202 is used to perform speech recognition on the audio information extracted by the first extraction module 201 to obtain text information and speech environment features corresponding to the audio information;
  • the generating module 203 is configured to generate corresponding subtitle information according to the text information and the characteristics of the voice environment recognized by the first recognition module 202;
  • the adding module 204 is used to add the subtitle information generated by the generating module 203 to the video file, so that the video file carries the subtitle information when playing.
  • the device for adding subtitles in this embodiment can execute a method for adding subtitles provided in the above-mentioned embodiments of the present disclosure, and the implementation principles are similar, and are not repeated here.
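  • As a compact sketch of how the four modules could be wired together end to end (with the module internals replaced by callables such as the hypothetical helpers sketched earlier), under the assumption that each module is a plain function or callable object:

```python
class SubtitleAdder:
    """Hypothetical wiring of the four modules described above."""

    def __init__(self, extractor, recognizer, generator, adder):
        self.extractor = extractor    # first extraction module
        self.recognizer = recognizer  # first recognition module
        self.generator = generator    # generation module
        self.adder = adder            # adding module

    def run(self, video_path: str) -> str:
        audio = self.extractor(video_path)
        text_segments, voice_env = self.recognizer(audio)
        subtitles = self.generator(text_segments, voice_env)
        return self.adder(video_path, subtitles)  # path of the video carrying subtitles
```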
  • An embodiment of the present disclosure provides another apparatus for adding subtitles.
  • the apparatus 30 includes: a first extraction module 301, a first recognition module 302, a generation module 303, and an addition module 304, wherein,
  • the first extraction module 301 is used to extract audio information in a video file to be added with subtitles
  • the functions of the first extraction module 301 in FIG. 3 and the first extraction module 201 in FIG. 2 are the same or similar.
  • the first recognition module 302 is used to perform speech recognition on the audio information extracted by the first extraction module 301 to obtain text information and speech environment features corresponding to the audio information;
  • the functions of the first identification module 302 in FIG. 3 and the first identification module 202 in FIG. 2 are the same or similar.
  • the generating module 303 is configured to generate corresponding subtitle information according to the text information and the characteristics of the voice environment recognized by the first recognition module 302;
  • the functions of the generating module 303 in FIG. 3 and the generating module 203 in FIG. 2 are the same or similar.
  • the adding module 304 is used to add the subtitle information generated by the generating module 303 to the video file, so that the video file carries the subtitle information when playing.
  • the functions of the adding module 304 in FIG. 3 and the adding module 204 in FIG. 2 are the same or similar.
  • An embodiment of the present disclosure provides a possible implementation manner, specifically,
  • the first recognition module 302 is used to perform speech recognition on audio information based on a pre-trained speech recognition model to obtain text information corresponding to the audio information.
  • the text information corresponding to the audio information is obtained through the pre-trained speech recognition model, which solves the problem of automatically acquiring that text and saves the labor cost and time cost of manually converting the audio information into text, providing the precondition for quickly adding subtitle information afterwards.
  • the first recognition module 302 is configured to perform acoustic feature extraction on audio information to obtain a voice environment feature corresponding to the audio information.
  • the speech environment features corresponding to the audio information are obtained, thereby solving the problem of acquiring the speech environment features.
  • the voice environment features include at least one of the following:
  • the generation module 303 includes a first determination unit 3031 and a first generation unit 3032;
  • the first determining unit 3031 is configured to determine subtitle display configuration information matching the voice environment characteristics according to the voice environment characteristics;
  • the first generating unit 3032 is configured to generate subtitle information corresponding to the text information according to the subtitle display configuration information determined by the first determining unit 3031.
  • the matching subtitle display configuration information is determined according to the obtained voice environment characteristics, and then the subtitle information corresponding to the text information is generated according to the subtitle display configuration information, which solves the problem of how to determine the subtitle information according to the different voice environment characteristics .
  • the generation module 303 includes a second determination unit 3033, a third determination unit 3034, and a second generation unit 3035;
  • the second determining unit 3033 is configured to determine the emotional feature type and / or tone type corresponding to the audio information based on the text information and the voice environment features;
  • the third determining unit 3034 is configured to determine subtitle display configuration information matching the emotional feature type and / or tone type according to the emotional feature type and / or tone type determined by the second determining unit 3033;
  • the second generating unit 3035 is configured to generate subtitle information corresponding to the text information according to the subtitle display configuration information determined by the third determining unit 3034.
  • the emotional feature type and / or tone type corresponding to the audio information is determined based on the text information and the voice environment features, and then the matching caption display configuration information is determined according to the obtained emotional feature type and / or tone type, and then based on The caption display configuration information generates caption information corresponding to the text information, which solves the problem of how to determine the caption information according to the different types of emotional features and / or tone types.
  • the subtitle display configuration information includes at least one of the following:
  • subtitle text attribute information; subtitle special effect information; subtitle display position.
  • setting different caption display configuration information improves the personalization of caption information display, thereby enhancing the interest of video viewers.
  • the device 30 further includes a second extraction module 305, a second identification module 306, and an adjustment module 307;
  • the second extraction module 305 is used to extract image frames of the video file
  • the second recognition module 306 is used to recognize the image frame extracted by the second extraction module 305 through image recognition technology to obtain the body part information of the corresponding person in the image frame;
  • the adjustment module 307 is used to adjust the display position of the subtitle information based on the body part information recognized by the second recognition module 306.
  • the body part information of the corresponding person in the video is identified through image recognition, and the subtitle display position of the subtitle information is then adjusted, realizing the associated display of the subtitle information with the body parts of the person in the video and making the subtitle display more personalized.
  • An embodiment of the present disclosure provides a device for adding captions, which is suitable for the method shown in the above embodiments, and details are not described herein again.
  • FIG. 4 shows a schematic structural diagram of an electronic device (eg, terminal device or server) 40 suitable for implementing the embodiment of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals ( For example, mobile terminals such as car navigation terminals) and fixed terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 4 is only an example, and should not bring any limitation to the functions and use scope of the embodiments of the present disclosure.
  • the electronic device 40 may include a processing device (such as a central processing unit or a graphics processor) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • in the RAM 403, various programs and data necessary for the operation of the electronic device 40 are also stored.
  • the processing device 401, ROM 402, and RAM 403 are connected to each other via a bus 404.
  • An input / output (I / O) interface 405 is also connected to the bus 404.
  • the following devices can be connected to the I/O interface 405: input devices 406 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; output devices 407 such as a liquid crystal display (LCD), speakers, or vibrators; storage devices 408 such as a magnetic tape or hard disk; and a communication device 409.
  • the communication device 409 may allow the electronic device 40 to perform wireless or wired communication with other devices to exchange data.
  • although FIG. 4 shows an electronic device 40 having various devices, it should be understood that it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
  • An embodiment of the present disclosure provides an electronic device, which is suitable for the method shown in the above embodiments; details are not repeated here.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication device 409, or from the storage device 408, or from the ROM 402.
  • when the computer program is executed by the processing device 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal that is propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electric wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the computer-readable medium may be included in the above-mentioned electronic device; or it may exist alone without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire at least two Internet Protocol addresses; send to a node evaluation device a node evaluation request including the at least two Internet Protocol addresses, wherein the node evaluation device selects an Internet Protocol address from the at least two Internet Protocol addresses and returns it; and receive the Internet Protocol address returned by the node evaluation device, wherein the acquired Internet Protocol address indicates an edge node in a content distribution network.
  • alternatively, the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive a node evaluation request including at least two Internet Protocol addresses; select an Internet Protocol address from the at least two Internet Protocol addresses; and return the selected Internet Protocol address, wherein the received Internet Protocol address indicates an edge node in a content distribution network.
  • the computer program code for performing the operations of the present disclosure can be written in one or more programming languages or a combination thereof.
  • the above programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, connected through the Internet using an Internet service provider).
  • the embodiments of the present disclosure provide a computer-readable storage medium, which is suitable for the method shown in the above embodiments; details are not repeated here.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logic functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks represented in succession may actually be executed in parallel, and they may sometimes be executed in reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units described in the embodiments of the present disclosure may be implemented in software or hardware. Among them, the name of the unit does not constitute a limitation on the unit itself under certain circumstances.

Abstract

Embodiments of the present disclosure provide a subtitle adding method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: extracting the audio information from a video file to which subtitles are to be added, and performing speech recognition on the audio information to obtain the text information and voice environment features corresponding to the audio information; then generating corresponding subtitle information according to the obtained text information and voice environment features; and then adding the subtitle information to the video file, so that the video file carries the subtitle information when played. The present disclosure achieves automatic acquisition of the text corresponding to a video, reducing the time needed to obtain that text and thus improving the efficiency of adding video subtitle information; in addition, corresponding subtitle information is generated from the obtained text information and voice environment features, i.e., the subtitle display mode can be set according to the voice environment features, satisfying personalized requirements for subtitles.

Description

Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 201811367918.4, filed with the China National Intellectual Property Administration on November 16, 2018, which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the technical field of video processing, and in particular to a subtitle adding method and apparatus, an electronic device, and a computer-readable storage medium.
BACKGROUND
With the maturing of video shooting technology, different types of video, such as TV entertainment program videos, teaching course videos, and short videos, have become an important medium for conveying information because of the intuitive and rich content they carry. Video producers usually add synchronized subtitle information to a video so that viewers can better understand and grasp the information it conveys.
At present, video subtitle information is added manually: a subtitle editor watches the video, manually records the text corresponding to what is being watched, and then adds the recorded text to the video. However, with this manual approach, because the people in the video often speak faster than the subtitle editor can type, the editor has to replay and re-watch the video repeatedly and spends a long time obtaining the text corresponding to the video; moreover, manually added subtitles contain only text and take a relatively single form. The existing manual approach to adding video subtitle information therefore suffers from low efficiency, high labor cost, and a relatively monotonous subtitle form.
SUMMARY
In a first aspect, a subtitle adding method is provided, the method including:
extracting the audio information from a video file to which subtitles are to be added;
performing speech recognition on the audio information to obtain text information and voice environment features corresponding to the audio information;
generating corresponding subtitle information according to the obtained text information and voice environment features;
adding the subtitle information to the video file, so that the video file carries the subtitle information when played.
In a second aspect, a subtitle adding apparatus is provided, the apparatus including:
a first extraction module configured to extract the audio information from a video file to which subtitles are to be added;
a first recognition module configured to perform speech recognition on the audio information extracted by the first extraction module, to obtain text information and voice environment features corresponding to the audio information;
a generation module configured to generate corresponding subtitle information according to the text information and voice environment features recognized by the first recognition module;
an adding module configured to add the subtitle information generated by the generation module to the video file, so that the video file carries the subtitle information when played.
In a third aspect, an electronic device is provided, the electronic device including:
one or more processors and a memory,
wherein the memory is configured to store one or more application programs, and the one or more processors are configured to execute the subtitle adding method according to the first aspect by calling the one or more application programs.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, enable the computer to execute the subtitle adding method according to the first aspect.
In the embodiments of the present disclosure, speech recognition is performed on the audio information to obtain the corresponding text information and voice environment features, so that the text corresponding to the video is acquired automatically; this reduces the time needed to obtain that text and thus improves the efficiency of adding video subtitle information. In addition, corresponding subtitle information is generated from the obtained text information and voice environment features, i.e., the subtitle display mode can be set according to the voice environment features, which satisfies personalized requirements for subtitle information and thereby enhances viewers' interest in the video.
Additional aspects and advantages of the present disclosure will be set forth in part in the following description, and will become apparent from the description or be learned through practice of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a subtitle adding method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a subtitle adding apparatus according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of another subtitle adding apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the present disclosure and should not be construed as limiting it.
Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms "a", "an", and "the" used herein may also include the plural forms. It should be further understood that the word "comprise" used in the specification of the present disclosure refers to the presence of features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "and/or" used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.
To make the objectives, technical solutions, and advantages of the present disclosure clearer, embodiments of the present disclosure are further described in detail below in conjunction with the accompanying drawings.
The technical solutions of the present disclosure and how they solve the above technical problems are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present disclosure are described below with reference to the accompanying drawings.
An embodiment of the present disclosure provides a subtitle adding method. As shown in FIG. 1, the method may include the following steps:
Step S101: extract the audio information from a video file to which subtitles are to be added;
In this embodiment of the present disclosure, the audio information in the video file to which subtitles are to be added is extracted through a corresponding audio extraction technique, such as FFmpeg. The video to which subtitles are to be added may be a recorded TV program video, a teaching course video, a short video, and so on, which is not limited here.
The extracted audio information may also be converted into an uncompressed pure waveform file for processing, for example a Windows PCM file, commonly known as a WAV file.
Step S102: perform speech recognition on the audio information to obtain text information and voice environment features corresponding to the audio information;
In this embodiment of the present disclosure, speech recognition is performed on the extracted audio information through a corresponding speech recognition technique to obtain the text information and voice environment features corresponding to the audio information. Before speech recognition, the audio information may be pre-processed, for example by enhancing the speech through removal of noise and channel distortion, and by trimming the silence at the beginning and end using VAD (Voice Activity Detection).
Step S103: generate corresponding subtitle information according to the obtained text information and voice environment features;
In this embodiment of the present disclosure, different audio information corresponds to different voice environment features; based on the obtained voice environment features, the obtained text information is processed accordingly to generate subtitle information corresponding to those features.
Step S104: add the subtitle information to the video file, so that the video file carries the subtitle information when played.
In this embodiment of the present disclosure, the subtitle information is added to the video file so that the video file carries the subtitle information during playback. The subtitle information may be embedded in the video file or may exist in the form of external subtitles, and the format of the external file containing the subtitle information may be srt, smi, ssa, and so on.
The external subtitle file may be obtained by performing playback control processing based on the subtitle information and the time information of the corresponding video, where the playback control processing is used to enable the subtitle information and the video to be played synchronously.
In the embodiments of the present disclosure, speech recognition is performed on the audio information to obtain the corresponding text information and voice environment features, so that the text corresponding to the video is acquired automatically; this reduces the time needed to obtain that text and thus improves the efficiency of adding video subtitle information. In addition, corresponding subtitle information is generated from the obtained text information and voice environment features, i.e., the subtitle display mode can be set according to the voice environment features, which satisfies the personalized demand for subtitles and thereby enhances viewers' interest in the video.
An embodiment of the present disclosure provides a possible implementation, in which performing speech recognition on the audio information in step S102 to obtain the text information corresponding to the audio information includes:
Step S1021 (not shown in the figure): perform speech recognition on the audio information based on a pre-trained speech recognition model to obtain the text information corresponding to the audio information.
In this embodiment of the present disclosure, a speech recognition model is trained in advance on multiple audio samples and their corresponding text, and speech recognition is then performed on the audio information by the pre-trained speech recognition model to obtain the text information corresponding to the audio information. The pre-trained speech recognition model may be a speech recognition model based on an RNN (Recurrent Neural Network) or on an LSTM (Long Short-Term Memory) network; an LSTM-based speech recognition model handles the long-range dependencies that arise in speech recognition well.
In this embodiment of the present disclosure, the text information corresponding to the audio information is obtained through the pre-trained speech recognition model, which solves the problem of automatically acquiring that text and saves the labor cost and time cost of manually converting the audio information into text, providing the precondition for quickly adding subtitle information afterwards.
An embodiment of the present disclosure provides a possible implementation, in which performing speech recognition on the audio information in step S102 to obtain the voice environment features corresponding to the audio information includes:
Step S1022 (not shown in the figure): perform acoustic feature extraction on the audio information to obtain the voice environment features corresponding to the audio information.
In this embodiment of the present disclosure, the acoustic features in the audio information are extracted through a corresponding acoustic feature extraction technique. The acoustic features may be any of PLP (Perceptual Linear Prediction) features, LPCC (Linear Prediction Cepstrum Coefficient) features, and MFCC (Mel-scale Frequency Cepstral Coefficients) features. The extracted acoustic features are then analyzed to obtain the voice environment features corresponding to the audio information, for example by recognizing the extracted acoustic features with a pre-trained voice-environment-feature recognition model.
In this embodiment of the present disclosure, the voice environment features corresponding to the audio information are obtained by extracting the acoustic features of the audio information, which solves the problem of acquiring the voice environment features.
The voice environment features include but are not limited to at least one of the following:
intonation; speech rate; rhythm; speech intensity.
In this embodiment of the present disclosure, the voice environment features include but are not limited to at least one of intonation (such as rising, falling, rising-falling, falling-rising, and level tones), speech rate (such as fast or slow), rhythm (such as gentle, high-pitched, deep, or solemn), and speech intensity (such as stressed or lightly read).
In this embodiment of the present disclosure, different voice environment features can be set and acquired according to different application requirements.
An embodiment of the present disclosure provides a possible implementation, in which step S103 may include the following steps:
Step S1031 (not shown in the figure): determine, according to the voice environment features, the subtitle display configuration information matching the voice environment features;
Step S1032 (not shown in the figure): generate, according to the subtitle display configuration information, the subtitle information corresponding to the text information.
In this embodiment of the present disclosure, different voice environment features correspond to different subtitle display configuration information (for example, fast and slow speech rates may each be given their own subtitle display configuration). A correspondence list between voice environment features and subtitle display configuration information may be preset; the matching subtitle display configuration information is determined from this list according to the obtained voice environment features, and the obtained text information is then processed according to that configuration to obtain the subtitle information.
In this embodiment of the present disclosure, the matching subtitle display configuration information is determined according to the obtained voice environment features, and the subtitle information corresponding to the text information is then generated according to that configuration information, which solves the problem of how to determine the subtitle information for different voice environment features.
An embodiment of the present disclosure provides a possible implementation, in which step S103 may include the following steps:
Step 1033 (not shown in the figure): determine, based on the text information and the voice environment features, the emotional feature type and/or tone type corresponding to the audio information;
In this embodiment of the present disclosure, the emotional feature type and/or tone type corresponding to the audio information is determined from the content of the text information and the voice environment features. The emotional feature type may include but is not limited to at least one of happiness, sadness, anger, fury, and so on, and the tone type may include but is not limited to at least one of declarative, interrogative, imperative, and exclamatory.
For example, from the sentence "I am very angry about this" in the text information corresponding to the audio information and the corresponding speech intensity (a voice environment feature), the emotional feature type corresponding to the audio information is determined to be anger; from the sentence "I am really so happy today" in the text information and the corresponding voice environment features such as speech intensity and rhythm, the tone type corresponding to the audio information is determined to be exclamatory.
Step 1034 (not shown in the figure): determine, according to the emotional feature type and/or tone type, the subtitle display configuration information matching the emotional feature type and/or tone type;
In this embodiment of the present disclosure, different subtitle display configuration information is set for different emotional feature types and/or tone types. A correspondence list between emotional feature types and/or tone types and subtitle display configuration information may be set in advance, and the matching subtitle display configuration information is determined from this list according to the obtained emotional feature type and/or tone type.
Step 1035 (not shown in the figure): generate, according to the subtitle display configuration information, the subtitle information corresponding to the text information.
In this embodiment of the present disclosure, according to the subtitle display configuration information, the text information can be processed accordingly to obtain the corresponding subtitle information.
In this embodiment of the present disclosure, the emotional feature type and/or tone type corresponding to the audio information is determined based on the text information and the voice environment features, the matching subtitle display configuration information is then determined according to the obtained emotional feature type and/or tone type, and the subtitle information corresponding to the text information is generated according to that configuration information, which solves the problem of how to determine the subtitle information for different emotional feature types and/or tone types.
The subtitle display configuration information includes but is not limited to at least one of the following:
subtitle text attribute information; subtitle special effect information; subtitle display position.
In this embodiment of the present disclosure, the subtitle display configuration information includes but is not limited to at least one of the following: attribute information of the subtitle text (such as its font, color, size, and weight); subtitle special effect information (such as fade-in/fade-out or flashing effects); and the subtitle display position (such as near the bottom of the video, or centered).
In this embodiment of the present disclosure, setting different subtitle display configuration information makes the subtitle display more personalized and thereby enhances viewers' interest in the video.
An embodiment of the present disclosure provides another possible implementation, in which the method further includes:
Step S105 (not shown in the figure): extract image frames of the video file;
Step S106 (not shown in the figure): recognize the image frames through an image recognition technique to obtain body part information of the corresponding person in the image frames;
Step S107 (not shown in the figure): adjust the subtitle display position of the subtitle information based on the body part information.
In this embodiment of the present disclosure, the image frames extracted from the video file can be recognized through an image recognition technique to obtain body part information of the corresponding person in the frames, and the subtitle display position of the subtitle information is then adjusted based on the obtained body part information; for example, the position of the corresponding person's head is determined through image recognition, and the subtitle display position of the subtitle information is adjusted according to the position of the head.
In this embodiment of the present disclosure, the body part information of the corresponding person in the video is determined through image recognition and the subtitle display position of the subtitle information is then adjusted, realizing the associated display of the subtitle information with the body parts of the person in the video and making the subtitle display more personalized.
FIG. 2 shows a subtitle adding apparatus according to an embodiment of the present disclosure. The apparatus 20 includes a first extraction module 201, a first recognition module 202, a generation module 203, and an adding module 204, wherein:
the first extraction module 201 is configured to extract the audio information from a video file to which subtitles are to be added;
the first recognition module 202 is configured to perform speech recognition on the audio information extracted by the first extraction module 201, to obtain text information and voice environment features corresponding to the audio information;
the generation module 203 is configured to generate corresponding subtitle information according to the text information and voice environment features recognized by the first recognition module 202;
the adding module 204 is configured to add the subtitle information generated by the generation module 203 to the video file, so that the video file carries the subtitle information when played.
The subtitle adding apparatus of this embodiment can execute the subtitle adding method provided in the above embodiments of the present disclosure; the implementation principles are similar and are not repeated here.
An embodiment of the present disclosure provides another subtitle adding apparatus. The apparatus 30 includes a first extraction module 301, a first recognition module 302, a generation module 303, and an adding module 304, wherein:
the first extraction module 301 is configured to extract the audio information from a video file to which subtitles are to be added;
the first extraction module 301 in FIG. 3 has the same or similar functions as the first extraction module 201 in FIG. 2.
the first recognition module 302 is configured to perform speech recognition on the audio information extracted by the first extraction module 301, to obtain text information and voice environment features corresponding to the audio information;
the first recognition module 302 in FIG. 3 has the same or similar functions as the first recognition module 202 in FIG. 2.
the generation module 303 is configured to generate corresponding subtitle information according to the text information and voice environment features recognized by the first recognition module 302;
the generation module 303 in FIG. 3 has the same or similar functions as the generation module 203 in FIG. 2.
the adding module 304 is configured to add the subtitle information generated by the generation module 303 to the video file, so that the video file carries the subtitle information when played.
the adding module 304 in FIG. 3 has the same or similar functions as the adding module 204 in FIG. 2.
An embodiment of the present disclosure provides a possible implementation in which, specifically,
the first recognition module 302 is configured to perform speech recognition on the audio information based on a pre-trained speech recognition model to obtain the text information corresponding to the audio information.
In this embodiment of the present disclosure, the text information corresponding to the audio information is obtained through the pre-trained speech recognition model, which solves the problem of automatically acquiring that text and saves the labor cost and time cost of manually converting the audio information into text, providing the precondition for quickly adding subtitle information afterwards.
An embodiment of the present disclosure provides a possible implementation in which, specifically, the first recognition module 302 is configured to perform acoustic feature extraction on the audio information to obtain the voice environment features corresponding to the audio information.
In this embodiment of the present disclosure, the voice environment features corresponding to the audio information are obtained by extracting the acoustic features of the audio information, which solves the problem of acquiring the voice environment features.
The voice environment features include at least one of the following:
intonation; speech rate; rhythm; speech intensity.
In this embodiment of the present disclosure, different voice environment features can be set and acquired according to different application requirements.
An embodiment of the present disclosure provides a possible implementation, in which the generation module 303 includes a first determination unit 3031 and a first generation unit 3032;
the first determination unit 3031 is configured to determine, according to the voice environment features, the subtitle display configuration information matching the voice environment features;
the first generation unit 3032 is configured to generate, according to the subtitle display configuration information determined by the first determination unit 3031, the subtitle information corresponding to the text information.
In this embodiment of the present disclosure, the matching subtitle display configuration information is determined according to the obtained voice environment features, and the subtitle information corresponding to the text information is then generated according to that configuration information, which solves the problem of how to determine the subtitle information for different voice environment features.
An embodiment of the present disclosure provides a possible implementation, in which the generation module 303 includes a second determination unit 3033, a third determination unit 3034, and a second generation unit 3035;
the second determination unit 3033 is configured to determine, based on the text information and the voice environment features, the emotional feature type and/or tone type corresponding to the audio information;
the third determination unit 3034 is configured to determine, according to the emotional feature type and/or tone type determined by the second determination unit 3033, the subtitle display configuration information matching the emotional feature type and/or tone type;
the second generation unit 3035 is configured to generate, according to the subtitle display configuration information determined by the third determination unit 3034, the subtitle information corresponding to the text information.
In this embodiment of the present disclosure, the emotional feature type and/or tone type corresponding to the audio information is determined based on the text information and the voice environment features, the matching subtitle display configuration information is then determined according to the obtained emotional feature type and/or tone type, and the subtitle information corresponding to the text information is generated according to that configuration information, which solves the problem of how to determine the subtitle information for different emotional feature types and/or tone types.
The subtitle display configuration information includes at least one of the following:
subtitle text attribute information; subtitle special effect information; subtitle display position.
In this embodiment of the present disclosure, setting different subtitle display configuration information makes the subtitle display more personalized and thereby enhances viewers' interest in the video.
An embodiment of the present disclosure provides a possible implementation, in which the apparatus 30 further includes a second extraction module 305, a second recognition module 306, and an adjustment module 307;
the second extraction module 305 is configured to extract image frames of the video file;
the second recognition module 306 is configured to recognize, through an image recognition technique, the image frames extracted by the second extraction module 305 to obtain body part information of the corresponding person in the image frames;
the adjustment module 307 is configured to adjust the subtitle display position of the subtitle information based on the body part information recognized by the second recognition module 306.
In this embodiment of the present disclosure, the body part information of the corresponding person in the video is determined through image recognition and the subtitle display position of the subtitle information is then adjusted, realizing the associated display of the subtitle information with the body parts of the person in the video and making the subtitle display more personalized.
An embodiment of the present disclosure provides a subtitle adding apparatus, which is suitable for the method shown in the above embodiments; details are not repeated here.
An embodiment of the present disclosure provides an electronic device. FIG. 4 shows a schematic structural diagram of an electronic device (for example, a terminal device or a server) 40 suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 40 may include a processing device (such as a central processing unit or a graphics processor) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 40. The processing device 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices can be connected to the I/O interface 405: input devices 406 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; output devices 407 such as a liquid crystal display (LCD), speakers, or vibrators; storage devices 408 such as a magnetic tape or hard disk; and a communication device 409. The communication device 409 may allow the electronic device 40 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows an electronic device 40 having various devices, it should be understood that it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
An embodiment of the present disclosure provides an electronic device suitable for the method shown in the above embodiments; details are not repeated here.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 409, installed from the storage device 408, or installed from the ROM 402. When the computer program is executed by the processing device 401, the above functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the above computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to an electric wire, an optical cable, RF (radio frequency), or any suitable combination of the foregoing.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire at least two Internet Protocol addresses; send to a node evaluation device a node evaluation request including the at least two Internet Protocol addresses, wherein the node evaluation device selects an Internet Protocol address from the at least two Internet Protocol addresses and returns it; and receive the Internet Protocol address returned by the node evaluation device, wherein the acquired Internet Protocol address indicates an edge node in a content distribution network.
Alternatively, the above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive a node evaluation request including at least two Internet Protocol addresses; select an Internet Protocol address from the at least two Internet Protocol addresses; and return the selected Internet Protocol address, wherein the received Internet Protocol address indicates an edge node in a content distribution network.
The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof. The above programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, connected through the Internet using an Internet service provider).
The embodiments of the present disclosure provide a computer-readable storage medium suitable for the method shown in the above embodiments; details are not repeated here.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logic functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example technical solutions formed by replacing the above features with technical features disclosed in (but not limited to) the present disclosure that have similar functions.

Claims (11)

  1. A subtitle adding method, comprising:
    extracting audio information from a video file to which subtitles are to be added;
    performing speech recognition on the audio information to obtain text information and voice environment features corresponding to the audio information;
    generating corresponding subtitle information according to the obtained text information and voice environment features;
    adding the subtitle information to the video file, so that the video file carries the subtitle information when played.
  2. The method according to claim 1, wherein performing speech recognition on the audio information to obtain the text information corresponding to the audio information comprises:
    performing speech recognition on the audio information based on a pre-trained speech recognition model to obtain the text information corresponding to the audio information.
  3. The method according to claim 1, wherein performing speech recognition on the audio information to obtain the voice environment features corresponding to the audio information comprises:
    performing acoustic feature extraction on the audio information to obtain the voice environment features corresponding to the audio information.
  4. The method according to claim 3, wherein the voice environment features comprise at least one of the following:
    intonation; speech rate; rhythm; speech intensity.
  5. The method according to claim 1, wherein generating corresponding subtitle information according to the obtained text information and voice environment features comprises:
    determining, according to the voice environment features, subtitle display configuration information matching the voice environment features;
    generating, according to the subtitle display configuration information, subtitle information corresponding to the text information.
  6. The method according to claim 1, wherein generating corresponding subtitle information according to the obtained text information and voice environment features comprises:
    determining, based on the text information and the voice environment features, an emotional feature type and/or a tone type corresponding to the audio information;
    determining, according to the emotional feature type and/or tone type, subtitle display configuration information matching the emotional feature type and/or tone type;
    generating, according to the subtitle display configuration information, subtitle information corresponding to the text information.
  7. The method according to claim 1, wherein the subtitle display configuration information comprises at least one of the following:
    subtitle text attribute information; subtitle special effect information; subtitle display position.
  8. The method according to claim 7, further comprising:
    extracting image frames of the video file;
    recognizing the image frames through an image recognition technique to obtain body part information of a corresponding person in the image frames;
    adjusting a subtitle display position of the subtitle information based on the body part information.
  9. A subtitle adding apparatus, comprising:
    a first extraction module configured to extract audio information from a video file to which subtitles are to be added;
    a first recognition module configured to perform speech recognition on the audio information extracted by the first extraction module, to obtain text information and voice environment features corresponding to the audio information;
    a generation module configured to generate corresponding subtitle information according to the text information and voice environment features recognized by the first recognition module;
    an adding module configured to add the subtitle information generated by the generation module to the video file, so that the video file carries the subtitle information when played.
  10. An electronic device, comprising:
    one or more processors and a memory,
    wherein the memory is configured to store one or more application programs;
    the one or more processors are configured to execute the subtitle adding method according to any one of claims 1 to 8 by calling the one or more application programs.
  11. A computer-readable storage medium for storing computer instructions which, when run on a computer, enable the computer to execute the subtitle adding method according to any one of claims 1 to 8.
PCT/CN2018/125397 2018-11-16 2018-12-29 Subtitle adding method and apparatus, electronic device and computer-readable storage medium WO2020098115A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811367918.4A CN109257659A (zh) 2018-11-16 2018-11-16 Subtitle adding method and apparatus, electronic device and computer-readable storage medium
CN201811367918.4 2018-11-16

Publications (1)

Publication Number Publication Date
WO2020098115A1 true WO2020098115A1 (zh) 2020-05-22

Family

ID=65043671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/125397 WO2020098115A1 (zh) 2018-11-16 2018-12-29 字幕添加方法、装置、电子设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN109257659A (zh)
WO (1) WO2020098115A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818279A (zh) * 2019-04-12 2020-10-23 阿里巴巴集团控股有限公司 字幕的生成方法、展示方法及交互方法
CN110297941A (zh) * 2019-07-10 2019-10-01 北京中网易企秀科技有限公司 一种音频文件处理方法及装置
CN110798636B (zh) * 2019-10-18 2022-10-11 腾讯数码(天津)有限公司 字幕生成方法及装置、电子设备
CN112752047A (zh) * 2019-10-30 2021-05-04 北京小米移动软件有限公司 视频录制方法、装置、设备及可读存储介质
CN110827825A (zh) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 语音识别文本的标点预测方法、系统、终端及存储介质
CN111970577B (zh) * 2020-08-25 2023-07-25 北京字节跳动网络技术有限公司 字幕编辑方法、装置和电子设备
CN112579826A (zh) * 2020-12-07 2021-03-30 北京字节跳动网络技术有限公司 视频显示及处理方法、装置、系统、设备、介质
CN115150631A (zh) * 2021-03-16 2022-10-04 北京有竹居网络技术有限公司 字幕处理方法、装置、电子设备和存储介质
CN112714355B (zh) * 2021-03-29 2021-08-31 深圳市火乐科技发展有限公司 音频可视化的方法、装置、投影设备及存储介质
CN115312032A (zh) * 2021-05-08 2022-11-08 京东科技控股股份有限公司 语音识别训练集的生成方法及装置
CN113660536A (zh) * 2021-09-28 2021-11-16 北京七维视觉科技有限公司 一种字幕显示方法和装置
CN114007145A (zh) * 2021-10-29 2022-02-01 青岛海信传媒网络技术有限公司 一种字幕显示方法及显示设备
CN114095782A (zh) * 2021-11-12 2022-02-25 广州博冠信息科技有限公司 一种视频处理方法、装置、计算机设备及存储介质
CN116916085A (zh) * 2023-09-12 2023-10-20 飞狐信息技术(天津)有限公司 一种端到端字幕生成方法、装置、电子设备和存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328146A (zh) * 2016-08-22 2017-01-11 广东小天才科技有限公司 一种视频的字幕生成方法及装置
CN106504754A (zh) * 2016-09-29 2017-03-15 浙江大学 一种根据音频输出的实时字幕生成方法
CN107172485A (zh) * 2017-04-25 2017-09-15 北京百度网讯科技有限公司 一种用于生成短视频的方法与装置
US20180108356A1 (en) * 2016-10-17 2018-04-19 Honda Motor Co., Ltd. Voice processing apparatus, wearable apparatus, mobile terminal, and voice processing method
CN108289244A (zh) * 2017-12-28 2018-07-17 努比亚技术有限公司 视频字幕处理方法、移动终端及计算机可读存储介质
CN108401192A (zh) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 视频流处理方法、装置、计算机设备及存储介质
CN108419141A (zh) * 2018-02-01 2018-08-17 广州视源电子科技股份有限公司 一种字幕位置调整的方法、装置、存储介质及电子设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012003407A (ja) * 2010-06-15 2012-01-05 Sony Corp 情報処理装置、同一性判定システム、同一性判定方法およびコンピュータプログラム
WO2015163555A1 (ko) * 2014-04-22 2015-10-29 주식회사 뱁션 자막 삽입 시스템 및 방법
CN105245917B (zh) * 2015-09-28 2018-05-04 徐信 一种多媒体语音字幕生成的系统和方法
CN106506335B (zh) * 2016-11-10 2019-08-30 北京小米移动软件有限公司 分享视频文件的方法及装置
CN108063722A (zh) * 2017-12-20 2018-05-22 北京时代脉搏信息技术有限公司 视频数据生成方法、计算机可读存储介质和电子设备
CN108184135B (zh) * 2017-12-28 2020-11-03 泰康保险集团股份有限公司 字幕生成方法及装置、存储介质及电子终端
CN108259971A (zh) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 字幕添加方法、装置、服务器及存储介质


Also Published As

Publication number Publication date
CN109257659A (zh) 2019-01-22

Similar Documents

Publication Publication Date Title
WO2020098115A1 (zh) 字幕添加方法、装置、电子设备及计算机可读存储介质
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
US20220115019A1 (en) Method and system for conversation transcription with metadata
TWI425500B (zh) 以數位語音中表現的單字索引數位語音
CN108831437B (zh) 一种歌声生成方法、装置、终端和存储介质
CN108012173B (zh) 一种内容识别方法、装置、设备和计算机存储介质
JP6681450B2 (ja) 情報処理方法および装置
WO2020113733A1 (zh) 动画生成方法、装置、电子设备及计算机可读存储介质
US20200058288A1 (en) Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
WO2016165334A1 (zh) 一种语音处理方法及装置、终端设备
CN111919249A (zh) 词语的连续检测和相关的用户体验
US9569168B2 (en) Automatic rate control based on user identities
CN110310642B (zh) 语音处理方法、系统、客户端、设备和存储介质
US20130246061A1 (en) Automatic realtime speech impairment correction
WO2023029904A1 (zh) 文本内容匹配方法、装置、电子设备及存储介质
KR20200027331A (ko) 음성 합성 장치
WO2020173211A1 (zh) 图像特效的触发方法、装置和硬件装置
CN111883107B (zh) 语音合成、特征提取模型训练方法、装置、介质及设备
CN111640434A (zh) 用于控制语音设备的方法和装置
CN112908292A (zh) 文本的语音合成方法、装置、电子设备及存储介质
WO2020124754A1 (zh) 多媒体文件的翻译方法、装置及翻译播放设备
CN110379406B (zh) 语音评论转换方法、系统、介质和电子设备
WO2018120820A1 (zh) 一种演示文稿的制作方法和装置
WO2021169825A1 (zh) 语音合成方法、装置、设备和存储介质
CN111369968A (zh) 声音复制方法、装置、可读介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18940265

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18940265

Country of ref document: EP

Kind code of ref document: A1