CN115396690A - Audio and text combination method and device, electronic equipment and storage medium - Google Patents
Audio and text combination method and device, electronic equipment and storage medium
- Publication number
- CN115396690A CN115396690A CN202211049871.3A CN202211049871A CN115396690A CN 115396690 A CN115396690 A CN 115396690A CN 202211049871 A CN202211049871 A CN 202211049871A CN 115396690 A CN115396690 A CN 115396690A
- Authority
- CN
- China
- Prior art keywords
- text
- subtitle
- audio
- target
- video file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/235—Processing of additional data, e.g. scrambling of additional data or processing content descriptors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
Abstract
Embodiments of the present disclosure provide an audio and text combination method and device, an electronic device, and a storage medium. The audio and text combination method includes: performing text recognition on key frames in a video file to obtain a first subtitle set; performing subtitle-switch detection on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches; intercepting audio from the audio file of the video file based on the timestamps to obtain target audio; and constructing a text-audio pair based on the target subtitle and the target audio. The embodiments of the present disclosure can improve the degree of matching between audio and text.
Description
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular to an audio and text combination method and device, an electronic device, and a storage medium.
Background Art
Speech recognition is a branch of artificial intelligence technology that is now widely used in fields such as voice input, voice interaction, voice search, voice control, and voice subtitling. Under the impact of the current epidemic in particular, contactless voice interaction and control have further highlighted the value of speech recognition technology. Because speech recognition is based on deep learning algorithms, it relies on massive amounts of high-quality training data to improve algorithm accuracy and bring the quality of recognition to the desired level.
Training data for existing speech recognition algorithms is generally generated as follows: obtain video resources with subtitles, such as audiobooks, live commentary, documentaries, dramas, interviews, news, variety shows, and self-media videos; extract the subtitles and their timestamps from the video using OCR (optical character recognition); use the timestamps to match each subtitle to its corresponding audio segment in the video; and combine the subtitle and the audio segment into a <text, audio> training sample pair.
However, existing subtitle extraction schemes all use OCR to extract subtitles from videos. This approach relies only on one-sided information from the video content, so subtitle extraction accuracy depends on the accuracy of the OCR technology. When the video has a complex background or the subtitle position is not fixed, OCR performs poorly, and when matching <text, audio> pairs the subtitle and audio timing information is easily misaligned and the subtitle content may fail to line up with the audio content. This in turn degrades the recognition accuracy of the speech recognition algorithm.
Summary of the Invention
Embodiments of the present disclosure provide an audio and text combination method and device, an electronic device, and a storage medium, so as to solve or alleviate one or more technical problems in the prior art.
As a first aspect of the embodiments of the present disclosure, a method for combining audio and text is provided, including:
performing text recognition on key frames in a video file to obtain a first subtitle set;
performing subtitle-switch detection on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches;
intercepting audio from the audio file of the video file based on the timestamps, to obtain target audio; and
constructing a text-audio pair based on the target subtitle and the target audio.
As a second aspect of the embodiments of the present disclosure, an audio and text combination device is provided, including:
a text recognition module, configured to perform text recognition on key frames in a video file to obtain a first subtitle set;
a subtitle-switch detection module, configured to perform subtitle-switch detection on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches;
a target audio interception module, configured to intercept audio from the audio file of the video file based on the timestamps, to obtain target audio; and
an audio-text combination module, configured to construct a text-audio pair based on the target subtitle and the target audio.
As a third aspect of the embodiments of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor, wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the audio and text combination method provided by the embodiments of the present disclosure.
As a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to perform the audio and text combination method provided by the embodiments of the present disclosure.
As a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the audio and text combination method provided by the embodiments of the present disclosure.
In the technical solution provided by the embodiments of the present disclosure, text recognition is performed on key frames in a video file to obtain a first subtitle set, and subtitle-switch detection is performed on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches. Audio is then intercepted from the audio file of the video file based on these timestamps, yielding accurately delimited target audio, so that the text-audio pair constructed from the target subtitle and the target audio matches closely, avoiding misalignment between subtitle and audio timing information and between subtitle content and audio content.
The above summary is for illustrative purposes only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present disclosure will be readily apparent from the drawings and the following detailed description.
Brief Description of the Drawings
In the drawings, unless otherwise specified, the same reference numerals designate the same or similar parts or elements throughout the several views. The drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments according to the present disclosure and should not be taken as limiting its scope.
FIG. 1 is a flowchart of an audio and text combination method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a subtitle processing method according to an embodiment of the present disclosure;
FIG. 3A is a schematic diagram of a key frame according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a key frame according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a key frame according to another embodiment of the present disclosure;
FIG. 5 is a flowchart of the construction process of training samples for a speech recognition model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a subtitle file generation process according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of subtitle verification, switch detection, and error correction according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the decomposition of a pictographic character according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a subtitle file according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an audio and text combination device according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of an electronic device according to an embodiment of the present disclosure;
Detailed Description
In the following, only some exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various ways without departing from the spirit or scope of the present disclosure. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature and not restrictive.
FIG. 1 is a flowchart of the audio and text combination method provided by an embodiment of the present disclosure. As shown in FIG. 1, the audio and text combination method may include the following steps:
S110: performing text recognition on key frames in a video file to obtain a first subtitle set;
S120: performing subtitle-switch detection on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches;
S130: intercepting audio from the audio file of the video file based on the timestamps at which the target subtitle switches, to obtain target audio;
S140: constructing a text-audio pair based on the target subtitle and the target audio.
In this example, text recognition is performed on key frames in the video file to obtain a first subtitle set, and subtitle-switch detection is performed on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle switches. Audio is then intercepted from the audio file of the video file based on these timestamps, yielding accurately delimited target audio, so that the text-audio pair constructed from the target subtitle and the target audio matches closely, avoiding misalignment between subtitle and audio timing information and between subtitle content and audio content.
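As a rough illustration only, steps S110 to S140 can be sketched as follows. The `Frame` container, the assumption that OCR results per key frame are already available, and the choice of a group's last key-frame time as the end of the audio span are all simplifications introduced for this sketch; they are not details fixed by the disclosure.

```python
# Hypothetical sketch of steps S110-S140: group consecutive key frames that
# carry the same OCR'd subtitle (no switch between them), then cut the
# matching audio span for each group to form <text, audio> pairs.
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float  # seconds into the video
    subtitle: str     # subtitle text recognized from this key frame

def build_pairs(frames, audio_samples, sample_rate):
    pairs = []
    i = 0
    while i < len(frames):
        # Extend the group while the subtitle stays the same (no switch).
        j = i
        while j + 1 < len(frames) and frames[j + 1].subtitle == frames[i].subtitle:
            j += 1
        # The start/end timestamps of the subtitle delimit the audio span.
        lo = int(frames[i].timestamp * sample_rate)
        hi = int(frames[j].timestamp * sample_rate)
        pairs.append((frames[i].subtitle, audio_samples[lo:hi]))
        i = j + 1
    return pairs
```

A real implementation would feed `frames` from an OCR engine and `audio_samples` from the demultiplexed audio track of the video file.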
Exemplarily, a video file includes multiple image frames played in time sequence along with the corresponding audio. Video files may include audiobooks, live commentary, documentaries, dramas, interviews, news, variety shows, self-media content, and other video resources, and may be short videos or longer ones.
Exemplarily, in step S110, text recognition may be performed on all key frames in the video file, or only on some of them. A key frame may be any image frame in the video file, or an image frame containing text.
Exemplarily, the text recognition may use OCR technology.
Exemplarily, the first subtitle set includes multiple subtitles, each of which may come from a different key frame. Different subtitles may have the same content or different content.
Exemplarily, a given subtitle may appear in multiple consecutive key frames; when it differs from the subtitle of the previous frame, or from the subtitle of the next frame, a subtitle switch is considered to have occurred. Subtitle-switch detection may therefore use the subtitle similarity of two adjacent frames to detect whether a switch has occurred. If the subtitle similarity between two adjacent frames is greater than a set subtitle similarity threshold, it is determined that no subtitle switch has occurred; if the similarity is less than the threshold, a subtitle switch is determined to have occurred. When a switch occurs, its timestamp can be determined from the timestamps of the two adjacent frames, for example by using the timestamp of the first frame of the pair, or that of the second frame, as the switch timestamp.
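The adjacent-frame comparison just described might be sketched as follows. `difflib.SequenceMatcher` and the 0.8 threshold are assumptions: the disclosure only requires some subtitle similarity measure and a set threshold, and notes that either frame of the pair may supply the switch timestamp (the first frame is used here).

```python
# Sketch of subtitle-switch detection between adjacent key frames.
from difflib import SequenceMatcher

def subtitle_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the subtitles of two adjacent frames."""
    return SequenceMatcher(None, a, b).ratio()

def detect_switches(subtitles, timestamps, threshold=0.8):
    """Return the timestamps at which a subtitle switch occurs: similarity
    below the threshold means a switch; at or above it, no switch."""
    switches = []
    for k in range(1, len(subtitles)):
        if subtitle_similarity(subtitles[k - 1], subtitles[k]) < threshold:
            switches.append(timestamps[k - 1])  # first frame of the pair
    return switches
```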
Exemplarily, the timestamps at which a subtitle switches include a start timestamp and an end timestamp. Every key frame from the start timestamp to the end timestamp contains the same subtitle content; that is, the subtitle similarity of every pair of adjacent frames in that span is greater than the set subtitle similarity threshold.
Exemplarily, the constructed text-audio pairs can be used in the training of a speech recognition model.
In some embodiments, in step S110, the process of performing text recognition on key frames in the video file to obtain the first subtitle set includes operations such as verifying whether a text is a subtitle and filtering out non-subtitle text.
Exemplarily, FIG. 2 is a flowchart of a subtitle processing method provided by an embodiment of the present disclosure. As shown in FIG. 2, the subtitle processing method may include the following steps:
S210: performing text recognition on key frames in a video file to obtain a candidate subtitle text;
S220: determining, in the video file, the verification audio corresponding to the candidate subtitle text;
S230: determining that the candidate subtitle text is subtitle text of the video file when the relationship between the candidate subtitle text and the verification audio meets a set condition;
S240: obtaining the subtitle text of the video file to obtain a first subtitle set.
In this embodiment, the verification text obtained by speech recognition is used to verify the candidate subtitle text obtained by text recognition and determine whether it is a real subtitle. This improves the accuracy of subtitle recognition, so that when text-audio pairs are subsequently constructed, misalignment between text and audio timing information and between text content and audio content can be avoided.
Exemplarily, in step S210, text recognition may be performed on all key frames in the video file, or only on some of them. A key frame may be any image frame in the video file, or an image frame containing text.
Exemplarily, the text recognition in step S210 may use OCR technology. Text recognition on a key frame may yield one or more texts, such as text blocks or text lines. Some of these texts may not be subtitles, so they are treated as candidate subtitles, and audio verification is then performed on them to accurately determine whether each candidate is a real subtitle.
Exemplarily, text recognition is performed on a key frame of the video file to obtain at least one text, and candidate subtitle texts are determined among the texts in the key frame according to the subtitle region of the video file and the position information of each text, so as to further determine whether each candidate is a subtitle. For example, texts whose position information matches the subtitle region are considered subtitles, while texts whose position information does not match the subtitle region may or may not be subtitles and need further verification. The position information may include the region a text occupies in the key frame; for example, the text may be enclosed in a rectangular box, and the coordinates of the four corners of that box give the text's region in the key frame.
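The position-based pre-filter above can be sketched as follows, with each box given by its corner coordinates. The 0.5 minimum-overlap ratio is an assumed parameter; the disclosure only says that text positions are matched against the subtitle region.

```python
# Sketch: accept OCR texts whose bounding box sufficiently overlaps the
# known subtitle region; the rest become candidates for audio verification.
def box_overlap_ratio(box, region):
    """Fraction of `box`'s area lying inside `region`; both are
    (x1, y1, x2, y2) rectangles with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box[0], region[0]), max(box[1], region[1])
    ix2, iy2 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area else 0.0

def split_texts(texts, subtitle_region, min_overlap=0.5):
    """texts: list of (string, box) pairs from OCR. Returns
    (accepted subtitles, candidates needing audio verification)."""
    subtitles, candidates = [], []
    for text, box in texts:
        if box_overlap_ratio(box, subtitle_region) >= min_overlap:
            subtitles.append(text)
        else:
            candidates.append(text)
    return subtitles, candidates
```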
Exemplarily, a candidate subtitle text is a text that has not yet been determined to be a subtitle; it may be a subtitle, or it may be background text or watermark text in the key frame.
Exemplarily, FIG. 3A and FIG. 3B show a scene in which a teacher, Actor1, is teaching. The text in the region the teacher is pointing to is background text, while the two text blocks below that region, 这唐代诗人李白的诗 ("this is a poem by the Tang dynasty poet Li Bai") and 诗人用潭水深千尺比喻汪伦与他的友情运用了夸张的手法 ("the poet uses the thousand-foot-deep pool as an exaggerated metaphor for his friendship with Wang Lun"), are subtitles. With OCR alone, the background text and the subtitles are recognized at the same time, and it is impossible to tell which texts are subtitles and which are background. The method provided by the embodiments of the present disclosure can accurately identify the subtitles and filter out the background text.
Exemplarily, in step S220, the corresponding audio segment may be cut from the audio of the video file based on the timestamps of the candidate subtitle text, yielding the verification audio. These timestamps may be the timestamps at which the candidate subtitle text switches in the video file, i.e. the times of its first and last appearance in the temporally consecutive image frames. The target audio corresponding to the candidate subtitle text can thus be cut accurately, improving the accuracy of subsequent subtitle verification.
In some embodiments, whether a candidate subtitle text is a subtitle can be determined using the coincidence degree or confidence between the candidate subtitle text and its corresponding verification text.
Exemplarily, step S230 may include: performing speech recognition on the verification audio to obtain a verification text; and determining that the candidate subtitle text is subtitle text of the video file when the coincidence degree between the candidate subtitle text and the verification text is greater than a set coincidence-degree threshold.
Exemplarily, the above speech recognition may use ASR (Automatic Speech Recognition) technology.
Exemplarily, the coincidence degree refers to the ratio of the characters shared by two strings to all their characters, or to the characters of one of the strings. The ratio may be a ratio of character counts or of character lengths.
Exemplarily, the coincidence degree m between a candidate subtitle text and its corresponding verification text is m = len(string(ocr) & string(asr)) / len(string(ocr)), where string(ocr) is the candidate subtitle text, string(asr) is the verification text corresponding to the candidate subtitle text, len(string(ocr) & string(asr)) is the length of the character intersection between the candidate subtitle text and the verification text, and len(string(ocr)) is the length of the candidate subtitle text. The character intersection is the set of characters shared by the candidate subtitle text and the verification text.
在本示例中,可以利用候选字幕文本与其对应的音频所识别到的校验文本之间的重合度,来确定该候选字幕文本是否字幕。从而,提高字幕识别的准确度。In this example, it may be determined whether the candidate subtitle text is a subtitle by using the coincidence degree between the candidate subtitle text and the verification text identified by the corresponding audio. Therefore, the accuracy of subtitle recognition is improved.
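The overlap formula above can be sketched in Python. Treating the character intersection as a multiset (each shared character counted as many times as it occurs in both strings) and the 0.8 acceptance threshold are illustrative assumptions, not values fixed by the disclosure:

```python
from collections import Counter

def overlap_degree(ocr_text: str, asr_text: str) -> float:
    """m = len(string(ocr) & string(asr)) / len(string(ocr)).

    The character intersection is taken as a multiset: each character
    counts as many times as it appears in both strings.
    """
    if not ocr_text:
        return 0.0
    common = Counter(ocr_text) & Counter(asr_text)  # multiset intersection
    return sum(common.values()) / len(ocr_text)

def is_subtitle(ocr_text: str, asr_text: str, threshold: float = 0.8) -> bool:
    # A candidate OCR line is accepted as a subtitle when m exceeds a threshold.
    return overlap_degree(ocr_text, asr_text) > threshold
```

For example, `overlap_degree("abcd", "abxy")` is 0.5: two of the four OCR characters also occur in the ASR output.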
In some embodiments, when subtitle recognition is performed on a video file at scale, some obviously non-subtitle text, such as watermarks, may be filtered out in advance.
Exemplarily, in the above step S210, performing text recognition on the key frames in the video file to obtain the candidate subtitle text includes:
performing text recognition on the key frames in the video file to obtain a first text set;
determining non-subtitle text based on the first text set;
filtering the non-subtitle text out of the first text set to obtain a second text set; and
determining the candidate subtitle text based on the second text set.
In this example, filtering out non-subtitle text in advance helps improve the efficiency of subsequent subtitle recognition.
Exemplarily, before text recognition is performed on each key frame in the video file, the key frames are extracted from the video file. Text detection is performed on every image frame in the video file, and a frame-difference method is used to compare whether two successive frames repeat each other; redundant frames are deleted, finally yielding multiple key frames, each of which contains text. Whether two successive frames are duplicates can be determined by comparing the similarity, for example the edit distance, between the texts in the two frames. The model performing text detection on the image frames may be implemented with an RCNN deep learning model. In addition, to speed up key frame extraction, key frames may be sampled at a set interval, for example extracting 3 frames per second as key frames, or extracting one frame every 8 seconds as a key frame.
Exemplarily, performing text recognition on a key frame yields the text information and position information of each text in that key frame. Each text may be cropped out in the form of a text block or a text line. For example, the text recognition result of one key frame may be denoted as:
[(text1, (xmin, xmax, ymin, ymax)), (text2, (xmin, xmax, ymin, ymax)), ...];
where text1 is one text and text2 is another text; xmin and xmax denote the minimum and maximum abscissas of the text block in the current key frame; and ymin and ymax denote the minimum and maximum ordinates of the text block in the current key frame. FIG. 4 shows the position information of the text block "Study hard and improve every day" in the current frame.
In some embodiments, the non-subtitle text includes background text such as watermarks, station logos, or advertisements. For example, some TV dramas and variety shows carry background information such as station logos or advertisements; in FIG. 4, "XXTV drama channel" is a station logo. Such background information tends to appear at a fixed position in the video and to persist for the entire playback duration, that is, it is present in all key frames of the video. Therefore, based on the position information of each text in the first text set, the occurrence frequency of texts at the same position can be counted, and texts whose occurrence frequency exceeds a set threshold can be determined as non-subtitle text.
Exemplarily, the above determining the non-subtitle text based on the position information of each text in the first text set may include: determining, based on the position information of each text in the first text set, the occurrence frequency of texts at the same position; and, when the occurrence frequency is greater than a set frequency threshold, determining the texts at the same position as non-subtitle text.
It should be noted that the position coordinates of the same text may differ slightly between successive or different frames. For example, the position coordinates recognized for "Study hard and improve every day" in FIG. 4 in two successive frames may be (125, 512, 678, 780) and (126, 513, 679, 781). These two sets of position coordinates can therefore be normalized and treated as the same position.
Exemplarily, when the position distance between two texts belonging to different key frames, determined from their position information, is less than a predetermined distance threshold, the two texts are determined to be texts at the same position.
Exemplarily, the position of text block 1 in key frame a is (x1, y1), and the position of text block 2 in key frame b is (x2, y2). If ||x1 - x2|| < a and ||y1 - y2|| < b, then text block 1 and text block 2 are texts at the same position.
Exemplarily, the occurrence frequency of the texts at each same position is counted. The text with the highest occurrence frequency is determined as non-subtitle text; alternatively, texts whose occurrence frequency exceeds, for example, 80% or 70% of the total number of key frames in the video file are determined as non-subtitle text.
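A minimal sketch of the frequency-based background-text filter described above. Clustering by exact text match plus a fixed coordinate tolerance of 5, and the 80% frequency ratio, are assumptions chosen for illustration:

```python
def same_position(box1, box2, tol=5):
    """Boxes (xmin, xmax, ymin, ymax) count as 'the same position' when
    every coordinate differs by less than the tolerance (the
    normalization step described in the text)."""
    return all(abs(a - b) < tol for a, b in zip(box1, box2))

def find_non_subtitle(frames, freq_ratio=0.8, tol=5):
    """frames: per-key-frame OCR results, each a list of (text, box).
    Returns texts whose (normalized) position recurs with the same
    content in more than freq_ratio of all key frames, e.g. station
    logos and watermarks."""
    clusters = []  # each entry: [text, box, count]
    for frame in frames:
        for text, box in frame:
            for c in clusters:
                if c[0] == text and same_position(c[1], box, tol):
                    c[2] += 1
                    break
            else:
                clusters.append([text, list(box), 1])
    threshold = freq_ratio * len(frames)
    return {c[0] for c in clusters if c[2] > threshold}
```

A logo appearing (with small coordinate jitter) in every key frame crosses the 80% threshold, while subtitles, whose text changes from frame to frame, do not.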
In some embodiments, after the non-subtitle text is filtered out of the first text set, the second text set is obtained. At this point, all the texts in the second text set may be used as candidate subtitle texts for audio verification to obtain the subtitle text of the video file. Alternatively, the texts in the second text set are screened, and the screened texts are used as the candidate subtitle texts.
In some embodiments, the candidate subtitle texts may be screened out of the second text set by using a subtitle area.
In some embodiments, the subtitle area may be determined as follows: after the non-subtitle text has been filtered out, the remaining texts may be sorted by the occurrence frequency of their positions, and the location area of the texts whose occurrence frequency exceeds a set threshold, for example the most frequent texts, is determined as the subtitle area of the video.
In some embodiments, the subtitle area may be used to determine whether a text in the second text set, obtained after filtering out the non-subtitle text, is a subtitle. For example, if the position of a text matches the subtitle area, the text can be determined to be a subtitle. If the position of a text does not match the subtitle area, it cannot be determined whether the text is a subtitle, and further verification is required.
Exemplarily, the above determining the candidate subtitle text based on the second text set includes:
determining a first subtitle area based on the position information of each text in the second text set; and
for each text in the second text set, when the degree of match between the position information of the text and the first subtitle area is less than a set matching threshold, determining the text to be a candidate subtitle text.
In this example, a candidate subtitle text is a text for which it cannot yet be determined whether it is a subtitle; it needs to be further verified against the verification text obtained through speech recognition, thereby improving the accuracy of subtitle recognition.
Exemplarily, the above determining the first subtitle area based on the position information of each text in the second text set may include: sorting the texts in the second text set by the occurrence frequency of texts at the same position, and determining the location area of the text with the highest occurrence frequency as the first subtitle area.
Exemplarily, FIG. 3A and FIG. 3B show two key frames from a video of a teacher giving a lesson. The key frames contain background text as well as multi-line subtitles. The text in the text area that the teacher Actor1 points to is background text. Checking the text area pointed to by Actor1 against the subtitle area determined above reveals that the two do not match. Suppose the subtitle area is denoted (Xmin, Xmax, Ymin, Ymax) and the text area pointed to by Actor1 is (Xminc, Xmaxc, Yminc, Ymaxc). If Yminc < Ymin or Ymaxc > Ymax, the text in that text area lies outside the subtitle area and cannot be determined to be a subtitle. However, if that text area is directly deemed not to be a subtitle, subtitles may be lost. Therefore, such a text is determined to be a candidate subtitle text, and the verification text obtained by speech recognition of the corresponding audio is used to verify the candidate subtitle text, so as to determine whether it is a subtitle. Thus, on the one hand, loss of subtitles can be avoided, and on the other hand, the accuracy of subtitle recognition can be improved.
Since subtitles can be identified directly from the subtitle area, applying the subtitle area to the above first text set or second text set allows the first subtitle set to be obtained quickly.
Exemplarily, the above determining the candidate subtitle text based on the second text set further includes:
for each text in the second text set, when the degree of match between the position information of the text and the first subtitle area is greater than a set matching threshold, determining the text to be subtitle text of the video file.
Exemplarily, in the above step S210, performing text recognition on the key frames of the video file to obtain the first subtitle set may include:
performing text recognition on each key frame in the video file to obtain a third text set;
determining a second subtitle area based on the position information of each text in the third text set; and
for each text in the third text set, when the degree of match between the position information of the text and the second subtitle area is greater than a set matching threshold, determining the text to be subtitle text of the video file.
In this example, using the subtitle area to determine subtitles allows subtitles to be determined quickly and accurately, improving the efficiency of subtitle recognition.
Exemplarily, after text recognition is performed on each key frame in the video file to obtain the third text set, the non-subtitle text filtering method described above can be used to filter the non-subtitle text out of the third text set to obtain a filtered third text set. For details, refer to the aforementioned filtering of non-subtitle text from the second text set and the recognition of non-subtitle text, which are not repeated here.
Exemplarily, the process of determining the second subtitle area may be the same as or similar to the aforementioned process of determining the first subtitle area; refer to the foregoing for details, which are not repeated here.
Exemplarily, the first subtitle area and the second subtitle area may be the same area or different areas.
In some embodiments, after the first subtitle set is obtained, operations such as subtitle-switch detection and deduplication may be performed on it, which avoids duplicate text-audio combination pairs.
Exemplarily, the above step S120 may include:
determining the timestamps of subtitle switches based on the similarity between any two subtitle texts belonging to adjacent key frames in the first subtitle set, and dividing the first subtitle set by those timestamps to obtain multiple second subtitle sets;
deduplicating each second subtitle set to obtain a target subtitle; and
determining, based on the timestamps of the second subtitle set, the timestamps at which subtitle switches occur for the target subtitle.
In this example, subtitle-switch detection on the first subtitle set yields multiple second subtitle sets, each of which contains multiple subtitles with identical content. Each second subtitle set can therefore be deduplicated to obtain the target subtitles used to form text-audio combination pairs, avoiding duplicate pairs.
Exemplarily, before subtitle-switch detection is performed on the first subtitle set, subtitles belonging to the same key frame are merged. For example, a key frame may contain a two-line subtitle, and the two lines can be merged into a single subtitle. This facilitates the subsequent subtitle-switch detection.
Exemplarily, in the first subtitle set, the similarity between any two subtitle texts belonging to adjacent key frames is computed. When the similarity is greater than a set first similarity threshold, the subtitle texts of the adjacent key frames are determined to be subtitles with the same content. When the similarity is less than the set similarity threshold, the two subtitle texts of the adjacent key frames are determined to be subtitles with different content, i.e., a subtitle switch has occurred, and the timestamp corresponding to one of the two key frames is determined to be the timestamp of the subtitle switch.
After all the timestamps at which subtitle switches occur in the first subtitle set have been obtained, the timestamps are used as dividing lines: subtitles corresponding to key frames whose timestamps fall between two dividing lines are grouped into the same subtitle set, yielding multiple second subtitle sets.
Exemplarily, a second subtitle set contains multiple subtitles, all of which have the same content.
Exemplarily, the start timestamp and end timestamp corresponding to a second subtitle set are the timestamps at which subtitle switches occur for the target subtitle.
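The switch-detection, grouping, and deduplication steps above can be sketched as follows. Using `difflib`'s ratio as the similarity measure and a 0.9 first similarity threshold are assumptions standing in for the edit-distance-based similarity the disclosure describes:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Any normalized string similarity works; difflib's ratio stands in
    # for an edit-distance-based measure here.
    return SequenceMatcher(None, a, b).ratio()

def split_by_switches(subtitles, sim_threshold=0.9):
    """subtitles: time-ordered list of (timestamp, text), one per key frame.
    A new group (second subtitle set) starts whenever consecutive texts
    differ, i.e. a subtitle switch occurs. Each group is then
    deduplicated to one target subtitle with its first/last timestamps."""
    groups = []
    for ts, text in subtitles:
        if groups and similarity(groups[-1][-1][1], text) > sim_threshold:
            groups[-1].append((ts, text))
        else:
            groups.append([(ts, text)])  # switch detected: new group
    return [(g[0][1], g[0][0], g[-1][0]) for g in groups]
```

Each returned tuple is one target subtitle together with the start and end timestamps of its second subtitle set.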
In some embodiments, although the similarities between subtitle texts belonging to adjacent key frames in a second subtitle set are all greater than the first similarity threshold, the subtitles are not necessarily identical, which indicates that the two subtitles may contain a text recognition error, i.e., some word is wrong. Therefore, before the second subtitle set is deduplicated, the subtitles containing wrong words in the second subtitle set can be corrected.
To improve the efficiency of subtitle error correction, the similarity between any two subtitles in the second subtitle set can be used to identify subtitles containing wrong words, and error correction can then be applied to those subtitles.
Exemplarily, before the second subtitle set is deduplicated, the above method further includes:
when the similarity between a first subtitle text and a second subtitle text in the second subtitle set is greater than the first similarity threshold and less than a second similarity threshold, performing text error correction on the first subtitle text and the second subtitle text.
In this example, the timestamps of subtitle switches are used to divide the subtitles into sets sharing the same subtitle content, and the subtitles requiring text error correction are identified based on the similarity between subtitles within each set, improving the efficiency of subtitle error correction.
Exemplarily, the above similarity may be computed using the edit distance between the two texts. The edit distance, also called the Levenshtein distance, is a quantitative measure of the degree of difference between two character strings (for example, English words): the minimum number of edit operations required to turn one string into the other.
Exemplarily, the first similarity threshold may be 90% and the second similarity threshold may be 100%.
Exemplarily, the first similarity threshold may be 95% and the second similarity threshold may be 99.9%.
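The Levenshtein distance just described, and a normalized similarity suitable for the percentage thresholds above, can be sketched as follows; normalizing by the longer string's length is an assumption here, mirroring the confidence formula given further below in the disclosure:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(s1: str, s2: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

With this measure, two subtitle lines differing in a single character out of twenty score 0.95, which falls between the example 90% and 100% thresholds and would trigger error correction.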
In some embodiments, the wrong words in the subtitle texts undergoing error correction can be masked; a target predicted word is determined based on the masked subtitle text and the masked word, and the masked word in the subtitle text is then corrected using the target predicted word.
Exemplarily, the above performing text error correction on the first subtitle text and the second subtitle text includes:
determining, in the first subtitle text and the second subtitle text, words that occupy the same position but differ from each other as masked words;
masking the masked words in the first subtitle text and the second subtitle text to obtain a third subtitle text;
determining a target predicted word based on the third subtitle text and the masked words; and
correcting the masked words in the first subtitle text and the second subtitle text based on the target predicted word.
In this example, determining the target predicted word from the subtitle text obtained after masking the wrong words, together with the masked words themselves, makes it possible to recover the correct word accurately.
Exemplarily, the first subtitle text and the second subtitle text may be the subtitle texts of two adjacent frames.
Exemplarily, suppose the first subtitle text is "The foot traffic (人流) at this intersection is heavy, so be careful" and the second subtitle text is "The inflow (入流) at this intersection is heavy, so be careful". Then "人" and "入" are two words at the same position but with different content, i.e., the masked words.
Exemplarily, there may be one or more target predicted words.
Exemplarily, semantic analysis can be performed on the third subtitle text to obtain a predicted word set and the confidence of each predicted word in it; the target predicted word is then determined using the confidence of each predicted word and the relationship between each predicted word and the masked word.
Exemplarily, the pre-trained language model BERT (Bidirectional Encoder Representation from Transformers) from natural language processing can be used to perform the semantic analysis on the third subtitle text, obtaining the predicted word set and the confidence of each predicted word in it.
In practical applications, the BERT model is known as a MASK (masked) language model: its main training task is to predict the masked elements. Before the text is fed into the BERT model, the suspected wrong characters are masked. For example, masking the above "人" and "入" gives the third subtitle text "The current [MASK] traffic is heavy, so be careful". The BERT model is then used to predict the masked word, producing the set of candidate words for the MASK position together with the confidence of each word.
In some embodiments, the target predicted word can be determined by combining the confidence of each predicted word with the edit distance between each predicted word and the masked word.
Exemplarily, the above determining the target predicted word based on the third subtitle text and the masked word includes:
performing semantic analysis on the third subtitle text to obtain a predicted word set and the confidence of each predicted word in the set;
determining a quality score for each predicted word based on its confidence and the edit distance between it and the masked word; and
determining the target predicted word in the predicted word set based on the quality score of each predicted word.
In this example, combining the confidence of each predicted word with the edit distance between each predicted word and the masked word to select the target predicted word from multiple candidates further improves the prediction accuracy.
Exemplarily, the predicted words and their confidences can be produced by a model. The confidence indicates how credible it is that the predicted word is the true text at the masked position in the third subtitle text.
Exemplarily, the above semantic analysis of the third subtitle text may use a BERT model.
Exemplarily, suppose the words predicted by the BERT model for the MASK position are "车" and "人", where the confidence p of "车" is 0.8123 and its edit-distance score s against the masked word "入" is 0.012, while the confidence p of "人" is 0.2313 and its edit-distance score s against the IDS of "入" is 0.9342. Summing the confidence and the edit-distance score gives a quality score of 0.8243 for "车" and 1.1655 for "人". Therefore, the predicted word "人" can be selected as the target predicted word.
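The worked example above amounts to scoring each candidate by confidence plus an edit-distance score and taking the maximum. The following sketch assumes that scheme; `pick_target_word` and its `edit_sim` parameter are hypothetical names, and `edit_sim` stands for whatever normalized edit-distance measure (e.g. over character IDS sequences) is in use:

```python
def pick_target_word(predictions, masked_word, edit_sim):
    """predictions: [(word, confidence), ...] from the masked language model.
    Quality score = confidence + edit-distance score against the masked
    (suspected-wrong) word; the highest-scoring candidate wins.

    `edit_sim` is any normalized similarity in [0, 1] between a
    candidate and the masked word; the additive combination follows the
    worked example in the text."""
    scored = [(word, conf + edit_sim(word, masked_word))
              for word, conf in predictions]
    return max(scored, key=lambda item: item[1])
```

Plugging in the numbers from the example (confidences 0.8123 and 0.2313, edit-distance scores 0.012 and 0.9342) reproduces the quality scores 0.8243 and 1.1655 and selects the second candidate.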
In some embodiments, whether to apply the correction can be decided based on the quality score of the target predicted word and the edit distance between the target predicted word and the masked word, so as to avoid erroneous corrections.
Exemplarily, correcting the masked words in the first subtitle text and the second subtitle text based on the target predicted word includes:
when the quality score of the target predicted word is greater than a set quality-score threshold and the edit distance between the target predicted word and the masked word is greater than a set edit-distance threshold, replacing the masked words in the first subtitle text and the second subtitle text with the target predicted word.
Exemplarily, when the quality score of the target predicted word is less than the set quality-score threshold, or the edit distance between the target predicted word and the masked word is less than the set edit-distance threshold, the masked words in the first subtitle text and the second subtitle text are left unmodified.
In this example, deciding whether to apply the correction based on the quality score of the target predicted word and the edit distance between the target predicted word and the masked word avoids erroneous corrections.
Exemplarily, the masked word is corrected only when the quality score of the target predicted word is greater than 1 and the edit distance between the target predicted word and the masked word is greater than 0.5; otherwise, no correction is made and the subtitle text remains unchanged.
In some embodiments, after text error correction has been performed on the second subtitle set, the second subtitle set can be deduplicated to obtain the target subtitle.
Exemplarily, the above deduplicating the second subtitle set to obtain the target subtitle may include:
when the similarity between any two subtitle texts in the second subtitle set is greater than a third similarity threshold, deduplicating the second subtitle set to obtain the target subtitle.
In this example, if the similarity between any two subtitle texts in the second subtitle set is greater than the third similarity threshold, the subtitles in the second subtitle set are all the same and none contains wrong words; the second subtitle set can then be deduplicated, improving the accuracy of the resulting target subtitle.
Exemplarily, the third similarity threshold may be greater than or equal to the aforementioned second similarity threshold.
Exemplarily, the third similarity threshold may be 99%, 100%, or the like.
In some embodiments, after the text-audio combination pair is obtained through the above steps, the audio in the pair can be used to verify the text, so as to determine whether the pair can serve as a training sample for a speech recognition model, thereby improving the recognition accuracy of the speech recognition model trained on such samples.
Exemplarily, after the text-audio combination pair is obtained, the method further includes:
performing speech recognition on the target audio with the speech recognition model to obtain the verification text corresponding to the target subtitle;
determining the confidence of the combination pair based on the target subtitle and its corresponding verification text;
when the confidence of the combination pair satisfies a set confidence threshold, determining the combination pair as a training sample for the speech recognition model.
In this example, a pre-trained speech recognition model transcribes the target audio, and the resulting verification text is checked against the target subtitle to obtain the confidence of the combination pair composed of the target audio and the target subtitle. Deciding on the basis of that confidence whether to keep the pair as a training sample improves the training accuracy of the speech recognition model.
Exemplarily, the confidence of the combination pair can be computed from the edit distance between the target subtitle and its verification text together with the character lengths of the two strings.
Exemplarily, assume the result recognized by the speech recognition model (ASR) is string1 and the subtitle obtained by text recognition (OCR) is string2; the confidence C of the training sample can then be expressed as:
C = 1 - EditDistance(string1, string2) / max(len(string1), len(string2))
where EditDistance(string1, string2) is the edit distance between string1 and string2, len(string1) and len(string2) are the lengths of string1 and string2, and max() takes the maximum of its arguments.
When the ASR result and the OCR result are highly consistent, that is, string1 and string2 closely agree, the confidence C of the training sample is high; conversely, when the two disagree, C is low. Samples with low labeling quality can therefore be filtered out by thresholding C.
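The confidence formula above can be sketched as follows. The edit-distance helper is a plain dynamic-programming Levenshtein implementation; the function names are illustrative, not taken from the original disclosure.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance between two strings (single-row DP)."""
    m, n = len(s1), len(s2)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (s1[i - 1] != s2[j - 1]))      # substitution
            prev = cur
    return dp[n]

def confidence(asr_text: str, ocr_text: str) -> float:
    """C = 1 - EditDistance(string1, string2) / max(len(string1), len(string2))."""
    if not asr_text and not ocr_text:
        return 1.0
    return 1.0 - edit_distance(asr_text, ocr_text) / max(len(asr_text), len(ocr_text))
```

Identical strings give C = 1, while completely different strings drive C toward 0, so a simple threshold on C drops the low-quality pairs.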
Exemplarily, the above training samples can be used to train any speech recognition model: the target audio serves as the model input and the target subtitle as its label, the label is compared with the model output, and the model parameters are adjusted according to the comparison result.
In some embodiments, the generated training samples cannot be applied to model training directly: the text may contain digits, function symbols and the like, which need to be regularized first.
Exemplarily, after the training sample is obtained, the method further includes:
when target characters are present in the target subtitle, performing regularization on the target subtitle.
In this example, target subtitles containing target characters are regularized, so that the resulting training samples can be applied to model training.
Exemplarily, the target characters may include digits, function symbols, formulas, dates and times, units, and so on. Each target character is converted into the corresponding Chinese characters, generally via dictionary mapping and regular-expression matching.
For example, "这块黄金重达324.75克" becomes "这块黄金重达三百二十四点七五克" after regularization; "她出生于86年8月18日她弟弟出生于1995年3月1日" becomes "她出生于八六年八月十八日她弟弟出生于一九九五年三月一日"; "明天有62%的概率降雨" becomes "明天有百分之六十二的概率降雨"; and "这是手机8618544139121" becomes "这是手机八六一八五四四一三九一二一".
The following describes an application example in which a training sample set for a speech recognition model is constructed by combining OCR recognition with ASR recognition.
As shown in Figure 5, constructing the training sample set mainly involves: video key frame extraction, OCR text recognition, filtering of non-subtitle text such as watermarks and scene text, subtitle switching detection, text error correction and generation, text regularization, and sample label verification.
(1) Video key frame extraction
Each frame of the video is traversed for text detection, and the frame-difference method compares consecutive frames; redundant duplicate frames are deleted, leaving the key frames that contain subtitles. Whether two consecutive frames are duplicates is judged by comparing the similarity of the text strings extracted from the two frames, computed for example by edit distance. Text detection and recognition on image frames can be implemented with an RCNN deep learning model. Alternatively, key frames can be extracted quickly based on the FPS (frames per second): for example, at 24 FPS, 3 frames per second can be taken as key frames, i.e. one key frame every 8 frames, yielding the key frame set. This extraction method is faster. The specific key frame extraction method can be chosen according to the characteristics of the video data or the available resources.
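The frame-difference deduplication step can be sketched as below. `difflib.SequenceMatcher` stands in for the edit-distance similarity, and the per-frame recognized texts would come from the text-detection model in practice; names and the 0.9 threshold are illustrative.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the texts extracted from two frames."""
    return SequenceMatcher(None, a, b).ratio()

def dedup_frames(frame_texts, threshold=0.9):
    """Return indices of frames kept as key frames.

    A frame is redundant when its text is near-identical to the text of
    the last kept key frame."""
    kept = []
    for i, text in enumerate(frame_texts):
        if not kept or text_similarity(frame_texts[kept[-1]], text) < threshold:
            kept.append(i)
    return kept
```

For instance, three frames whose texts are "A", "A", "B" collapse to the key frames at indices 0 and 2.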
(2) OCR text recognition
After the key frames of the video file are obtained, OCR can be used to recognize, from each key frame, the text of each text block and the position of each text block within the frame. In general, the OCR result has the form [(text1, (xmin, xmax, ymin, ymax)), (text2, (xmin, xmax, ymin, ymax)), ...], where text1 and text2 are the recognized texts, xmin and xmax are the minimum and maximum abscissas of the text block in the key frame, and ymin and ymax are the minimum and maximum ordinates, as shown in Figure 4.
(3) Filtering non-subtitle text such as watermarks and backgrounds
Some TV dramas and variety shows contain background information such as station logos or advertisements; for example, "XXTV电视剧频道" in Figure 4 is a station logo. Such background text usually sits at a fixed position and is present for the entire playback duration of the video file. Therefore, by counting how often each text position occurs in the OCR results and applying a frequency threshold, this background content can be filtered out so that it is not mistaken for subtitles. Note that the position coordinates of the same text recognized by OCR in different frames differ slightly; for example, in Figure 3 the coordinates of "好好学习,天天向上" recognized in two consecutive frames might be (125, 512, 678, 780) and (126, 513, 679, 781), so the coordinates need to be "normalized": when the coordinate difference between two text blocks in different frames is below a predetermined threshold, they are treated as the same position. For example, text blocks (x1, y1) and (x2, y2) are considered to be at the same position when ||x1 - x2|| < a and ||y1 - y2|| < b.
Here a and b are predetermined thresholds that can be set according to the actual situation, e.g. a = 50 pixels and b = 100 pixels. After the watermark and background content are filtered out, the remaining texts are sorted by occurrence frequency, and the coordinate region of the most frequent text is taken as the subtitle region, denoted (Xmin, Xmax, Ymin, Ymax). A text located in this region, or whose region matches the subtitle region, is determined to be a subtitle.
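The filtering step can be sketched as follows. As a simplification of the pairwise coordinate-difference test described above, boxes are bucketed by quantized position (the "normalization"); positions that recur in almost every frame are treated as watermark or background, and the most frequent remaining position is taken as the subtitle region. All names and the 0.9 background ratio are illustrative.

```python
from collections import Counter

def quantize(box, ax=50, ay=100):
    """Bucket a (xmin, xmax, ymin, ymax) box so nearby positions coincide."""
    xmin, xmax, ymin, ymax = box
    return (xmin // ax, xmax // ax, ymin // ay, ymax // ay)

def find_subtitle_region(frame_boxes, bg_ratio=0.9):
    """frame_boxes: for each frame, a list of (xmin, xmax, ymin, ymax) boxes.

    Positions present in nearly every frame (logos, watermarks) are dropped;
    the most frequent remaining position is returned as the subtitle region."""
    counts = Counter(quantize(b) for frame in frame_boxes for b in frame)
    n_frames = len(frame_boxes)
    candidates = {k: c for k, c in counts.items() if c < bg_ratio * n_frames}
    return max(candidates, key=candidates.get) if candidates else None
```

With a logo box in all 10 frames and a subtitle box in 6 of them, only the subtitle position survives the background filter.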
(4) Subtitle switching detection, text error correction and generation
As shown in Figure 6, after the subtitles of the video are obtained, text blocks belonging to the same frame need to be merged, for example in the case of double-line subtitles. In addition, redundant key frames cause duplicated subtitles, which must be removed.
As shown in Figure 7, in the subtitle position verification module, subtitles obtained by text recognition may fail to match the subtitle region because of multi-line subtitles or insufficient background filtering; this often happens in videos produced by self-media creators. Figures 3A and 3B show a scene of a teacher lecturing, where the video contains background text and multi-line subtitles. If the subtitle region obtained above is used to verify the position of the current text, the positions may not match, i.e. the region of the text (Xminc, Xmaxc, Yminc, Ymaxc) exceeds the subtitle region, e.g. Yminc < Ymin and Ymaxc > Ymax. In this case, determining subtitles by position alone would lose subtitle content, but text outside the subtitle region cannot simply be treated as subtitles by default. Therefore, the region of the current text is marked, the corresponding audio is obtained from the timestamp at which the current text switches, ASR is run on that audio to obtain a verification text, and the verification text is used to recall the text recognized by OCR. The specific recall method is as follows:
Compute the overlap degree between the OCR-recognized text line string(ocr) and the ASR result string(asr): m = len(string(ocr) & string(asr)) / len(string(ocr)),
where len() returns the character length of a string, and string(ocr) & string(asr) denotes the characters common to string(ocr) and string(asr).
When the overlap degree is greater than the set overlap threshold, the OCR-recognized text line is determined to be subtitle content.
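The recall check above can be sketched as below. Multiset intersection is used so that a repeated character counts once per occurrence; the 0.8 threshold and the function names are assumptions for illustration.

```python
from collections import Counter

def overlap_degree(ocr_line: str, asr_text: str) -> float:
    """m = len(string(ocr) & string(asr)) / len(string(ocr))."""
    inter = Counter(ocr_line) & Counter(asr_text)   # multiset intersection
    return sum(inter.values()) / len(ocr_line) if ocr_line else 0.0

def recall_as_subtitle(ocr_line: str, asr_text: str, threshold: float = 0.8) -> bool:
    """Keep the OCR line as subtitle content when the overlap is high enough."""
    return overlap_degree(ocr_line, asr_text) >= threshold
```

A line whose characters are mostly confirmed by the ASR transcript is recalled as a subtitle even though it lies outside the detected subtitle region.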
As shown in Figure 6, after all subtitles of the video file have been obtained, the subtitles belonging to the same frame are merged. Then the subtitle switching detection shown in Figure 7 is performed.
This application example detects subtitle switching from the similarity of subtitle text between frames. As shown in Figure 7, the subtitle lines are traversed in order: record the starting frame number frame_s of the current subtitle, and compare the similarity between the subtitle of the current frame and that of the next frame. When the similarity between adjacent frames exceeds a preset threshold (e.g. 0.9), the two frames are assumed to carry the same subtitle; the next frame is skipped and comparison continues with the following frames until a different subtitle frame is found, at which point the current frame number is recorded as the subtitle's ending frame frame_e. This repeats until all subtitle lines have been traversed. The timestamps at which the current subtitle switches are thus frame_s and frame_e.
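The traversal just described can be sketched as follows. The similarity function is pluggable (the document uses an IDS-based edit distance); the segment layout and names are illustrative.

```python
def detect_switches(subs, similar, threshold=0.9):
    """subs: list of (frame_no, subtitle_text), ordered by frame number.

    Returns (subtitle, frame_s, frame_e) triples, where frame_s/frame_e are
    the timestamps at which the subtitle appears and switches away."""
    segments = []
    i = 0
    while i < len(subs):
        frame_s, text = subs[i]
        j = i
        # extend the segment while the next frame carries the same subtitle
        while j + 1 < len(subs) and similar(text, subs[j + 1][1]) > threshold:
            j += 1
        segments.append((text, frame_s, subs[j][0]))
        i = j + 1
    return segments
```

With an exact-match similarity, frames carrying "A", "A", "B", "B", "B", "C" yield the segments ("A", 0, 1), ("B", 2, 4), ("C", 5, 5).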
The similarity mentioned above can be computed from the string edit distance, for example the edit distance over IDS (Ideographic Description Sequence) representations. IDS represents an ideograph as a sequence, analogous to the letter sequence of an English word: a Chinese character is expressed through its strokes and structure, as shown for the character "贫" in Figure 8. This representation is more powerful and more robust than directly comparing visually similar characters.
If, during subtitle switching detection, the subtitles of adjacent frames have a similarity above the predetermined threshold but are not identical, the current subtitle probably contains an OCR recognition error. For example, the recognition results of two adjacent frames might be "当前路口的人流量很大,还是要小心一点" and "当前路口的入流量很大,还是要小心一点": here OCR has recognized "人" as "入". When performing text recognition, OCR outputs the recognized text, its position information, and a confidence score, so the confidence in the OCR result can also be used to judge whether the current character is wrong and needs to be corrected.
Therefore, this application example applies text error correction to erroneous subtitle content, using the pre-trained language model BERT from natural language processing. BERT is a masked (MASK) language model: its main training task is to predict masked tokens. For example, the suspect character is masked before the text is fed to BERT, as in "当前的[MASK]流量很大,还是要小心一点", and the BERT model predicts what the masked token is. The final character for the masked position is then determined from the set of candidate tokens predicted by BERT, the confidence p of each candidate, and the IDS edit-distance similarity s between the candidate and the masked character. For example, BERT's predictions for the masked position might be:
1) "车", with confidence p = 0.8123 and IDS similarity s = 0.012 to "入";
2) "人", with confidence p = 0.2313 and IDS similarity s = 0.9342 to "入".
The final scoring rule is Score = p + s; the candidate with the highest Score is selected.
In some embodiments, extra constraints can be added to improve precision and avoid erroneous modification: for example, the original character is modified only when the shape similarity s is greater than 0.5 and the Score is greater than 1; otherwise it is left unchanged.
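The selection rule with the extra constraints can be sketched as below. The candidate triples would come from the BERT predictions and the IDS similarity computation; the values here mirror the "车"/"人" example above, and the function name is illustrative.

```python
def pick_correction(original, candidates):
    """candidates: (word, p, s) triples, where p is the model confidence and
    s the IDS shape similarity to the masked character.

    Returns the correction, or the original character when the constraints
    (s > 0.5 and Score = p + s > 1) are not met."""
    word, p, s = max(candidates, key=lambda c: c[1] + c[2])
    if s > 0.5 and p + s > 1:
        return word
    return original  # keep the original character unchanged
```

With the example above, "人" (Score = 0.2313 + 0.9342) beats "车" (Score = 0.8123 + 0.012) and passes both constraints, so the OCR error "入" is corrected to "人".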
(5) Text regularization and sample label verification
The finally generated subtitle information of the video file is a subtitle file in srt format, containing time codes plus subtitle text, as shown in Figure 9.
The original audio can be extracted from the original video with audio/video processing tools such as ffmpeg; combined with the subtitle file, the audio can be segmented and matched with the subtitles to form <text, audio> pairs, i.e. the text and speech training samples needed for speech recognition training.
However, the data generated by the above process cannot be used directly to train the ASR model; the text must be regularized first. Text regularization expresses Arabic numerals, dates and times, units, etc. appearing in the text as Chinese characters, and can generally be implemented through dictionary mapping and regular-expression matching, as follows:
1) 这块黄金重达324.75克 --> regularization --> 这块黄金重达三百二十四点七五克;
2) 她出生于86年8月18日她弟弟出生于1995年3月1日 --> regularization --> 她出生于八六年八月十八日她弟弟出生于一九九五年三月一日;
3) 明天有62%的概率降雨 --> regularization --> 明天有百分之六十二的概率降雨;
4) 这是手机8618544139121 --> regularization --> 这是手机八六一八五四四一三九一二一.
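A minimal sketch of the dictionary-mapping-plus-regex approach is shown below. It only covers the digit-by-digit reading used for phone numbers (example 4); full regularization would also need positional rules for decimals, percentages, dates, and units, which are omitted here.

```python
import re

# dictionary mapping from Arabic digits to Chinese numerals
DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_digits(text: str) -> str:
    """Replace every Arabic digit with its Chinese numeral, digit by digit."""
    return re.sub(r"\d", lambda m: DIGIT_MAP[m.group(0)], text)
```

For instance, "手机8618" is regularized to "手机八六一八", while text without digits passes through unchanged.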
In addition, to guarantee the quality of the training samples, a pre-trained ASR model can be used to verify the OCR-labeled text in the training samples. Assume the audio recognition result from ASR is string1 and the subtitle recognized by OCR is string2; the labeling confidence of the data sample can then be expressed as:
C = 1 - EditDistance(string1, string2) / max(len(string1), len(string2))
When the ASR result and the OCR result are highly consistent, string1 and string2 agree and the confidence C is high; conversely, when they are inconsistent, C is low. Training samples with low labeling quality can therefore be filtered out through the confidence value C.
FIG. 10 is a block diagram of an audio and text combination apparatus according to an embodiment of the present disclosure.
As shown in FIG. 10, an embodiment of the present disclosure provides an audio and text combination apparatus, including:
a text recognition module 110, configured to perform text recognition on key frames in a video file to obtain a first subtitle set;
a subtitle switching detection module 120, configured to perform subtitle switching detection on the first subtitle set to obtain a target subtitle and the timestamp at which the target subtitle switches;
a target audio interception module 130, configured to intercept audio from the audio file of the video file based on the timestamp to obtain a target audio; and
an audio-text combination module 140, configured to construct a combination pair of text and audio based on the target subtitle and the target audio.
In some possible implementations, the text recognition module 110 may include:
a candidate subtitle recognition unit, configured to perform text recognition on key frames in the video file to obtain candidate subtitle text;
a verification audio acquisition unit, configured to determine, in the video file, the verification audio corresponding to the candidate subtitle text;
a subtitle verification unit, configured to determine that the candidate subtitle text is subtitle text of the video file when the relationship between the candidate subtitle text and the verification audio satisfies a set condition; and
a first subtitle set acquisition unit, configured to acquire the subtitle text of the video file to obtain the first subtitle set.
Exemplarily, the subtitle verification unit is specifically configured to:
perform speech recognition on the verification audio to obtain a verification text; and
determine that the candidate subtitle text is subtitle text of the video file when the overlap degree between the candidate subtitle text and the verification text is greater than a set overlap threshold.
In some possible implementations, the candidate subtitle recognition unit is specifically configured to:
perform text recognition on each key frame in the video file to obtain a first text set;
determine non-subtitle text based on the first text set;
filter the non-subtitle text out of the first text set to obtain a second text set; and
determine candidate subtitle text based on the second text set.
In some possible implementations, determining the non-subtitle text based on the first text set includes:
determining, based on the position information of each text in the first text set, the occurrence frequency of texts at the same position; and
determining the texts at the same position as non-subtitle text when the occurrence frequency is greater than a set frequency threshold.
In some possible implementations, determining the candidate subtitle text based on the second text set includes:
determining a first subtitle region based on the position information of each text in the second text set; and
for each text in the second text set, determining the text as candidate subtitle text when the degree of matching between the position information of the text and the first subtitle region is less than a set matching threshold.
In some possible implementations, determining the candidate subtitle text based on the second text set further includes:
for each text in the second text set, determining the text as subtitle text of the video file when the degree of matching between the position information of the text and the first subtitle region is greater than the set matching threshold.
In some possible implementations, the candidate subtitle recognition unit is specifically configured to:
perform text recognition on each key frame in the video file to obtain a third text set;
determine a second subtitle region based on the position information of each text in the third text set;
for each text in the third text set, determine the text as subtitle text of the video file when the degree of matching between the position information of the text and the second subtitle region is greater than a set matching threshold; and
acquire the subtitle text of the video file to obtain the first subtitle set.
In some possible implementations, the subtitle switching detection module 120 includes:
a subtitle division unit, configured to determine subtitle switching timestamps based on the similarity between any two subtitle texts belonging to adjacent key frames in the first subtitle set, and to divide the first subtitle set by these timestamps into multiple second subtitle sets;
a subtitle deduplication unit, configured to deduplicate the second subtitle sets to obtain target subtitles; and
a timestamp determination unit, configured to determine, based on the timestamps of the second subtitle sets, the timestamp at which the target subtitle switches.
In some possible implementations, the subtitle switching detection module 120 further includes:
a subtitle error correction unit, configured to perform, before the second subtitle set is deduplicated, text error correction on a first subtitle text and a second subtitle text in the second subtitle set when the similarity between the first subtitle text and the second subtitle text is greater than a first similarity threshold and less than a second similarity threshold.
In some possible implementations, the subtitle error correction unit is specifically configured to:
in the first subtitle text and the second subtitle text, determine the words that occupy the same position but differ as masked words;
mask the masked words in the first subtitle text and the second subtitle text to obtain a third subtitle text;
determine a target predicted word based on the third subtitle text and the masked words; and
correct the masked words in the first subtitle text and the second subtitle text based on the target predicted word.
In some possible implementations, the subtitle deduplication unit is specifically configured to:
deduplicate the second subtitle set to obtain the target subtitle when the similarity between every pair of subtitle texts in the second subtitle set is greater than a third similarity threshold.
In some possible implementations, the apparatus further includes:
a verification text acquisition module, configured to perform speech recognition on the target audio based on the speech recognition model to obtain the verification text corresponding to the target subtitle;
a confidence determination module, configured to determine the confidence of the combination pair based on the target subtitle and its corresponding verification text; and
a training sample determination module, configured to determine the combination pair as a training sample for the speech recognition model when the confidence of the combination pair satisfies a set confidence threshold.
In some possible implementations, the apparatus further includes:
a regularization module, configured to perform regularization on the target subtitle when target characters are present in the target subtitle.
For the functions of the units, modules and submodules in the apparatuses of the embodiments of the present disclosure, reference may be made to the corresponding descriptions in the foregoing method embodiments, which are not repeated here.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 11 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 11, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store the programs and data needed for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Multiple components of the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as displays or speakers of various types; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices over computer networks such as the Internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 801 executes the methods and processes described above, such as the audio and text combination method. For example, in some embodiments, the audio and text combination method can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the audio and text combination method described above can be performed. Alternatively, in other embodiments, the computing unit 801 can be configured to perform the audio and text combination method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above can be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user can be received in any form (including acoustic, speech, or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (18)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211049871.3A CN115396690B (en) | 2022-08-30 | Audio and text combination method, device, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211049871.3A CN115396690B (en) | 2022-08-30 | Audio and text combination method, device, electronic device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115396690A (en) | 2022-11-25 |
CN115396690B CN115396690B (en) | 2025-02-11 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115797943A (en) * | 2023-02-08 | 2023-03-14 | 广州数说故事信息科技有限公司 | Multimode-based video text content extraction method, system and storage medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150088508A1 (en) * | 2013-09-25 | 2015-03-26 | Verizon Patent And Licensing Inc. | Training speech recognition using captions |
CN106604125A (en) * | 2016-12-29 | 2017-04-26 | 北京奇艺世纪科技有限公司 | Video subtitle determining method and video subtitle determining device |
CN110210299A (en) * | 2019-04-26 | 2019-09-06 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110427930A (en) * | 2019-07-29 | 2019-11-08 | 中国工商银行股份有限公司 | Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing |
CN111445902A (en) * | 2020-03-27 | 2020-07-24 | 北京字节跳动网络技术有限公司 | Data collection method and device, storage medium and electronic equipment |
CN111723790A (en) * | 2020-06-11 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Method, device and equipment for screening video subtitles and storage medium |
CN112232260A (en) * | 2020-10-27 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Subtitle region identification method, device, equipment and storage medium |
CN112735476A (en) * | 2020-12-29 | 2021-04-30 | 北京声智科技有限公司 | Audio data labeling method and device |
CN112738640A (en) * | 2020-12-28 | 2021-04-30 | 出门问问(武汉)信息科技有限公司 | Method and device for determining subtitles of video stream and readable storage medium |
CN113408301A (en) * | 2021-07-12 | 2021-09-17 | 北京沃东天骏信息技术有限公司 | Sample processing method, device, equipment and medium |
CN113435443A (en) * | 2021-06-28 | 2021-09-24 | 中国兵器装备集团自动化研究所有限公司 | Method for automatically identifying landmark from video |
CN113450774A (en) * | 2021-06-23 | 2021-09-28 | 网易(杭州)网络有限公司 | Training data acquisition method and device |
CN113705300A (en) * | 2021-03-16 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium |
CN114067807A (en) * | 2021-11-15 | 2022-02-18 | 海信视像科技股份有限公司 | Audio data processing method and device and electronic equipment |
CN114387589A (en) * | 2021-12-14 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Voice supervision data acquisition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102683700B1 (en) | Video processing method, apparatus, electronic device and storage medium and computer program | |
CN113313022B (en) | Training method of character recognition model and method for recognizing characters in image | |
US20220270382A1 (en) | Method and apparatus of training image recognition model, method and apparatus of recognizing image, and electronic device | |
CN110008343B (en) | Text classification method, apparatus, device and computer readable storage medium | |
CN113254654B (en) | Model training, text recognition method, apparatus, equipment and medium | |
CN111723791A (en) | Character error correction method, device, equipment and storage medium | |
CN112559800B (en) | Method, apparatus, electronic device, medium and product for processing video | |
US20230020022A1 (en) | Method of recognizing text, device, storage medium and smart dictionary pen | |
CN111444349A (en) | Information extraction method and device, computer equipment and storage medium | |
US12118770B2 (en) | Image recognition method and apparatus, electronic device and readable storage medium | |
CN113205160B (en) | Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium | |
CN114022887B (en) | Text recognition model training and text recognition method and device, and electronic equipment | |
CN115098729A (en) | Video processing method, sample generation method, model training method and device | |
CN117725161A (en) | Method and system for identifying variant words in text and extracting sensitive words | |
US20160283582A1 (en) | Device and method for detecting similar text, and application | |
CN115546488B (en) | Information segmentation method, information extraction method and training method of information segmentation model | |
CN113361523A (en) | Text determination method and device, electronic equipment and computer readable storage medium | |
CN114973229A (en) | Text recognition model training method, text recognition device, text recognition equipment and medium | |
CN114611625A (en) | Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product | |
US11301627B2 (en) | Contextualized character recognition system | |
CN113076932A (en) | Method for training audio language recognition model, video detection method and device thereof | |
CN115396690A (en) | Audio and text combination method and device, electronic equipment and storage medium | |
CN114973247A (en) | Text recognition method, device, equipment and medium | |
CN114973276A (en) | Handwritten character detection method and device and electronic equipment | |
CN114724144A (en) | Text recognition method, model training method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||