CN112818680B - Corpus processing method and device, electronic equipment and computer readable storage medium - Google Patents

Corpus processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN112818680B
CN112818680B CN202010662832.5A CN202010662832A
Authority
CN
China
Prior art keywords
caption
subtitle
audio
file
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010662832.5A
Other languages
Chinese (zh)
Other versions
CN112818680A (en)
Inventor
彭俊石
吴飞
彭艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010662832.5A priority Critical patent/CN112818680B/en
Publication of CN112818680A publication Critical patent/CN112818680A/en
Application granted
Publication of CN112818680B publication Critical patent/CN112818680B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a corpus processing method and apparatus, an electronic device, and a computer readable storage medium, and relates to the field of data processing. The method comprises the following steps: acquiring a multimedia file meeting a preset condition, acquiring audio data of the multimedia file, acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file, where the processed subtitle file comprises at least one subtitle; cutting the audio data based on the at least one subtitle to obtain at least one audio data segment; and taking each subtitle and its corresponding audio data segment as a first audio subtitle pair to obtain at least one first audio subtitle pair. In this way the corpus is labeled automatically, without manual participation, which saves a large amount of labor cost and time cost, greatly improves the labeling efficiency of the corpus, reduces corpus purchase expenses, and saves a large amount of financial cost.

Description

Corpus processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for processing a corpus, an electronic device, and a computer readable storage medium.
Background
The corpus labeling methods currently on the market are manual, and mainly fall into two modes. In the first mode, a suitable body of text corpus is prepared so as to cover the pronunciations of the corresponding language as fully as possible, and readers then read the text aloud to obtain audio, thereby obtaining audio-text pairs. In the second mode, corresponding audio files are found and transcribed manually by listening, and the correct text is marked out, thereby obtaining audio-text pairs. Most corpora currently on the market are generated in the first mode.
Although the first scheme can obtain a cleaner corpus, because the corpus is obtained by reading aloud, its pronunciation deviates from a real speech environment: the reader basically reads the text in a formal, read-aloud manner, without the related context or related environmental sounds. In the second mode, the labels may be limited by the transcriber's ability to understand the audio, and some words may be transcribed with deviations.
That is, the labeling cost is not low in either mode (and is likely higher in the second mode), and because the labeling is performed manually, the labeling efficiency is low and a large amount of labor cost and time cost is required.
Disclosure of Invention
The application provides a corpus processing method, device, electronic equipment and computer readable storage medium, which can solve the problem that manual corpus labeling needs to consume a large amount of labor cost, time cost and financial cost. The technical scheme is as follows:
in a first aspect, a method for processing a corpus is provided, where the method includes:
acquiring a multimedia file meeting preset conditions, and acquiring audio data of the multimedia file;
acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
and cutting the audio data based on the at least one caption to obtain at least one audio data segment, and taking the at least one caption and the corresponding audio data segment as a first audio caption pair to obtain at least one first audio caption pair.
Preferably, the method further comprises:
training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
performing alignment processing on the at least one first audio subtitle pair based on the second acoustic model to obtain a processed second audio subtitle pair;
And repeating the steps of training a preset first acoustic model based on the at least one first audio caption pair to obtain a trained second acoustic model, and performing alignment processing on the at least one first audio caption pair based on the second acoustic model to obtain a processed second audio caption pair until the minimum value of the loss function of the trained second acoustic model converges to obtain the current trained second acoustic model.
Preferably, the processing the subtitle file based on the preset first rule to obtain a processed subtitle file includes:
filtering each caption in the caption file based on a preset second rule to obtain at least one residual caption;
and acquiring the starting and ending time of the at least one residual caption to obtain a processed caption file.
Preferably, the training the preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model includes:
Inputting any one first audio caption pair of the at least one first audio caption pair into the first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in the any one audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
calculating a loss function based on the caption in any first audio caption pair and the caption result;
and updating the first acoustic model by adopting the loss function to obtain the second acoustic model.
Preferably, the aligning the at least one first audio subtitle pair based on the second acoustic model to obtain a processed second audio subtitle pair includes:
performing time disturbance processing on the starting time and the ending time of the audio data segment in any one of the at least one first audio caption pair to obtain at least two starting times and at least two ending times;
cutting the audio data by taking each starting time in the at least two starting times as a starting cutting point and taking each ending time in the at least two ending times as an ending cutting point to obtain at least two pieces of cut audio data;
Identifying the at least two pieces of cut audio data by adopting the second acoustic model to obtain at least two phoneme sequences;
determining at least one target phoneme sequence which is the same as the identification phoneme sequence of the caption in any first audio caption pair from the at least two phoneme sequences;
acquiring the starting time and the ending time corresponding to each target phoneme sequence;
and taking the median value in each starting time as a target starting time, taking the median value in each ending time as a target ending time, obtaining the audio data segments aligned with the subtitles in any one of the first audio subtitle pairs, and taking the subtitles in any one of the first audio subtitle pairs and the aligned audio data segments as a second audio subtitle pair.
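By way of illustration only, the following Python sketch shows one possible realization of this alignment step. The helper functions cut_audio and recognize_phonemes, the perturbation offsets, and the use of the median are assumptions made for the sketch and are not limitations of the described method.

```python
import statistics

def align_pair(audio, start, end, caption_phonemes, recognize_phonemes, cut_audio,
               offsets=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """Perturb the start/end times of one audio-caption pair, keep the cuts whose
    recognized phoneme sequence matches the caption, and take the median times."""
    matched_starts, matched_ends = [], []
    for ds in offsets:
        for de in offsets:
            s, e = start + ds, end + de
            if s < 0 or e <= s:
                continue
            segment = cut_audio(audio, s, e)        # cut with the perturbed times
            phonemes = recognize_phonemes(segment)  # recognition by the second acoustic model
            if phonemes == caption_phonemes:        # same phoneme sequence as the caption
                matched_starts.append(s)
                matched_ends.append(e)
    if not matched_starts:
        return start, end                           # no match: keep the original times
    # the median of the matching start/end times is taken as the aligned boundary
    return statistics.median(matched_starts), statistics.median(matched_ends)
```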
Preferably, the obtaining the multimedia file meeting the preset condition includes:
for any multimedia file to be acquired, detecting whether the multimedia file to be acquired contains a subtitle file;
if yes, the multimedia file to be obtained is obtained.
Preferably, the acquiring the audio data of the multimedia file includes:
if the multimedia file is a video file, extracting the audio data from the video file;
And if the multimedia file is an audio file, taking the audio file as the audio data.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
detecting whether the start and stop time of any two adjacent subtitles in the subtitle file are overlapped or not;
if yes, deleting the two adjacent captions to obtain at least one residual caption.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
acquiring the non-pronouncing characters in each subtitle;
and deleting the non-pronouncing characters to obtain at least one remaining subtitle.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
acquiring the digital character and a preset target character in each subtitle;
converting the digital character into a preset digital character expression, and converting the target character into a pronunciation of a preset target language to obtain at least one residual subtitle.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
and matching each character in each subtitle with a preset character table, and deleting characters without matching items in the character table to obtain at least one remaining subtitle.
Preferably, the filtering, based on the second preset rule, each caption in the caption file to obtain at least one remaining caption includes:
acquiring the number of characters in each subtitle;
and deleting the subtitles with the number of characters not exceeding a first number threshold value to obtain at least one remaining subtitle.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
detecting whether the interval of the start and stop time of any two adjacent subtitles in the subtitle file does not exceed an interval threshold value;
if yes, splicing the two adjacent captions to obtain at least one residual caption.
Preferably, the filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
Acquiring the time length corresponding to any caption in the caption file;
when the duration does not exceed a first duration threshold or the duration exceeds a second duration threshold, deleting any caption to obtain at least one residual caption; the first time length threshold does not exceed the second time length threshold, and the first time length threshold and the second time length threshold are positive numbers.
Preferably, the filtering, based on the second preset rule, each caption in the caption file to obtain at least one remaining caption includes:
acquiring the duration and the number of characters corresponding to any caption in the caption file; wherein the time period has a corresponding second number threshold;
and deleting any caption when the number exceeds the second number threshold value, so as to obtain at least one residual caption.
Preferably, the cutting the audio data based on the remaining at least one subtitle to obtain at least one audio data segment includes:
acquiring the start-stop time of each caption in the remaining at least one caption;
and cutting the audio data based on the start-stop time of each caption to obtain the audio data segment corresponding to each caption.
In a second aspect, there is provided a corpus processing apparatus, including:
the acquisition module is used for acquiring the multimedia file meeting the preset condition and acquiring the audio data of the multimedia file;
the first processing module is used for acquiring the subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
and the second processing module is used for cutting the audio data based on the at least one caption to obtain at least one audio data segment, and taking the at least one caption and the audio data segment corresponding to the at least one caption as a first audio caption pair to obtain at least one first audio caption pair.
Preferably, the apparatus further comprises:
the third processing module is used for training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
a fourth processing module, configured to perform alignment processing on the at least one first audio subtitle pair based on the second acoustic model, to obtain a processed second audio subtitle pair;
And repeatedly calling the third processing module and the fourth processing module by taking the second audio subtitle pair as a current first audio subtitle pair and the second acoustic model as a current first acoustic model until the minimum value of the loss function of the trained second acoustic model is converged, so as to obtain the current trained second acoustic model.
Preferably, the first processing module includes:
the filtering sub-module is used for filtering each caption in the caption file based on a preset second rule to obtain at least one residual caption;
and the first processing sub-module is used for acquiring the start-stop time of the at least one residual caption to obtain a processed caption file.
Preferably, the third processing module includes:
the second processing sub-module is used for inputting any one of the at least one first audio caption pair into the first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in the any one audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
The calculation sub-module is used for calculating a loss function based on the caption in any first audio caption pair and the caption result;
and the updating sub-module is used for updating the first acoustic model by adopting the loss function to obtain the second acoustic model.
Preferably, the fourth processing module includes:
the disturbing sub-module is used for carrying out time disturbance processing on the starting time and the ending time of the audio data segment in any one of the at least one first audio caption pair to obtain at least two starting times and at least two ending times;
the cutting sub-module is used for cutting the audio data by taking each starting time in the at least two starting times as a starting cutting point and taking each ending time in the at least two ending times as an ending cutting point to obtain at least two pieces of cut audio data;
the recognition submodule is used for recognizing the at least two pieces of cut audio data by adopting the second acoustic model to obtain at least two phoneme sequences;
a matching sub-module, configured to determine at least one target phoneme sequence that is the same as the identification phoneme sequence of the subtitle in the any one of the first audio subtitle pairs from the at least two phoneme sequences;
The acquisition sub-module is used for acquiring the starting time and the ending time corresponding to each target phoneme sequence;
and the determining submodule is used for taking the median value in each starting time as a target starting time and taking the median value in each ending time as a target ending time to obtain an audio data segment aligned with the caption in any one of the first audio caption pairs, and taking the caption in any one of the first audio caption pairs and the aligned audio data segment as a second audio caption pair.
Preferably, the acquiring module is specifically configured to:
for any multimedia file to be acquired, detecting whether the multimedia file to be acquired contains a subtitle file; if yes, acquiring the multimedia file to be acquired; the method comprises the steps of,
if the multimedia file is a video file, extracting the audio data from the video file; and if the multimedia file is an audio file, taking the audio file as the audio data.
Preferably, the first processing sub-module is specifically configured to:
detecting whether the start and stop time of any two adjacent subtitles in the subtitle file are overlapped or not; if yes, deleting the two adjacent captions to obtain at least one residual caption.
Preferably, the first processing sub-module is specifically configured to:
acquiring the non-pronouncing characters in each subtitle; and deleting the non-pronouncing characters to obtain at least one remaining subtitle.
Preferably, the first processing sub-module is specifically configured to:
acquiring the digital character and a preset target character in each subtitle; converting the digital character into a preset digital character expression, and converting the target character into a pronunciation of a preset target language to obtain at least one residual subtitle.
Preferably, the first processing sub-module is specifically configured to:
and matching each character in each subtitle with a preset character table, and deleting characters without matching items in the character table to obtain at least one remaining subtitle.
Preferably, the first processing sub-module is specifically configured to:
acquiring the number of characters in each subtitle; and deleting the subtitles with the number of characters not exceeding a first number threshold value to obtain at least one remaining subtitle.
Preferably, the first processing sub-module is specifically configured to:
detecting whether the interval of the start and stop time of any two adjacent subtitles in the subtitle file does not exceed an interval threshold value; if yes, splicing the two adjacent captions to obtain at least one residual caption.
Preferably, the first processing sub-module is specifically configured to:
acquiring the time length corresponding to any caption in the caption file; when the duration does not exceed a first duration threshold or the duration exceeds a second duration threshold, deleting any caption to obtain at least one residual caption; the first time length threshold does not exceed the second time length threshold, and the first time length threshold and the second time length threshold are positive numbers.
Preferably, the first processing sub-module is specifically configured to:
acquiring the duration and the number of characters corresponding to any caption in the caption file; wherein the time period has a corresponding second number threshold; and deleting any caption when the number exceeds the second number threshold value, so as to obtain at least one residual caption.
Preferably, the cutting sub-module is specifically configured to:
acquiring the start-stop time of each caption in the remaining at least one caption; and cutting the audio data based on the start-stop time of each caption to obtain the audio data segment corresponding to each caption.
In a third aspect, an electronic device is provided, the electronic device comprising:
A processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to, by invoking the operation instruction, cause the processor to execute an operation corresponding to a processing method of a corpus as shown in the first aspect of the present application.
In a fourth aspect, a computer readable storage medium is provided, where a computer program is stored, where the program, when executed by a processor, implements a method for processing a corpus as described in the first aspect of the application.
The beneficial effects brought by the technical solutions provided in this application are as follows:
in the embodiment of the invention, a multimedia file meeting a preset condition is obtained and its audio data is acquired; a subtitle file of the multimedia file is then obtained and processed based on a preset first rule to obtain a processed subtitle file comprising at least one subtitle; the audio data is cut based on the at least one subtitle to obtain at least one audio data segment, and each subtitle and its corresponding audio data segment are taken as a first audio subtitle pair, so that at least one first audio subtitle pair is obtained. Therefore, for a multimedia file meeting the preset condition and its corresponding subtitle file, at least one audio segment can be obtained from the multimedia file and the subtitle corresponding to each audio segment can be obtained from the subtitle file, each audio segment and its subtitle forming an audio subtitle pair. An automatically labeled corpus is thus obtained without manual participation, which saves a large amount of labor cost and time cost, greatly improves the labeling efficiency of the corpus, reduces corpus purchase expenses, and saves a large amount of financial cost.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flow chart of a corpus processing method according to an embodiment of the present application;
fig. 2 is a flow chart of a corpus processing method according to another embodiment of the present application;
FIG. 3 is a flow chart of one of the processing methods for processing subtitle files in the present application;
FIG. 4 is an exemplary diagram of an audio spectral image of the present application;
FIG. 5 is a schematic diagram of the structure of an acoustic model and a schematic diagram of the logic of the acoustic model processing speech data in the present application;
FIG. 6 is a logic diagram of corpus processing in the present application;
fig. 7 is a schematic structural diagram of a corpus processing device according to another embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device for processing a corpus according to another embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any combination of one or more of the associated listed items.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Several terms which are referred to in this application are first introduced and explained:
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is becoming one of the most promising human-computer interaction modes.
The corpus processing method, device, electronic equipment and computer readable storage medium aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In one embodiment, a corpus processing method is provided, as shown in fig. 1, and the method includes:
step S101, acquiring a multimedia file meeting preset conditions, and acquiring audio data of the multimedia file;
the multimedia file may be obtained from a preset database for storing the multimedia file, or may be obtained from a network, or may be obtained by other means, and may be set according to the needs in practical application, which is not limited in the embodiment of the present invention. After the multimedia file meeting the preset condition is obtained, the audio data of the multimedia file can be further obtained.
Further, the terminal may acquire voice data, and the terminal may also recognize the voice data; or, the terminal may acquire the voice data, then the terminal sends the voice data to the server, the server identifies the voice data to obtain a caption result, and then sends the caption result to the terminal. In practical application, the setting can be performed according to practical requirements, and the embodiment of the invention is not limited to the setting.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service.
The terminal may have the following features:
(1) In a hardware system, the device includes a central processing unit, a memory, an input unit, and an output unit, that is, the device is often a microcomputer device having a communication function. In addition, there may be various input modes such as a keyboard, a mouse, a touch panel, a microphone, a camera, and the like, and the input may be adjusted as necessary. Meanwhile, the equipment often has various output modes, such as a receiver, a display screen and the like, and can be adjusted according to the needs;
(2) On the software architecture, the device must be provided with an operating system, such as Windows Mobile, Symbian, Palm, Android, or iOS. Meanwhile, these operating systems are more and more open, and personalized application programs developed on these open operating system platforms emerge in endless succession, such as address books, calendars, notepads, calculators, and various games, so that the demands of personalized users are met to a great extent;
(3) In terms of communication capability, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and the environment, thereby facilitating use. The device can support GSM (Global System for Mobile Communications), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access), and the like, thereby adapting to various system networks and supporting not only voice services but also various wireless data services;
(4) In terms of functional use, the device is more focused on humanization, individualization, and multifunctionality. With the development of computer technology, the device has shifted from a "device-centered" mode to a "people-centered" mode, integrating embedded computing, control technology, artificial intelligence technology, biometric authentication technology, and the like, and fully embodying a people-oriented purpose. Due to the development of software technology, the device can adjust its settings according to personal needs and become more personalized. Meanwhile, the device integrates numerous software and hardware components, and its functions are increasingly powerful.
Step S102, acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
the subtitle file may be an independent subtitle file of the multimedia file, or may be a subtitle file embedded in the multimedia file, or may be a subtitle file of another form, which may be set according to actual requirements in practical applications, which is not limited in the embodiment of the present invention.
After the multimedia file and the subtitle file are obtained, the subtitle file can be processed based on a preset first rule to obtain a processed subtitle file, wherein the processed subtitle file comprises at least one subtitle; the first rule includes: filtering each caption in the caption file by adopting a second rule to obtain at least one residual caption after filtering; acquiring the start-stop time of each caption in the rest at least one caption; the second rule then includes specific filtering rules including, but not limited to: filtering captions with overlapped start-stop time; filtering the non-sounding characters in the subtitle and converting Arabic numerals in the subtitle; filtering the subtitles based on the character table; filtering subtitles with the number of characters smaller than a threshold value; merging subtitles with time intervals smaller than a threshold value; filtering subtitles which do not meet the duration; and filtering subtitles with non-corresponding time length and character number.
Step S103, cutting the audio data based on the at least one caption to obtain at least one audio data segment, and using the at least one caption and the audio data segment corresponding to the at least one caption as a first audio caption pair to obtain at least one first audio caption pair.
After the processed subtitle file is obtained, the audio data can be cut based on at least one subtitle in the processed subtitle file to obtain at least one audio data segment, so that each audio data segment corresponds to one subtitle to obtain at least one first audio subtitle pair. The audio subtitle pair comprises a section of audio containing a sentence and a subtitle corresponding to the sentence.
In the embodiment of the invention, a multimedia file meeting a preset condition is obtained and its audio data is acquired; a subtitle file of the multimedia file is then obtained and processed based on a preset first rule to obtain a processed subtitle file comprising at least one subtitle; the audio data is cut based on the at least one subtitle to obtain at least one audio data segment, and each subtitle and its corresponding audio data segment are taken as a first audio subtitle pair, so that at least one first audio subtitle pair is obtained. Therefore, for a multimedia file meeting the preset condition and its corresponding subtitle file, at least one audio segment can be obtained from the multimedia file and the subtitle corresponding to each audio segment can be obtained from the subtitle file, each audio segment and its subtitle forming an audio subtitle pair. An automatically labeled corpus is thus obtained without manual participation, which saves a large amount of labor cost and time cost, greatly improves the labeling efficiency of the corpus, reduces corpus purchase expenses, and saves a large amount of financial cost.
In another embodiment, a method for processing a corpus is provided, as shown in fig. 2, and the method includes:
step S201, acquiring a multimedia file meeting preset conditions, and acquiring audio data of the multimedia file;
the multimedia file may be obtained from a preset database for storing the multimedia file, or may be obtained from a network, or may be obtained by other means, and may be set according to the needs in practical application, which is not limited in the embodiment of the present invention. After the multimedia file meeting the preset condition is obtained, the audio data of the multimedia file can be further obtained.
Further, the terminal may acquire voice data, and the terminal may also recognize the voice data; or, the terminal may acquire the voice data, then the terminal sends the voice data to the server, the server identifies the voice data to obtain a caption result, and then sends the caption result to the terminal. In practical application, the setting can be performed according to practical requirements, and the embodiment of the invention is not limited to the setting.
In a preferred embodiment of the present invention, obtaining a multimedia file meeting a preset condition includes:
For any multimedia file to be acquired, detecting whether the multimedia file to be acquired contains a subtitle file;
if yes, acquiring the multimedia file to be acquired.
Specifically, when any multimedia file is acquired, it may be detected whether the multimedia file contains a subtitle file, and if so, the multimedia file is acquired. The subtitle file may be an independent subtitle file, or may be a subtitle file embedded in a multimedia file, or may be a subtitle file in other forms, which may be set according to actual requirements in practical applications, which is not limited in the embodiment of the present invention.
In a preferred embodiment of the present invention, acquiring audio data of a multimedia file includes:
if the multimedia file is a video file, extracting audio data from the video file;
if the multimedia file is an audio file, the audio file is used as audio data.
Specifically, if the acquired multimedia file is a video file, extracting audio data from the video file; if the audio file is an audio file, the audio file is directly used as audio data.
It should be noted that, if the subtitle file is not embedded in the multimedia file, when the multimedia file is acquired, the subtitle file corresponding to the multimedia file is acquired at the same time.
Step S202, acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
the subtitle file may be an independent subtitle file of the multimedia file, or may be a subtitle file embedded in the multimedia file, or may be a subtitle file of another form, which may be set according to actual requirements in practical applications, which is not limited in the embodiment of the present invention.
After the multimedia file and the subtitle file are obtained, the subtitle file can be processed based on a preset first rule to obtain a processed subtitle file, wherein the processed subtitle file comprises at least one subtitle; the first rule includes: filtering each caption in the caption file by adopting a second rule to obtain at least one residual caption after filtering; acquiring the start-stop time of each caption in the rest at least one caption; the second rule then includes specific filtering rules including, but not limited to: filtering captions with overlapped start-stop time; filtering the non-sounding characters in the subtitle and converting Arabic numerals in the subtitle; filtering the subtitles based on the character table; filtering subtitles with the number of characters smaller than a threshold value; merging subtitles with time intervals smaller than a threshold value; filtering subtitles which do not meet the duration; and filtering subtitles with non-corresponding time length and character number.
In a preferred embodiment of the present invention, the processing the subtitle file based on the preset first rule to obtain a processed subtitle file includes:
filtering each caption in the caption file based on a preset second rule to obtain at least one residual caption;
and acquiring the starting and ending time of the at least one residual caption to obtain a processed caption file.
Because not every subtitle is usable, each subtitle in the subtitle file needs to be filtered correspondingly to obtain at least one usable subtitle, and then the start-stop time of each of the at least one subtitle is acquired to obtain the processed subtitle file. For example, if the time information of a subtitle a is 0:00:01-0:00:08, then 0:00:01 is the start time of a and 0:00:08 is the end time of a.
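As an illustrative sketch only, the following Python code parses a subtitle file in the common SRT format into (start time, end time, caption) triples; the embodiment does not restrict the subtitle file to any particular format.

```python
import re

SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

def _to_seconds(timestamp):
    """Convert an SRT timestamp such as '0:00:01,000' to seconds."""
    h, m, s, ms = SRT_TIME.match(timestamp).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Parse SRT-style subtitle text into (start_seconds, end_seconds, caption) tuples."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start_str, end_str = (part.strip() for part in lines[1].split("-->"))
        caption = " ".join(lines[2:]).strip()
        entries.append((_to_seconds(start_str), _to_seconds(end_str), caption))
    return entries
```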
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
detecting whether the start and stop time of any two adjacent subtitles in the subtitle file are overlapped or not;
if yes, deleting any two adjacent subtitles to obtain at least one remaining subtitle.
Specifically, after the start-stop time of each caption is obtained, whether the start-stop time of any two adjacent captions in all the captions is overlapped or not is detected, and if yes, the two captions detected are deleted. And so on until each caption in the caption file is detected, and obtaining at least one residual caption.
For example, if a certain subtitle file includes a, b, c, d subtitles and it is determined by detection that the ending time of b overlaps the starting time of c, b and c are deleted, so as to obtain the remaining subtitles a and d.
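A minimal Python sketch of this overlap filter, assuming captions are represented as (start, end, text) tuples sorted by start time, might look as follows.

```python
def drop_overlapping(captions):
    """Delete any two adjacent captions whose start/stop times overlap."""
    drop = set()
    for i in range(len(captions) - 1):
        _, end_i, _ = captions[i]
        start_next, _, _ = captions[i + 1]
        if end_i > start_next:          # end of one caption overlaps the start of the next
            drop.update((i, i + 1))     # both adjacent captions are deleted
    return [c for i, c in enumerate(captions) if i not in drop]
```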
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
acquiring non-pronouncing characters in each subtitle;
deleting the non-pronouncing characters to obtain at least one remaining subtitle.
Specifically, a subtitle may include non-pronouncing characters, such as musical note symbols or emoticons, so in practical applications it is necessary to obtain the non-pronouncing characters in each subtitle and then delete them to obtain at least one remaining subtitle.
It should be noted that if all of the subtitles are non-sounding characters, all of the subtitles may be deleted; if only a part of the non-pronouncing characters in a subtitle are contained, the non-pronouncing characters in the subtitle are deleted.
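For illustration, a possible Python sketch of this rule is given below; the concrete set of non-pronouncing characters is an assumption and would be chosen per corpus.

```python
import re

# Assumed set of non-pronouncing characters (musical notes, decorations, brackets).
NON_PRONOUNCING = re.compile(r"[♪♫☆★()\[\]{}<>~·…]")

def strip_non_pronouncing(caption):
    """Delete non-pronouncing characters; return None if nothing pronounceable remains."""
    cleaned = NON_PRONOUNCING.sub("", caption).strip()
    return cleaned if cleaned else None    # an all-symbol caption is deleted entirely
```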
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
acquiring a digital character and a preset target character in each subtitle;
converting the digital character into a preset digital character expression, and converting the target character into a pronunciation of a preset target language to obtain at least one residual subtitle.
The preset target character may be a non-Chinese character that has a pronunciation; for example, "%" corresponds to the pronunciation "percent". Specifically, the Arabic numerals and target characters in each caption are obtained, then all the numerals are converted into a preset numeric character expression and all the target characters are converted into the pronunciation of a preset target language, so as to obtain at least one remaining caption. For example, the Arabic numerals in a subtitle are converted into Chinese numerals, such as "1" into "one", while "%" in the subtitle is converted into "percent".
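A simplified character-level Python sketch of this conversion is shown below; the mappings (e.g. "1" to "一", "%" to "百分之") are illustrative assumptions, and a production system would handle multi-digit numbers and word order properly.

```python
# Assumed mappings for a Chinese corpus; only a character-by-character sketch.
DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
TARGET_MAP = {"%": "百分之", "+": "加"}

def normalize_characters(caption):
    """Convert Arabic digits and preset target characters into their spoken form."""
    return "".join(DIGIT_MAP.get(ch, TARGET_MAP.get(ch, ch)) for ch in caption)
```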
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
And matching each character in each subtitle with a preset character table, and deleting the characters without matching items in the character table to obtain at least one residual subtitle.
Specifically, at least one caption obtained by processing at least one of the processing methods can be matched with a preset character table, and characters without matching items in the character table are all deleted, so that at least one residual caption is further obtained.
Of course, each caption in the acquired caption file may be matched with a preset character table, and the characters in the character table, for which no matching item exists, may be deleted, so as to obtain at least one remaining caption. In practical application, the setting can be performed according to the requirement, and the embodiment of the invention is not limited to this.
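A possible Python sketch of the character-table rule, assuming the table is held as a set of allowed characters:

```python
def filter_by_char_table(caption, char_table):
    """Keep only characters that have a matching entry in the preset character table."""
    return "".join(ch for ch in caption if ch in char_table)

# Usage sketch: char_table would normally be loaded from the recognition lexicon.
char_table = set("一二三四五六七八九零百分之加")   # hypothetical table for illustration
print(filter_by_char_table("一百#分之五*", char_table))
```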
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
acquiring the number of characters in each subtitle;
and deleting the subtitles with the number of characters not exceeding a first number threshold value to obtain at least one remaining subtitle.
If a caption contains too few characters, misrecognition may result; therefore, in order to make the corpus cleaner and avoid misrecognition, captions with too few characters need to be deleted. Specifically, the number of characters in each caption is obtained, the captions whose number of characters does not exceed a first number threshold (such as 4) are determined, and those captions are deleted to obtain at least one remaining caption.
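For example, with captions held as (start, end, text) tuples, this rule might be sketched in Python as follows (the threshold of 4 follows the example above and is an assumption):

```python
MIN_CHARS = 4   # assumed first number threshold

def drop_short_captions(captions, min_chars=MIN_CHARS):
    """Delete captions whose character count does not exceed the first number threshold."""
    return [(s, e, text) for (s, e, text) in captions if len(text) > min_chars]
```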
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
detecting whether the interval of the start and stop time of any two adjacent subtitles in the subtitle file does not exceed an interval threshold value;
if yes, any two adjacent subtitles are spliced, and at least one remaining subtitle is obtained.
In practical applications, when the caption corresponding to a sentence is long, it may be split into two or more captions for ease of reading, with only a very short interval between them. Therefore, in order to ensure accuracy, the embodiment of the invention can determine the time interval between subtitles based on their start-stop times; if the interval between any two adjacent subtitles does not exceed an interval threshold (for example, 0.5 seconds), the two subtitles are likely to belong to one caption and are therefore spliced together. This continues until every caption in the caption file has been checked, and at least one remaining caption is obtained.
After any two adjacent subtitles are spliced, the starting time of the spliced subtitle is the starting time of the earlier subtitle, and the ending time is the ending time of the later subtitle.
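An illustrative Python sketch of this splicing rule, assuming a 0.5-second interval threshold as in the example above:

```python
GAP_THRESHOLD = 0.5   # assumed interval threshold in seconds

def merge_close_captions(captions, gap=GAP_THRESHOLD):
    """Splice adjacent captions whose start/stop interval does not exceed the threshold.
    The merged caption keeps the earlier start time and the later end time."""
    if not captions:
        return []
    merged = [list(captions[0])]
    for start, end, text in captions[1:]:
        prev = merged[-1]
        if start - prev[1] <= gap:       # interval between previous end and current start
            prev[1] = end                # ending time of the later caption
            prev[2] = prev[2] + text     # concatenate the caption text
        else:
            merged.append([start, end, text])
    return [tuple(c) for c in merged]
```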
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
acquiring the time length corresponding to any caption in the caption file;
when the duration does not exceed the first duration threshold or the duration exceeds the second duration threshold, deleting any caption to obtain at least one residual caption; the first time length threshold does not exceed the second time length threshold, and the first time length threshold and the second time length threshold are positive numbers.
In the embodiment of the invention, the time length of one caption corresponds to the time length of the sentence corresponding to the caption, so if the time length of one caption is too short, for example, less than 1 second, the sentence corresponding to the caption is also less than 1 second, and thus, the erroneous recognition is easily caused; if the length of a caption is too long, the sentence corresponding to the caption is too long, the recognition speed is slower, the efficiency is low, and the training of the acoustic model is not facilitated. Therefore, the duration corresponding to any caption in the caption file can be obtained based on the start-stop time of the caption, then whether the duration does not exceed a first duration threshold (for example, 1 second) or exceeds a second duration threshold (for example, 10 seconds) is detected, and if so, the caption is deleted; the first time length threshold does not exceed the second time length threshold, and the first time length threshold and the second time length threshold are positive numbers. And so on, after the detection of each caption in the captions is completed, at least one residual caption is obtained.
In a preferred embodiment of the present invention, filtering each caption in the caption file based on a preset second rule to obtain at least one remaining caption, including:
acquiring the duration and the number of characters corresponding to any caption in the caption file; the duration has a corresponding second number threshold;
and if the number exceeds the second number threshold, deleting any caption to obtain at least one residual caption.
In practical applications, the duration of a sentence is related to the number of characters in its text. For example, a user speaking normally may utter about 10 characters in 1 second; if a 1-second caption contains 30 characters, it is obviously abnormal. Although such a situation may be genuine, it is not beneficial to training the acoustic model: the speech speed is too fast and the continuous-reading phenomenon too serious, which may result in serious misrecognition. Therefore, in the embodiment of the invention, corresponding second number thresholds are preset for different durations (for example, 10 characters for 1 second and 20 characters for 2 seconds); for any caption in the caption file, the corresponding duration and number of characters are obtained, and if the number of characters exceeds the second number threshold, the caption is deleted. This continues until every caption in the caption file has been checked, and at least one remaining caption is obtained.
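The duration rule and the duration/character-count rule might be sketched together in Python as follows; the thresholds (1 s, 10 s, 10 characters per second) follow the examples above and are assumptions, not limitations.

```python
MIN_DURATION = 1.0          # assumed first duration threshold (seconds)
MAX_DURATION = 10.0         # assumed second duration threshold (seconds)
MAX_CHARS_PER_SECOND = 10   # assumed rate behind the per-duration second number threshold

def drop_abnormal_captions(captions):
    """Delete captions that are too short, too long, or whose character count
    exceeds the second number threshold for their duration."""
    kept = []
    for start, end, text in captions:
        duration = end - start
        if duration <= MIN_DURATION or duration > MAX_DURATION:
            continue                                  # duration rule
        if len(text) > duration * MAX_CHARS_PER_SECOND:
            continue                                  # duration/character-count rule
        kept.append((start, end, text))
    return kept
```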
It should be noted that, in practical applications, when each subtitle in the subtitle file is processed based on the preset rules, the processing may be performed sequentially in the above order, as shown in fig. 3; alternatively, only one or more of the above methods may be applied. This may be set according to actual requirements, and the embodiment of the present invention is not limited thereto.
Step 203, cutting the audio data based on the at least one caption to obtain at least one audio data segment, and using the at least one caption and the audio data segment corresponding to the at least one caption as a first audio caption pair to obtain at least one first audio caption pair;
after the processed subtitle file is obtained, the audio data can be cut based on at least one subtitle in the processed subtitle file to obtain at least one audio data segment, so that each audio data segment corresponds to one subtitle to obtain at least one first audio subtitle pair. The audio subtitle pair comprises a section of audio containing a sentence and a subtitle corresponding to the sentence. For example, the duration of a certain voice data is 15", wherein two sentences A, B are included, the time information of A is 0:00:01-0:00:08, the time information of B is 0:00:10-0:00:15, a subtitle file is obtained through recognition by the trained acoustic model, the subtitle file comprises two subtitles a and B, the time information of a is 0:00:01-0:00:08, the time information of B is 0:00:10-0:00:15, A-a is an audio subtitle pair, and B-B is an audio subtitle pair.
In a preferred embodiment of the present invention, the audio data is cut based on the remaining at least one subtitle to obtain at least one audio data segment, including:
acquiring the start-stop time of each caption in the rest at least one caption;
and cutting the audio data based on the start-stop time of each caption to obtain the audio data segment corresponding to each caption.
Specifically, for at least one remaining caption, the start-stop time of each caption is obtained, then the audio data is cut according to the start-stop time of each caption, each audio data segment is obtained, and the same start-stop time corresponds to one caption and one audio data segment.
For example, the duration of a certain piece of audio data is 15", and the subtitle file corresponding to the audio data includes two subtitles a and b, where the time information of a is 0:00:01 (start time) to 0:00:08 (end time) and the time information of b is 0:00:10 (start time) to 0:00:15 (end time); the portions of the audio data from 0:00:01 to 0:00:08 and from 0:00:10 to 0:00:15 are then cut out to obtain two audio data segments a' and b', where a and a' are taken as one first audio subtitle pair and b and b' are taken as another first audio subtitle pair.
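As a hedged illustration of this cutting step, the sketch below uses the third-party pydub library (an assumption; any audio toolkit such as ffmpeg would serve) to cut the audio data at each caption's start-stop time and return the first audio-caption pairs.

```python
from pydub import AudioSegment   # assumed third-party dependency for illustration

def cut_audio_segments(audio_path, captions, out_prefix="segment"):
    """Cut the audio data at each caption's start/stop time; return a list of
    (caption_text, segment_path) pairs, i.e. the first audio-caption pairs."""
    audio = AudioSegment.from_file(audio_path)
    pairs = []
    for i, (start, end, text) in enumerate(captions):
        segment = audio[int(start * 1000):int(end * 1000)]   # pydub slices in milliseconds
        segment_path = f"{out_prefix}_{i:05d}.wav"
        segment.export(segment_path, format="wav")
        pairs.append((text, segment_path))
    return pairs
```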
Step S204, training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
after at least one first audio subtitle pair is obtained, the at least one first audio subtitle pair is input into a preset first acoustic model to be trained, and a trained second acoustic model is obtained.
Specifically, the acoustic model may include a three-layer neural network structure: a convolutional layer (CNN), a recurrent layer (BLSTM) and a transcription layer (CTC). The network structure of the acoustic model can be as shown in Table 1:
[Table 1: layer-by-layer network structure of the acoustic model (convolutional layers, BLSTM-with-dropout layers and a CTC layer); presented as an image in the original]
Where "CTC" is the transcribed layer, "stm-with-Dropout 5" is the circulating layer, the others are the convolved layers.
The convolutional layer is used for extracting a feature sequence from the audio spectrogram. Specifically, after the voice data is processed through framing, windowing, feature extraction and the like, a two-dimensional audio spectrogram corresponding to the voice data can be obtained, as shown in fig. 4, where the horizontal axis represents time and the vertical axis represents the feature dimension. After the audio spectrogram is obtained, the convolutional layer can be used to extract the feature sequence from it.
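A minimal sketch of the framing/windowing/feature-extraction step that produces such a spectrogram, assuming the open-source librosa library; the sample rate, window, hop and mel-band counts are common illustrative choices, not values from the patent:

import librosa

def audio_to_spectrogram(path, sr=16000):
    """Frame, window and transform audio into a 2-D spectrogram (time x feature)."""
    y, sr = librosa.load(path, sr=sr)
    # 25 ms windows with a 10 ms hop (400/160 samples at 16 kHz) are typical choices.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(mel)
    return log_mel.T  # shape: (time_steps, feature_dim)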
The recurrent layer is used for predicting the label distribution of the feature sequence extracted by the convolutional layer. In particular, humans do not start thinking from a blank brain at every moment: when reading an article, a human infers the meaning of the current word from the understanding of the words already seen, rather than discarding everything and thinking from scratch. Human thought has persistence, whereas a traditional feed-forward neural network does not, which is a significant shortcoming. The recurrent layer therefore uses historical information to infer the current information; furthermore, future information can also inform the current information, which is the origin of the bidirectional LSTM (Long Short-Term Memory).
The recurrent layer is formed by a bidirectional LSTM recurrent neural network and predicts the label distribution of each feature vector in the feature sequence. Since LSTM operates along a time dimension, the width (time axis) of the feature sequence is treated as the LSTM time steps in the acoustic model. The Map-to-Sequence custom network layer mainly converts between the error feedback and the feature sequence, serving as a bridge between the convolutional layer and the recurrent layer, so that errors can be fed back from the recurrent layer to the convolutional layer.
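A rough PyTorch sketch of such a CNN + bidirectional-LSTM stack with the Map-to-Sequence reshape between them; the layer sizes, strides and label count are illustrative assumptions and do not reproduce the patent's Table 1:

import torch
import torch.nn as nn

class CnnBlstmCtc(nn.Module):
    """Sketch of a convolutional + bidirectional-LSTM acoustic model with a CTC output layer."""
    def __init__(self, n_mels=80, n_phonemes=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=256,
                            num_layers=2, bidirectional=True, batch_first=True,
                            dropout=0.1)
        self.fc = nn.Linear(2 * 256, n_phonemes + 1)  # +1 for the CTC blank label

    def forward(self, spec):              # spec: (batch, n_mels, time)
        x = spec.unsqueeze(1)             # (batch, 1, n_mels, time)
        x = self.conv(x)                  # (batch, 32, n_mels//4, time//2)
        b, c, f, t = x.shape
        # "Map-to-Sequence": fold channels and frequency into one feature vector per frame
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.lstm(x)               # bidirectional LSTM over the time steps
        return self.fc(x)                 # per-frame label distribution (logits)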
The transcription layer is used for converting the label distribution obtained from the recurrent layer into the final subtitle result through operations such as duplicate removal and integration. Specifically, conventional acoustic model training for speech recognition requires the corresponding label of every frame of data to be known for effective training, so the speech must be aligned with the labels before training. The alignment of speech and labels itself requires repeated iterations to ensure accuracy, which is a time-consuming task.
In practical application, the flow of identifying voice data by the acoustic model is shown in fig. 5.
Compared with traditional acoustic model training, the embodiment of the invention adopts CTC as the loss function for acoustic model training. This is complete end-to-end acoustic model training: the data does not need to be aligned in advance, and training requires only one input sequence and one output sequence. Therefore, data alignment and one-to-one labeling are not needed, and CTC directly outputs the probability of the sequence prediction without external post-processing.
The transcription layer integrates the per-frame predictions of the recurrent layer and converts them into the finally output subtitle result. The last bidirectional LSTM network layer in the acoustic model is connected to a CTC layer, thereby realizing end-to-end recognition. CTC is mainly used to solve the alignment problem between the input data and a given label, and can be used for end-to-end training and for outputting sequence results of indefinite length. CTC can be used to predict the positions (start-stop times) of the characters and words of the subtitle in the first audio subtitle pair within the audio data segment, so the audio data segment in the first audio subtitle pair can be aligned with the subtitle even without data alignment and one-to-one labeling. For example, the start-stop time of a sentence in the audio data segment of a certain first audio subtitle pair is 1.0" to 4.1", and the start-stop time of the subtitle obtained through CTC prediction is 1.5" to 4.6".
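The duplicate removal and integration performed by the transcription layer corresponds to standard greedy CTC decoding. A minimal sketch, assuming the per-frame logits of a single utterance are available as a PyTorch tensor of shape (time, classes) and that label id 0 is the CTC blank (both are assumptions, not stated in the patent):

def greedy_ctc_decode(logits, blank=0, id_to_phoneme=None):
    """Collapse per-frame predictions: merge consecutive repeats, then drop blanks."""
    best_path = logits.argmax(dim=-1).tolist()   # most likely label per frame
    decoded, previous = [], blank
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    if id_to_phoneme is not None:
        return [id_to_phoneme[i] for i in decoded]
    return decoded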
In this way, the voice data is sequentially processed through the three layers of neural networks in the trained acoustic model, and then the caption result corresponding to the voice data can be obtained.
Further, at least one caption may be included in the caption result, each caption having time information, and the time information of each caption corresponds to the time information of the corresponding sentence in the voice data.
In a preferred embodiment of the present invention, training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model includes:
inputting any one of at least one first audio caption pair into a first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in any one of the first audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
calculating a loss function based on the subtitle and subtitle results in any first audio subtitle pair;
and updating the first acoustic model by adopting a loss function to obtain a second acoustic model.
Specifically, after at least one first audio caption pair is obtained, at least one first audio caption pair is input into an initial first acoustic model, a feature sequence is extracted from audio data segments in any one audio caption pair by a convolution layer in the first acoustic model, label distribution of the feature sequence is predicted through a circulation layer, label distribution is converted through a transcription layer, caption results corresponding to the audio data segments are obtained, then a loss function is obtained through calculation based on the caption results and captions in any one first audio caption pair, and various parameters in the initial first acoustic model are updated through the loss function, so that an updated second acoustic model is obtained. For example, in the previous example, the loss function is calculated based on 1.0 "to 4.1" and 1.5 "to 4.6", and then the loss function is used to update each parameter in the first acoustic model.
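A rough sketch of one such training update, assuming PyTorch, the CnnBlstmCtc sketch above, and that each subtitle has already been converted into a sequence of phoneme ids (the conversion is described further below); the optimizer and hyperparameters are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = CnnBlstmCtc(n_mels=80, n_phonemes=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(spec, spec_lengths, targets, target_lengths):
    """One update: forward pass, CTC loss against the subtitle phoneme labels, backward pass."""
    logits = model(spec)                            # (batch, time, classes)
    log_probs = F.log_softmax(logits, dim=-1)
    log_probs = log_probs.permute(1, 0, 2)          # CTCLoss expects (time, batch, classes)
    # spec_lengths must be the time lengths *after* the conv downsampling;
    # targets is a padded (batch, max_target_len) tensor of phoneme ids.
    loss = ctc_loss(log_probs, targets, spec_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()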
Step S205, aligning the at least one first audio caption pair based on the second acoustic model to obtain a processed second audio caption pair;
After the trained second acoustic model is obtained, it may be used to further align the at least one first audio subtitle pair. In practical applications, the alignment process can reduce the error between the start-stop time of the speech and the start-stop time of the subtitle, so the quality of the audio subtitle pair can be improved by the alignment process.
In a preferred embodiment of the present invention, the alignment processing is performed on at least one first audio subtitle pair based on the second acoustic model, so as to obtain a processed second audio subtitle pair, which includes:
for any one of at least one first audio caption pair, performing time disturbance processing on the starting time and the ending time of an audio data segment in any one first audio caption pair to obtain at least two starting times and at least two ending times;
cutting the audio data by taking each of at least two starting times as a starting cutting point and each of at least two ending times as an ending cutting point to obtain at least two pieces of cut audio data;
Identifying at least two pieces of cut audio data by adopting a second acoustic model to obtain at least two phoneme sequences;
determining at least one target phoneme sequence which is the same as the identification phoneme sequence of the caption in any first audio caption pair from at least two phoneme sequences;
acquiring the starting time and the ending time corresponding to each target phoneme sequence;
and taking the median value in each starting time as a target starting time, taking the median value in each ending time as a target ending time, obtaining audio data segments aligned with the subtitles in any one of the first audio subtitle pairs, and taking the subtitles in any one of the first audio subtitle pairs and the aligned audio data segments as a second audio subtitle pair.
In practical applications, since the subtitles of a multimedia file are added manually, there are often errors between the start-stop time of a sentence in the multimedia file and the start-stop time of the subtitle corresponding to that sentence. For example, the start-stop time of a sentence in the multimedia file may be 1.0" to 4.1", while the start-stop time of the corresponding subtitle differs slightly (for example, 1.5" to 4.6", as in the earlier example), so an error exists between the two. When the audio data segment is cut, this may cut off some characters at the head or tail of the sentence, or cut in extra characters from neighbouring sentences. Therefore, in the embodiment of the invention, each first audio subtitle pair can be aligned and cleaned.
Specifically, for the audio data segment in any first audio subtitle pair, the starting time and the ending time of the audio data segment are subjected to time disturbance processing, so that at least two starting times and at least two ending times are obtained.
For example, assume the start time of a sentence is start and the end time is end; cutting at start and end yields one audio data segment. Perturbation is then applied to start and end, for example over [start-200ms, start+200ms] and [end-200ms, end+500ms] with a step of 100ms, so that start has 5 cutting points: start-200ms, start-100ms, start, start+100ms, start+200ms, and end has 8 cutting points: end-200ms, end-100ms, end, end+100ms, end+200ms, end+300ms, end+400ms, end+500ms. The audio data is cut by taking each of the 5 start points as an initial cutting point and each of the 8 end points as an end cutting point, giving 40 pieces of cut audio data in total.
The second acoustic model is then used to recognize the at least two pieces of cut audio data, obtaining a phoneme sequence corresponding to each piece of cut audio data. Since the output of the acoustic model is a phoneme sequence, that is, a pinyin sequence, each subtitle in the subtitle file needs to be converted into a phoneme sequence as well; for example, the phoneme sequence corresponding to the subtitle "你好" ("hello") is "n i h ao". The conversion can be performed through a dictionary. To obtain a more accurate conversion, the text needs to be segmented into words first, and an open-source word segmentation tool can be chosen, such as jieba for Chinese or MeCab for Japanese. After word segmentation, a pronunciation system is constructed so that words can be mapped to phonemes, and each subtitle in the subtitle file is converted using the dictionary. Of course, other conversion methods are also applicable to the embodiment of the present invention and may be set according to actual requirements; the embodiment of the present invention is not limited thereto.
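A minimal sketch of this dictionary-based subtitle-to-phoneme conversion, using jieba for segmentation; the tiny pronunciation dictionary below is a hypothetical stand-in for a full lexicon:

import jieba  # open-source Chinese word segmentation, as mentioned above

# Hypothetical pronunciation dictionary; a real system would use a full lexicon.
PRONUNCIATION = {
    "你好": ["n", "i", "h", "ao"],
    "世界": ["sh", "i", "j", "ie"],
}

def subtitle_to_phonemes(text):
    """Segment the subtitle, then map each word to phonemes via the dictionary."""
    phonemes = []
    for word in jieba.cut(text):
        phonemes.extend(PRONUNCIATION.get(word, []))  # unknown words are skipped here
    return phonemes

# e.g. subtitle_to_phonemes("你好世界") -> ['n', 'i', 'h', 'ao', 'sh', 'i', 'j', 'ie']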
After obtaining the phoneme sequences corresponding to each piece of the cut audio data, comparing each phoneme sequence with the identification phoneme sequence of the caption in any one of the first audio caption pairs, for example, comparing the first phoneme and the last phoneme of the two phoneme sequences, and determining at least one target phoneme sequence identical to the identification phoneme sequence from each phoneme sequence. For example, if 10 of the 40 phoneme sequences are identical to the identification phoneme sequence, the 10 phoneme sequences are regarded as target phoneme sequences.
And acquiring the starting time and the ending time corresponding to each target phoneme sequence, taking the median value in each starting time as the target starting time, taking the median value in each ending time as the target ending time, obtaining the audio data segment aligned with the caption in any first audio caption pair, and taking the caption in any first audio caption pair and the aligned audio data segment as a second audio caption pair.
For example, the 10 phoneme sequences correspond to 10 start times and 10 end times, the median value in the 10 start times is taken as a target start time, that is, the 10 start times are ordered in time sequence, the start time of the 5 th or 6 th order is taken as a target start time, and the target end time is obtained in the same way.
Thus, the chosen start-stop time is neither the minimum, because that easily causes under-cutting and missing characters, nor the maximum, because that easily causes over-cutting and extra characters. The start-stop time determined from the median value is taken as the correct start-stop time of the aligned and cleaned audio data segment; the audio data is then cut according to the target start time and the target end time to obtain the final audio data segment, and the final audio data segment together with the subtitle in the audio subtitle pair is used as a second audio subtitle pair.
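A sketch of this perturbation-and-median alignment for one pair, assuming audio is a pydub AudioSegment and recognize is a stand-in for running the second acoustic model and returning the predicted phoneme sequence; using the statistical median is a close substitute for the "take the 5th or 6th sorted time" rule described above, and clamping to the audio bounds is omitted for brevity:

import numpy as np

def align_pair(audio, start, end, ref_phonemes, recognize,
               pre=0.2, post_start=0.2, post_end=0.5, step=0.1):
    """Perturb the cut points, keep candidates whose recognition matches the
    subtitle's phoneme sequence, and take the median start/end time."""
    starts = np.arange(start - pre, start + post_start + 1e-9, step)  # 5 candidate starts
    ends = np.arange(end - pre, end + post_end + 1e-9, step)          # 8 candidate ends
    matched_starts, matched_ends = [], []
    for s in starts:
        for e in ends:
            segment = audio[int(s * 1000):int(e * 1000)]   # pydub-style ms slicing
            if recognize(segment) == ref_phonemes:          # second acoustic model agrees
                matched_starts.append(s)
                matched_ends.append(e)
    if not matched_starts:
        return start, end                                   # no match: keep original times
    return float(np.median(matched_starts)), float(np.median(matched_ends))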
Step S206, taking the second audio caption pair as the current first audio caption pair and the second acoustic model as the current first acoustic model, repeatedly performing the steps of training the preset first acoustic model based on the at least one first audio caption pair to obtain a trained second acoustic model and aligning the at least one first audio caption pair based on the second acoustic model to obtain a processed second audio caption pair, until the minimum value of the loss function of the trained second acoustic model converges, thereby obtaining the current trained second acoustic model.
Specifically, the second audio caption pair is used as the current first audio caption pair and the second acoustic model is used as the current first acoustic model; the training step and the alignment step are then repeated until the minimum value of the loss function of the trained second acoustic model converges, yielding the current trained second acoustic model. The repeated steps are the same as the training and alignment steps described above and are not described again here.
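A compact sketch of this train/align iteration; train and align are stand-ins for the training and alignment routines sketched above, and the convergence test on the loss decrease is illustrative:

def build_corpus(pairs, model, train, align, max_rounds=10, tol=1e-3):
    """Alternate training and alignment until the training loss stops improving."""
    previous_loss = float("inf")
    for _ in range(max_rounds):
        loss = train(model, pairs)                 # step S204: fit the acoustic model
        pairs = [align(model, p) for p in pairs]   # step S205: re-align each pair
        if previous_loss - loss < tol:             # loss minimum has converged
            break
        previous_loss = loss
    return pairs, model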
Further, for ease of understanding, fig. 6 shows a training flow of the acoustic model, and specific training steps may refer to the above training steps, which are not described herein.
Furthermore, the corpus obtained by the method can be used as sample data of a speech recognition model.
In the embodiment of the invention, a multimedia file meeting a preset condition is obtained, audio data of the multimedia file is obtained, then a subtitle file of the multimedia file is obtained, and the subtitle file is processed based on a preset first rule to obtain a processed subtitle file; the processed caption file comprises at least one caption, the audio data is cut based on the at least one caption to obtain at least one audio data segment, and the at least one caption and the audio data segment corresponding to the at least one caption are used as a first audio caption pair to obtain at least one first audio caption pair. Therefore, aiming at the multimedia file meeting the preset conditions and the corresponding subtitle file thereof, at least one section of audio can be obtained from the multimedia file, and the subtitle corresponding to each section of audio can be obtained from the subtitle file, wherein each section of audio and the subtitle form an audio subtitle pair, the corpus of automatic annotation is obtained, manual participation is not needed, a great deal of labor cost and time cost are saved, the annotation efficiency of the corpus is greatly improved, the purchase expense of the corpus is reduced, and a great deal of financial cost is saved.
Further, after each audio subtitle pair is obtained, a preset first acoustic model can be trained based on each audio subtitle pair to obtain a trained second acoustic model, and then each audio subtitle pair is further aligned by adopting the trained second acoustic model, so that each processed second audio subtitle pair is obtained, and loop iteration is performed, so that the precision of the corpus can be improved, and the quality of the corpus can be improved.
Fig. 7 is a schematic structural diagram of a corpus processing apparatus according to another embodiment of the present application, and as shown in fig. 7, the apparatus of this embodiment may include:
an obtaining module 701, configured to obtain a multimedia file meeting a preset condition, and obtain audio data of the multimedia file;
a first processing module 702, configured to obtain a subtitle file of the multimedia file, and process the subtitle file based on a preset first rule, to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
the second processing module 703 is configured to cut the audio data based on the at least one caption to obtain at least one audio data segment, and to use the at least one caption and the audio data segment corresponding to the at least one caption as a first audio caption pair to obtain at least one first audio caption pair.
In a preferred embodiment of the invention, the device further comprises:
the third processing module is used for training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
a fourth processing module, configured to perform alignment processing on the at least one first audio subtitle pair based on the second acoustic model, to obtain a processed second audio subtitle pair;
and repeatedly calling the third processing module and the fourth processing module by taking the second audio caption pair as a current first audio caption pair and the second acoustic model as a current first acoustic model until the minimum value of the loss function of the trained second acoustic model converges, so as to obtain the current trained second acoustic model.
In a preferred embodiment of the invention, the first processing module comprises:
the filtering sub-module is used for filtering each caption in the caption file based on a preset second rule to obtain at least one residual caption;
and the first processing sub-module is used for acquiring the start-stop time of the at least one residual caption to obtain a processed caption file.
In a preferred embodiment of the present invention, the third processing module includes:
The second processing sub-module is used for inputting any one of the at least one first audio caption pair into the first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in any one of the first audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
the calculation sub-module is used for calculating a loss function based on the caption and caption result in any first audio caption pair;
and the updating sub-module is used for updating the first acoustic model by adopting the loss function to obtain a second acoustic model.
In a preferred embodiment of the present invention, the fourth processing module includes:
the disturbing sub-module is used for carrying out time disturbance processing on the starting time and the ending time of the audio data section in any one of the at least one first audio caption pair to obtain at least two starting times and at least two ending times;
the cutting sub-module is used for cutting the audio data by taking each of at least two starting times as a starting cutting point and taking each of at least two ending times as an ending cutting point to obtain at least two pieces of cut audio data;
The recognition sub-module is used for recognizing at least two pieces of cut audio data by adopting a second acoustic model to obtain at least two phoneme sequences;
a matching sub-module, configured to determine at least one target phoneme sequence that is the same as the identification phoneme sequence of the subtitle in any one of the first audio subtitle pairs from at least two phoneme sequences;
the acquisition sub-module is used for acquiring the starting time and the ending time corresponding to each target phoneme sequence;
and the determining submodule is used for taking the median value in each starting time as a target starting time and taking the median value in each ending time as a target ending time to obtain an audio data segment aligned with the caption in any first audio caption pair, and taking the caption in any first audio caption pair and the aligned audio data segment as a second audio caption pair.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
for any multimedia file to be acquired, detecting whether the multimedia file to be acquired contains a subtitle file; if yes, acquiring a multimedia file to be acquired; the method comprises the steps of,
if the multimedia file is a video file, extracting audio data from the video file; if the multimedia file is an audio file, the audio file is used as audio data.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
detecting whether the start and stop time of any two adjacent subtitles in the subtitle file are overlapped or not; if yes, deleting any two adjacent subtitles to obtain at least one remaining subtitle.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
acquiring non-pronouncing characters in each subtitle; deleting the non-pronouncing characters to obtain at least one remaining subtitle.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
acquiring a digital character and a preset target character in each subtitle; converting the digital character into a preset digital character expression, and converting the target character into a pronunciation of a preset target language to obtain at least one residual subtitle.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
and matching each character in each subtitle with a preset character table, and deleting the characters without matching items in the character table to obtain at least one residual subtitle.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
Acquiring the number of characters in each subtitle; and deleting the subtitles with the number of characters not exceeding a first number threshold value to obtain at least one remaining subtitle.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
detecting whether the interval of the start and stop time of any two adjacent subtitles in the subtitle file does not exceed an interval threshold value; if yes, any two adjacent subtitles are spliced, and at least one remaining subtitle is obtained.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
acquiring the time length corresponding to any caption in the caption file; when the duration does not exceed the first duration threshold or the duration exceeds the second duration threshold, deleting any caption to obtain at least one residual caption; the first time length threshold does not exceed the second time length threshold, and the first time length threshold and the second time length threshold are positive numbers.
In a preferred embodiment of the present invention, the first processing sub-module is specifically configured to:
acquiring the duration and the number of characters corresponding to any caption in the caption file; the duration has a corresponding second number threshold; and if the number exceeds the second number threshold, deleting any caption to obtain at least one residual caption.
In a preferred embodiment of the invention, the cutting submodule is specifically adapted to:
acquiring the start-stop time of each caption in the rest at least one caption; and cutting the audio data based on the start-stop time of each caption to obtain the audio data segment corresponding to each caption.
The corpus processing device in this embodiment may execute the corpus processing methods shown in the first embodiment and the second embodiment of the present application, and the implementation principles are similar, and are not repeated here.
In the embodiment of the invention, a multimedia file meeting a preset condition is obtained, audio data of the multimedia file is obtained, then a subtitle file of the multimedia file is obtained, and the subtitle file is processed based on a preset first rule to obtain a processed subtitle file; the processed caption file comprises at least one caption, the audio data is cut based on the at least one caption to obtain at least one audio data segment, and the at least one caption and the audio data segment corresponding to the at least one caption are used as a first audio caption pair to obtain at least one first audio caption pair. Therefore, aiming at the multimedia file meeting the preset conditions and the corresponding subtitle file thereof, at least one section of audio can be obtained from the multimedia file, and the subtitle corresponding to each section of audio can be obtained from the subtitle file, wherein each section of audio and the subtitle form an audio subtitle pair, the corpus of automatic annotation is obtained, manual participation is not needed, a great deal of labor cost and time cost are saved, the annotation efficiency of the corpus is greatly improved, the purchase expense of the corpus is reduced, and a great deal of financial cost is saved.
Further, after each audio subtitle pair is obtained, a preset first acoustic model can be trained based on each audio subtitle pair to obtain a trained second acoustic model, and then each audio subtitle pair is further aligned by adopting the trained second acoustic model, so that each processed second audio subtitle pair is obtained, and loop iteration is performed, so that the precision of the corpus can be improved, and the quality of the corpus can be improved.
In yet another embodiment of the present application, there is provided an electronic device including: a memory and a processor; at least one program stored in the memory for execution by the processor, which, when executed by the processor, performs: acquiring a multimedia file meeting preset conditions, acquiring audio data of the multimedia file, acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed caption file comprises at least one caption, the audio data is cut based on the at least one caption to obtain at least one audio data segment, and the at least one caption and the audio data segment corresponding to the at least one caption are used as a first audio caption pair to obtain at least one first audio caption pair. Therefore, aiming at the multimedia file meeting the preset conditions and the corresponding subtitle file thereof, at least one section of audio can be obtained from the multimedia file, and the subtitle corresponding to each section of audio can be obtained from the subtitle file, wherein each section of audio and the subtitle form an audio subtitle pair, the corpus of automatic annotation is obtained, manual participation is not needed, a great deal of labor cost and time cost are saved, the annotation efficiency of the corpus is greatly improved, the purchase expense of the corpus is reduced, and a great deal of financial cost is saved.
Further, after each audio subtitle pair is obtained, a preset first acoustic model can be trained based on each audio subtitle pair to obtain a trained second acoustic model, and then each audio subtitle pair is further aligned by adopting the trained second acoustic model, so that each processed second audio subtitle pair is obtained, and loop iteration is performed, so that the precision of the corpus can be improved, and the quality of the corpus can be improved.
In an alternative embodiment, an electronic device is provided, as shown in fig. 8, the electronic device 8000 shown in fig. 8 comprising: a processor 8001, and a memory 8003. Processor 8001 is coupled to memory 8003, such as via bus 8002. Optionally, electronic device 8000 may also include a transceiver 8004. In practice, the transceiver 8004 is not limited to one, and the structure of the electronic device 8000 is not limited to the embodiment of the present application.
The processor 8001 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof, and may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 8001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 8002 may include a path to transfer information between the components. Bus 8002 may be a PCI bus or an EISA bus, etc. Bus 8002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
Memory 8003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 8003 is used to store application code for executing the present application and is controlled by the processor 8001 to execute. Processor 8001 is used to execute application code stored in memory 8003 to implement what is shown in any of the method embodiments described above.
Among them, electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like.
Yet another embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the corresponding content of the foregoing method embodiments. Compared with the prior art, acquiring a multimedia file meeting preset conditions, acquiring audio data of the multimedia file, acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed caption file comprises at least one caption, the audio data is cut based on the at least one caption to obtain at least one audio data segment, and the at least one caption and the audio data segment corresponding to the at least one caption are used as a first audio caption pair to obtain at least one first audio caption pair. Therefore, aiming at the multimedia file meeting the preset conditions and the corresponding subtitle file thereof, at least one section of audio can be obtained from the multimedia file, and the subtitle corresponding to each section of audio can be obtained from the subtitle file, wherein each section of audio and the subtitle form an audio subtitle pair, the corpus of automatic annotation is obtained, manual participation is not needed, a great deal of labor cost and time cost are saved, the annotation efficiency of the corpus is greatly improved, the purchase expense of the corpus is reduced, and a great deal of financial cost is saved.
Further, after each audio subtitle pair is obtained, a preset first acoustic model can be trained based on each audio subtitle pair to obtain a trained second acoustic model, and then each audio subtitle pair is further aligned by adopting the trained second acoustic model, so that each processed second audio subtitle pair is obtained, and loop iteration is performed, so that the precision of the corpus can be improved, and the quality of the corpus can be improved.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (13)

1. The corpus processing method is characterized by comprising the following steps of:
acquiring a multimedia file meeting preset conditions, and acquiring audio data of the multimedia file;
acquiring a subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
cutting the audio data based on the at least one caption to obtain at least one audio data segment, and taking the at least one caption and the corresponding audio data segment as a first audio caption pair to obtain at least one first audio caption pair;
training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
performing alignment processing on the at least one first audio subtitle pair based on the second acoustic model to obtain a processed second audio subtitle pair;
The second audio caption pair is used as a current first audio caption pair, the second acoustic model is used as a current first acoustic model, training is repeatedly carried out on a preset first acoustic model based on the at least one first audio caption pair to obtain a trained second acoustic model, and the at least one first audio caption pair is aligned based on the second acoustic model to obtain a processed second audio caption pair until the minimum value of a loss function of the trained second acoustic model is converged to obtain a current trained second acoustic model;
training a preset first acoustic model based on the at least one first audio subtitle to obtain a trained second acoustic model, including:
inputting any one first audio caption pair of the at least one first audio caption pair into the first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in any one audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
Calculating a loss function based on the caption in any first audio caption pair and the caption result;
and updating the first acoustic model by adopting the loss function to obtain the second acoustic model.
2. The method for processing the corpus according to claim 1, wherein the processing the subtitle file based on the preset first rule to obtain the processed subtitle file includes:
filtering each caption in the caption file based on a preset second rule to obtain at least one residual caption;
and acquiring the starting and ending time of the at least one residual caption to obtain a processed caption file.
3. The method for processing the corpus according to claim 1, wherein the aligning the at least one first audio subtitle pair based on the second acoustic model to obtain a processed second audio subtitle pair includes:
performing time disturbance processing on the starting time and the ending time of the audio data segment in any one of the at least one first audio caption pair to obtain at least two starting times and at least two ending times;
Cutting the audio data by taking each starting time in the at least two starting times as a starting cutting point and taking each ending time in the at least two ending times as an ending cutting point to obtain at least two pieces of cut audio data;
identifying the at least two pieces of cut audio data by adopting the second acoustic model to obtain at least two phoneme sequences;
determining at least one target phoneme sequence which is the same as the identification phoneme sequence of the caption in any first audio caption pair from the at least two phoneme sequences;
acquiring the starting time and the ending time corresponding to each target phoneme sequence;
and taking the median value in each starting time as a target starting time, taking the median value in each ending time as a target ending time, obtaining the audio data segments aligned with the subtitles in any one of the first audio subtitle pairs, and taking the subtitles in any one of the first audio subtitle pairs and the aligned audio data segments as a second audio subtitle pair.
4. The method for processing a corpus according to claim 1, wherein the obtaining a multimedia file meeting a preset condition includes:
For any multimedia file to be acquired, detecting whether the multimedia file to be acquired contains a subtitle file;
if yes, the multimedia file to be obtained is obtained.
5. The method for processing the corpus according to claim 2, wherein the obtaining the audio data of the multimedia file includes:
if the multimedia file is a video file, extracting the audio data from the video file;
and if the multimedia file is an audio file, taking the audio file as the audio data.
6. The method for processing the corpus according to claim 2, wherein filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
detecting whether the start and stop time of any two adjacent subtitles in the subtitle file are overlapped or not;
if yes, deleting the two adjacent captions to obtain at least one residual caption.
7. The method for processing the corpus according to claim 2, wherein filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
Acquiring the non-pronouncing characters in each subtitle;
and deleting the non-pronouncing characters to obtain at least one remaining subtitle.
8. The method for processing the corpus according to claim 2, wherein filtering each subtitle in the subtitle file based on the preset second rule to obtain at least one remaining subtitle includes:
acquiring the digital character and a preset target character in each subtitle;
converting the digital character into a preset digital character expression, and converting the target character into a pronunciation of a preset target language to obtain at least one residual subtitle.
9. The method for processing a corpus according to any one of claims 1 to 8, wherein filtering each subtitle in the subtitle file based on a preset second rule to obtain at least one remaining subtitle includes:
and matching each character in each subtitle with a preset character table, and deleting characters without matching items in the character table to obtain at least one remaining subtitle.
10. The method for processing the corpus according to claim 2, wherein the cutting the audio data based on the remaining at least one caption to obtain at least one audio data segment comprises:
Acquiring the start-stop time of each caption in the remaining at least one caption;
and cutting the audio data based on the start-stop time of each caption to obtain the audio data segment corresponding to each caption.
11. A corpus processing apparatus, comprising:
the acquisition module is used for acquiring the multimedia file meeting the preset condition and acquiring the audio data of the multimedia file;
the first processing module is used for acquiring the subtitle file of the multimedia file, and processing the subtitle file based on a preset first rule to obtain a processed subtitle file; the processed subtitle file comprises at least one subtitle;
the second processing module is used for cutting the audio data based on the at least one caption to obtain at least one audio data segment, and taking the at least one caption and the audio data segment corresponding to the at least one caption as a first audio caption pair to obtain at least one first audio caption pair;
the third processing module is used for training a preset first acoustic model based on the at least one first audio subtitle pair to obtain a trained second acoustic model;
A fourth processing module, configured to perform alignment processing on the at least one first audio subtitle pair based on the second acoustic model, to obtain a processed second audio subtitle pair;
repeatedly calling the third processing module and the fourth processing module by taking the second audio subtitle pair as a current first audio subtitle pair and the second acoustic model as a current first acoustic model until the minimum value of the loss function of the trained second acoustic model is converged to obtain a current trained second acoustic model;
the third processing module includes:
the second processing sub-module is used for inputting any one of the at least one first audio caption pair into the first acoustic model, so that a convolution layer in the first acoustic model extracts a characteristic sequence from an audio data segment in any one of the first audio caption pair, predicts label distribution of the characteristic sequence through a circulation layer, and converts the label distribution through a transcription layer to obtain a caption result corresponding to the audio data segment;
the calculation sub-module is used for calculating a loss function based on the caption in any first audio caption pair and the caption result;
And the updating sub-module is used for updating the first acoustic model by adopting the loss function to obtain the second acoustic model.
12. An electronic device, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to execute the corpus processing method according to any one of claims 1 to 10 by invoking the operation instruction.
13. A computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the method of processing a corpus according to any of the preceding claims 1-10.