CN116996632A - Video subtitle generating method, electronic equipment and storage medium - Google Patents

Video subtitle generating method, electronic equipment and storage medium

Info

Publication number
CN116996632A
CN116996632A (Application No. CN202310829756.6A)
Authority
CN
China
Prior art keywords
audio
video
file
target
subtitle
Prior art date
Legal status
Pending
Application number
CN202310829756.6A
Other languages
Chinese (zh)
Inventor
杨丹
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202310829756.6A
Publication of CN116996632A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present application relates to the field of computer technologies, and in particular, to a video subtitle generating method, an electronic device, and a storage medium. In the video subtitle generating method, a video terminal device acquires a subtitle generating instruction and a target video file, and performs audio extraction on the target video file to obtain a target audio file; uploads the target audio file to a resource server so that the resource server forms resource attribute data; acquires the resource attribute data from the resource server once the target audio file has been stored there; sends the resource attribute data to a subtitle generating server based on the subtitle generating instruction, so that the subtitle generating server generates a video subtitle file from the resource attribute data and a pre-trained audio recognition model; and finally acquires the video subtitle file from the subtitle generating server. The method therefore allows the video terminal device to add subtitles automatically while reducing the computational burden placed on it.

Description

Video subtitle generating method, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video subtitle generating method, an electronic device, and a storage medium.
Background
Currently, if a user wants to add subtitles to a video file, a separate software tool is needed to edit the subtitles manually, which is time-consuming and inefficient. In some business fields (for example, insurance agents in the insurance field), a large proportion of practitioners have no video editing experience, yet they often need to shoot and produce short videos related to their field for publicity and promotion, so it is difficult for them to add subtitles to video files on their own.
In the related art, automatically adding subtitles to video requires configuring algorithm toolkits for speech recognition, audio/video processing and the like on the device, and the automatic subtitling function is implemented through these toolkits. However, subtitle generation is not a main functional module of a video terminal device (such as a mobile phone or a video camera); implementing it merely by installing additional speech recognition and audio/video processing toolkits occupies storage space on the video terminal device and affects its operating speed. How to provide automatic subtitling while reducing the computational burden on the video terminal device is therefore a problem to be solved in the industry.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a video subtitle generating method, electronic equipment and a storage medium, which can reduce the computational burden on video terminal equipment while realizing the function of automatically adding subtitles.
An embodiment of a video subtitle generating method according to a first aspect of the present application is applied to a video terminal device, and includes:
acquiring a subtitle generation instruction and a target video file, and extracting audio based on the target video file to obtain a target audio file;
uploading the target audio file to a resource server so that the resource server forms resource attribute data;
when the target audio file is stored in the resource server, acquiring the resource attribute data from the resource server;
sending the resource attribute data to a subtitle generating server based on the subtitle generating instruction, so that the subtitle generating server generates a video subtitle file according to the resource attribute data and a pre-trained audio recognition model;
and acquiring the video subtitle file from the subtitle generating server.
According to some embodiments of the application, the target video file includes a plurality of short video clips, and the audio extraction is performed based on the target video file to obtain a target audio file, including:
Traversing the short video segments to obtain segment arrangement information corresponding to each short video segment;
performing audio extraction on the short video segments based on the segment arrangement information to obtain audio segments corresponding to the short video segments;
and integrating the plurality of audio clips to obtain the target audio file.
According to some embodiments of the application, uploading the target audio file to a resource server to cause the resource server to form resource attribute data includes:
acquiring storage identification information corresponding to the target audio file, wherein the storage identification information is used for identifying a project deployment environment corresponding to the target audio file;
and uploading the target audio file to the resource server based on the storage identification information so that the resource server forms the resource attribute data.
According to some embodiments of the application, the obtaining the video subtitle file from the subtitle generating server includes:
inquiring the generation condition of the video subtitle file from the subtitle generation server based on a preset time interval to obtain result feedback information;
and acquiring the video subtitle file from the subtitle generating server based on the result feedback information.
According to some embodiments of the application, the obtaining the video subtitle file from the subtitle generating server based on the result feedback information includes:
when the video subtitle file fails to be generated, fault index data are obtained based on the result feedback information;
and executing fault feedback action based on the fault index data.
The video subtitle generating method according to the embodiment of the second aspect of the present application is applied to a subtitle generating server, and includes:
acquiring resource attribute data from video terminal equipment, and analyzing and processing the resource attribute data to obtain a resource positioning link, video identification information and audio identification information;
downloading a target audio file from the resource server based on the resource location link;
performing audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text;
integrating the target identification text based on the video identification information to obtain a video subtitle file;
and sending the video subtitle file to the video terminal equipment.
According to some embodiments of the application, the performing audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text includes:
Based on the audio identification information, performing audio identification processing on the target audio file through a pre-trained audio identification model to obtain a primary identification text;
and correcting the primary recognition text based on a preset business term specification to obtain the target recognition text.
According to some embodiments of the present application, before the audio recognition processing is performed on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text, the method further includes pre-training the audio recognition model, specifically including:
acquiring a training audio set, wherein the training audio set comprises training audio and an audio conversion label of the training audio;
identifying the training audio through the original identification model to obtain training identification data;
comparing the training identification data with the audio conversion label to obtain identification accuracy;
when the identification accuracy is lower than a preset accuracy threshold, updating the original identification model based on the identification accuracy;
and carrying out iterative training on the updated original recognition model based on the training audio and the audio conversion label until the recognition accuracy is not lower than the accuracy threshold value, so as to obtain the audio recognition model.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the video subtitle generating method according to any one of the embodiments of the first aspect of the present application.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program, where the computer program is executed by a processor to implement a video subtitle generating method according to any one of the embodiments of the first aspect of the present application.
The video subtitle generating method, the electronic equipment and the storage medium have at least the following beneficial effects:
in the video subtitle generating method, the video terminal device first acquires a subtitle generating instruction and a target video file, performs audio extraction on the target video file to obtain a target audio file, and uploads the target audio file to a resource server so that the resource server forms resource attribute data. When the target audio file has been stored in the resource server, the video terminal device acquires the resource attribute data from the resource server and, based on the subtitle generating instruction, sends the resource attribute data to a subtitle generating server. After obtaining the resource attribute data from the video terminal device, the subtitle generating server parses it to obtain a resource positioning link, video identification information and audio identification information, downloads the target audio file from the resource server based on the resource positioning link, performs audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text, and integrates the target recognition text based on the video identification information to obtain a video subtitle file. Finally, the video terminal device acquires the video subtitle file from the subtitle generating server. The method therefore allows the video terminal device to add subtitles automatically while reducing the computational burden placed on it.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic flow chart of a video subtitle generating method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of step S101 in the embodiment shown in FIG. 1 according to the present application;
FIG. 3 is a flowchart illustrating step S102 of the embodiment of FIG. 1 according to the present application;
fig. 4 is another schematic flow chart provided by the video subtitle generating method according to the embodiment of the present application;
FIG. 5 is a schematic flow chart of step S107 of the embodiment shown in FIG. 1 according to the present application;
FIG. 6 is a flow chart illustrating step S109 of the embodiment of FIG. 1 according to the present application;
FIG. 7 is a flow chart illustrating step S602 of the embodiment of FIG. 6 according to the present application;
fig. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
In the description of the present application, "several" means one or more and "a plurality" means two or more; greater than, less than, exceeding and the like are understood to exclude the stated number, while above, below, within and the like are understood to include it. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of the technical features referred to, or implicitly indicating their precedence.
In the description of the present application, it should be understood that the direction or positional relationship indicated with respect to the description of the orientation, such as up, down, left, right, front, rear, etc., is based on the direction or positional relationship shown in the drawings, is merely for convenience of describing the present application and simplifying the description, and does not indicate or imply that the apparatus or element to be referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be determined reasonably by a person skilled in the art in combination with the specific content of the technical solution. In addition, the following description of specific steps does not represent limitations on the order of steps or logic performed, and the order of steps and logic performed between steps should be understood and appreciated with reference to what is described in the embodiments.
The speech-to-text technique, also called automatic speech recognition (ASR), is a technique that converts human speech into text. Speech recognition is a multidisciplinary field closely coupled with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science and many other disciplines. Because speech signals are diverse and complex, speech recognition systems can only achieve satisfactory performance under certain constraints, or can only be used in certain specific applications. The performance of a speech recognition system depends roughly on four types of factors: the size of the recognition vocabulary and the complexity of the speech, the quality of the speech signal, whether there is a single speaker or multiple speakers, and the hardware.
Currently, if a user wants to add subtitles to a video file, a separate software tool is needed to edit the subtitles manually, which is time-consuming and inefficient. In some business fields (for example, insurance agents in the insurance field), a large proportion of practitioners have no video editing experience, yet they often need to shoot and produce short videos related to their field for publicity and promotion, so it is difficult for them to add subtitles to video files on their own.
In the related art, automatically adding subtitles to video requires configuring algorithm toolkits for speech recognition, audio/video processing and the like on the device, and the automatic subtitling function is implemented through these toolkits. However, subtitle generation is not a main functional module of a video terminal device (such as a mobile phone or a video camera); implementing it merely by installing additional speech recognition and audio/video processing toolkits occupies storage space on the video terminal device and affects its operating speed. How to provide automatic subtitling while reducing the computational burden on the video terminal device is therefore a problem to be solved in the industry.
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a video subtitle generating method, electronic equipment and a storage medium, which can reduce the computational burden on video terminal equipment while realizing the function of automatically adding subtitles.
The following is a further description based on the accompanying drawings.
Fig. 1 is an optional flowchart of the video subtitle generating method according to an embodiment of the present application. The steps performed by the video terminal device may include, but are not limited to, steps S101 to S104 and S109 described below, and the steps performed by the subtitle generating server may include, but are not limited to, steps S105 to S108 described below.
Step S101, acquiring a subtitle generating instruction and a target video file, and extracting audio based on the target video file to obtain a target audio file;
step S102, uploading a target audio file to a resource server so that the resource server forms resource attribute data;
step S103, when the target audio file is stored in the resource server, acquiring resource attribute data from the resource server;
step S104, based on the subtitle generating instruction, the resource attribute data is sent to a subtitle generating server, so that the subtitle generating server generates a video subtitle file according to the resource attribute data and the pre-trained audio recognition model;
Step S105, acquiring resource attribute data from the video terminal equipment, and analyzing the resource attribute data to obtain a resource positioning link, video identification information and audio identification information;
step S106, downloading the target audio file from the resource server based on the resource positioning link;
step S107, performing audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text;
step S108, integrating the target identification text based on the video identification information to obtain a video subtitle file;
in step S109, the video terminal apparatus acquires a video subtitle file from the subtitle generating server.
In the video subtitle generating method, the video terminal device first acquires a subtitle generating instruction and a target video file, performs audio extraction on the target video file to obtain a target audio file, and uploads the target audio file to a resource server so that the resource server forms resource attribute data. When the target audio file has been stored in the resource server, the video terminal device acquires the resource attribute data from the resource server and, based on the subtitle generating instruction, sends the resource attribute data to a subtitle generating server. After obtaining the resource attribute data from the video terminal device, the subtitle generating server parses it to obtain a resource positioning link, video identification information and audio identification information, downloads the target audio file from the resource server based on the resource positioning link, performs audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text, and integrates the target recognition text based on the video identification information to obtain a video subtitle file. Finally, the video terminal device acquires the video subtitle file from the subtitle generating server. The method therefore allows the video terminal device to add subtitles automatically while reducing the computational burden placed on it.
The video terminal device refers to a terminal device for displaying video, such as a mobile phone, a computer or a tablet computer. The resource server may be an IOBS file server that stores files uploaded by the video terminal device. The subtitle generating server refers to a server that provides a subtitle generating service; in some embodiments, the subtitle generating server generates the video subtitle file from the target audio file immediately after downloading it and sends the video subtitle file to the video terminal device. In some exemplary embodiments, the target video file, the target audio file or other types of files on the video terminal device may be uploaded to the resource server, and the subtitle generating server provides the computing resources required for audio conversion, which saves storage space and memory on the video terminal device and thereby improves its service performance. After uploading the target audio file to the resource server, the video terminal device can obtain the corresponding resource attribute data from the resource server. It should be pointed out that the resource attribute data can be used to locate the download address of the target audio file, so that when the resource attribute data is sent to the subtitle generating server as a resource index, the subtitle generating server can autonomously download the target audio file from the resource server based on the resource attribute data and use it as the basis for generating the video subtitle file; the target audio file, which has a larger data volume, does not need to be forwarded, so the video subtitle file can be generated efficiently.
In step S101 of some embodiments of the present application, a subtitle generating instruction and a target video file are acquired, and audio extraction is performed based on the target video file, so as to obtain a target audio file. It should be noted that, the caption generating instruction refers to a control instruction for initiating the generation of the video caption file, where the caption generating instruction may be obtained by the video terminal device from other devices, may be generated based on an input signal received in an input module of the video terminal device, or may be obtained by other manners. In addition, the target video file refers to a video file to which subtitles are required. It should be understood that the target video file may be a video shot by the video terminal device or may be a video downloaded by the video terminal device. In some exemplary embodiments of the present application, after the target video file is acquired, audio extraction is performed based on the target video file to obtain the target audio file, and it should be noted that the target audio file may be the complete audio of the target video file or may be an audio paragraph in the target video file.
Referring to fig. 2, step S101 according to some embodiments of the present application may include, but is not limited to, steps S201 to S203 described below.
Step S201, traversing short video segments to obtain segment arrangement information corresponding to each short video segment;
step S202, performing audio extraction on short video clips based on clip arrangement information to obtain audio clips corresponding to the short video clips;
in step S203, the plurality of audio clips are integrated to obtain the target audio file.
It should be noted that some practitioners in certain business fields (such as insurance agents in the insurance field) have no video editing experience, yet often need to use some devices to shoot and produce short videos related to their field for publicity and promotion. In that case the corresponding target video file is not limited to a single video and may instead be a group of interrelated short video clips, which are processed as described below.
In order to extract a target audio file corresponding to a target video file from a group of interrelated short videos, in some exemplary embodiments, the short video segments are traversed first by the method shown in steps S201 to S203 to obtain segment arrangement information corresponding to each short video segment, where the segment arrangement information indicates information that each short video segment is arranged based on a time sequence, and it should be pointed out that the segment arrangement information is not limited to reflect a time sequence ordering relationship of the short video segments from old to new according to a generation time, but may also reflect an ordering relationship formed by a user by freely setting a delivery sequence of the short video segments. Further, the short video clips are subjected to audio extraction based on the clip arrangement information, and the arrangement sequence of each piece of audio can be processed according to the clip arrangement information while the short video clips are subjected to audio extraction, so that the audio clips corresponding to the short video clips are obtained. And further, integrating the plurality of audio clips to obtain the target audio file.
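By way of illustration only (the patent itself contains no source code), the following Python sketch shows one possible realization of steps S201 to S203: it assumes the short video clips are local files already sorted according to the segment arrangement information, uses the ffmpeg command-line tool (an assumption, not mandated by the patent) to extract a mono 16 kHz audio track from each clip, and concatenates the audio clips into a single target audio file. All file names are hypothetical.

```python
# Illustrative sketch of steps S201-S203; assumes ffmpeg is installed and that the
# clip list below stands in for the patent's "segment arrangement information".
import subprocess
import tempfile
from pathlib import Path

def extract_target_audio(clips_in_order: list[str], target_audio: str) -> None:
    """Extract audio from each short video clip and merge the clips in order."""
    wav_parts = []
    for index, clip in enumerate(clips_in_order):
        part = Path(tempfile.gettempdir()) / f"part_{index:03d}.wav"
        # -vn drops the video stream; 16 kHz mono PCM is a common ASR input format.
        subprocess.run(
            ["ffmpeg", "-y", "-i", clip, "-vn", "-ac", "1", "-ar", "16000", str(part)],
            check=True,
        )
        wav_parts.append(part)

    # Build a concat list file and let ffmpeg join the audio clips without re-encoding.
    list_file = Path(tempfile.gettempdir()) / "audio_parts.txt"
    list_file.write_text("".join(f"file '{p}'\n" for p in wav_parts))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(list_file),
         "-c", "copy", target_audio],
        check=True,
    )

# Hypothetical clip list already sorted by the segment arrangement information.
extract_target_audio(["clip_001.mp4", "clip_002.mp4"], "target_audio.wav")
```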
The embodiment shown in steps S201 to S203 can handle such fragmented short video clips, which further improves the universality of the video subtitle generating method, makes it easier for people without video editing experience to extract the target audio file, and thereby facilitates the subsequent generation of the video subtitle file.
In steps S102 to S103 of some embodiments of the present application, the target audio file is uploaded to the resource server so that the resource server forms resource attribute data, and when the target audio file has been stored in the resource server, the resource attribute data is obtained from the resource server. After the target audio file is extracted from the target video file, it is uploaded to a resource server (e.g., an IOBS file server) to avoid occupying storage space on the video terminal device. When the target audio file is stored in the resource server, the target audio file to be converted into a video subtitle file can be found in the resource server, so after obtaining the resource attribute data from the resource server the video terminal device can request the subtitle generating server to generate the video subtitle file based on that data. The resource attribute data refers to attribute information identifying a resource file in the resource server, for example uniform resource locator (URL) positioning information, a download link, the audio type, the video type and other information. In some exemplary embodiments, the resource attribute data can be used to locate the download address of the target audio file; by sending the resource attribute data to the subtitle generating server as a resource index, the subtitle generating server can autonomously download the target audio file from the resource server based on the resource attribute data and use it as the basis for generating the video subtitle file, without the target audio file of larger data volume having to be forwarded, so the video subtitle file can be generated efficiently.
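As an illustrative sketch of steps S102 to S103 (not part of the patent disclosure), the fragment below assumes an HTTP upload interface on the resource server; the endpoint URL, request fields and JSON layout of the returned resource attribute data are all assumptions rather than the actual IOBS interface.

```python
# Illustrative sketch only; the endpoint and field names are hypothetical.
import requests

RESOURCE_SERVER = "https://resource-server.example.com"  # hypothetical address

def upload_target_audio(audio_path: str, env_tag: str) -> dict:
    """Upload the target audio file and return the resource attribute data."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{RESOURCE_SERVER}/files",
            files={"file": f},
            data={"deploy_env": env_tag},  # storage identification information
            timeout=30,
        )
    resp.raise_for_status()
    # Assumed shape: {"url": ..., "audio_type": ..., "video_type": ...}
    return resp.json()

attrs = upload_target_audio("target_audio.wav", "prod")
print(attrs.get("url"))  # resource positioning link used later by the subtitle server
```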
Referring to fig. 3, step S102 according to some embodiments of the present application may include, but is not limited to, the following steps S301 to S302.
Step S301, storage identification information corresponding to a target audio file is obtained, wherein the storage identification information is used for identifying an item deployment environment corresponding to the target audio file;
step S302, the target audio file is uploaded to the resource server based on the storage identification information, so that the resource server forms resource attribute data.
In steps S301 to S302 of some embodiments of the present application, storage identification information corresponding to the target audio file is obtained, where the storage identification information identifies the project deployment environment corresponding to the target audio file, and the target audio file is uploaded to the resource server based on the storage identification information so that the resource server forms the resource attribute data. In some exemplary embodiments, the resource server may be an IOBS file server, in which case the storage identification information refers to the IOBS configuration information corresponding to the IOBS file server.
It should be appreciated that project deployment environments can generally be divided into three categories: the production environment, the test environment and the development environment. The development environment is a server dedicated to development; its configuration is relatively loose, and to ease development and debugging it generally has all error reporting and test tools enabled, making it the most basic of the three project deployment environments. The test environment is a clone of the production environment's configuration; a program that behaves abnormally in the test environment cannot be released to the production environment, so the test environment is a transitional environment between development and production. The production environment is the server that provides external services; it generally has error reporting turned off and error logging turned on, and is the most important of the three environments. These three project deployment environments can also be regarded as the three phases of program development: development, testing and going online. The development environment is the local environment in which developers work, for example where the front end and back end are jointly debugged; after local development is completed, the code is typically deployed to the test environment for testing. Because different systems may interact with each other, every time a software project is tested these systems need to be in the same test environment so that the joint debugging can be tested. Therefore, through some preferred embodiments shown in steps S301 to S302, the video terminal device selectively uploads the target audio file to the production environment, the test environment or the development environment according to the storage identification information, so that the target audio file can be used in the appropriate project deployment environment during the development, testing and online stages, which increases operation and maintenance flexibility.
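The following minimal sketch (illustrative only, with hypothetical endpoint addresses) shows how the storage identification information could be used on the video terminal side to select the resource-server endpoint of the production, test or development environment before uploading.

```python
# Illustrative only: maps a storage identification tag to a hypothetical
# resource-server endpoint for each project deployment environment.
DEPLOY_ENVIRONMENTS = {
    "dev":  "https://iobs-dev.example.com",   # development: debugging tools enabled
    "test": "https://iobs-test.example.com",  # test: clone of the production config
    "prod": "https://iobs.example.com",       # production: external-facing service
}

def resolve_upload_endpoint(storage_identification: str) -> str:
    """Pick the resource-server endpoint matching the deployment environment."""
    try:
        return DEPLOY_ENVIRONMENTS[storage_identification]
    except KeyError:
        raise ValueError(f"unknown deployment environment: {storage_identification}")

print(resolve_upload_endpoint("test"))
```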
In step S104 of some embodiments of the present application, the resource attribute data is sent to the subtitle generating server based on the subtitle generating instruction, so that the subtitle generating server generates the video subtitle file according to the resource attribute data and the pre-trained audio recognition model. It should be emphasized that when the target audio file is stored in the resource server, the target audio file to be converted into the video subtitle file can be found there, so after obtaining the resource attribute data from the resource server the video terminal device can request the subtitle generating server to generate the video subtitle file based on that data.
In steps S105 to S106 of some embodiments of the present application, resource attribute data is obtained from a video terminal device, and the resource attribute data is parsed to obtain a resource location link, video identification information and audio identification information, and then a target audio file is downloaded from a resource server based on the resource location link. After receiving the resource attribute data, the caption generating server needs to analyze the resource attribute data to obtain a resource positioning link, video identification information and audio identification information in order to convert the target audio file into a video caption file. The resource positioning link refers to a download link or URL positioning information of the target audio file, and the subtitle generating server may download the target audio file from the resource server according to the resource positioning link after parsing the resource positioning link. It should be noted that, the subtitle generating server downloads the target audio file through the resource positioning link, and compared with a mode of transmitting the target audio file between the video terminal device and the subtitle generating server, the buffer consumption of the video terminal device can be reduced, so that the subtitle generating efficiency is further improved.
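As an illustrative sketch of steps S105 to S106 on the subtitle generating server side (not part of the patent), the fragment below assumes the resource attribute data is a JSON object whose field names are hypothetical; it splits the data into a resource positioning link, video identification information and audio identification information, and then downloads the target audio file over HTTP.

```python
# Illustrative sketch only; the attribute field names are assumptions.
import requests

def parse_resource_attributes(attrs: dict) -> tuple[str, dict, dict]:
    """Split resource attribute data into locator link, video info and audio info."""
    resource_link = attrs["url"]              # resource positioning link
    video_info = attrs.get("video_type", {})  # video identification information
    audio_info = attrs.get("audio_type", {})  # audio identification information
    return resource_link, video_info, audio_info

def download_target_audio(resource_link: str, save_path: str) -> str:
    """Download the target audio file directly from the resource server."""
    resp = requests.get(resource_link, timeout=60)
    resp.raise_for_status()
    with open(save_path, "wb") as f:
        f.write(resp.content)
    return save_path
```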
Referring to fig. 4, the video subtitle generating method further includes pre-training an audio recognition model, including, but not limited to, steps S401 to S405 described below, according to some embodiments of the present application, prior to step S107.
Step S401, a training audio set is obtained, wherein the training audio set comprises training audio and an audio conversion label of the training audio;
step S402, training audio is identified through an original identification model, and training identification data is obtained;
step S403, comparing the training identification data with the audio conversion label to obtain the identification accuracy;
step S404, when the recognition accuracy is lower than a preset accuracy threshold, updating the original recognition model based on the recognition accuracy;
and step S405, performing iterative training on the updated original recognition model based on the training audio and the audio conversion label until the recognition accuracy is not lower than the accuracy threshold value, and obtaining the audio recognition model.
In step S401 of some embodiments of the present application, a training audio set is obtained, where the training audio set includes training audio and audio conversion tags of the training audio. It should be noted that, the audio recognition model refers to an artificial intelligence model for converting audio into text, where the artificial intelligence model is pre-trained to have the capability of converting audio into text. The preset original recognition model is a preset artificial intelligent model which is not trained yet. It is noted that the training audio set includes training audio and audio conversion tags for the training audio, wherein the training audio includes a variety of speech audio, and each segment of speech audio is configured with an audio conversion tag for identifying the actual text content of the segment of speech audio. The method comprises the steps of inputting a plurality of training audios into a preset original recognition model for recognition, adjusting and correcting the original recognition model through audio conversion tags corresponding to the training audios one by one, gradually training the capability of the original recognition model for converting the audios into texts, and obtaining the audio recognition model after the pre-training is completed.
In step S402 of some embodiments of the present application, training audio is identified by the original identification model, so as to obtain training identification data. It should be noted that, the training recognition data refers to result data obtained after the original recognition model recognizes the training audio, that is, training recognition text obtained by converting the original recognition model into the training audio.
In step S403 of some embodiments of the present application, training identification data is compared with an audio conversion tag, so as to obtain an identification accuracy. It should be noted that, since the training recognition data refers to training recognition text obtained by converting the original recognition model for training audio, each piece of voice audio is configured with an audio conversion tag for identifying the actual text content of the piece of voice audio. Therefore, through the comparison between the audio conversion label and the training recognition data, the recognition accuracy can be obtained based on the comparison results of text coincidence degree, semantic coincidence degree and the like. It should be noted that the training recognition data is compared with the audio conversion label to obtain the recognition accuracy, and the purpose is to learn the capability of the original recognition model to convert the audio into the text at the current moment according to the recognition accuracy.
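By way of illustration only, step S403 could compute the recognition accuracy with a character-level similarity ratio, as in the sketch below; the patent does not fix a particular metric, so this choice is an assumption.

```python
# Illustrative sketch of step S403: one simple way to score how close the training
# recognition text is to the audio conversion label.
from difflib import SequenceMatcher

def recognition_accuracy(recognized_text: str, label_text: str) -> float:
    """Return a 0-1 similarity between the recognized text and its conversion label."""
    return SequenceMatcher(None, recognized_text, label_text).ratio()

print(recognition_accuracy("standard body contract", "standard body underwriting"))
```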
In step S404, when the recognition accuracy is lower than a preset accuracy threshold, the original recognition model is updated based on the recognition accuracy. It should be noted that the closer the training recognition data is to the audio conversion tag, the higher the recognition accuracy and the stronger the ability of the original recognition model to convert audio into text; conversely, the further the training recognition data is from the audio conversion tag, the lower the recognition accuracy, indicating a weaker conversion ability. When the recognition accuracy is lower than the preset accuracy threshold, the ability of the original recognition model to convert audio into text is still low, so the original recognition model needs to be updated by adjusting its model parameters in order to improve its recognition accuracy.
In step S405, the updated original recognition model is iteratively trained based on the training audio and the audio conversion labels until the recognition accuracy is no longer lower than the accuracy threshold, thereby obtaining the audio recognition model. In each round of iterative training, the model parameters are adjusted according to the trend of the recognition accuracy before the next round is carried out. For example, if the recognition accuracy of the current round is clearly higher than that of the previous round, the previous round's parameter adjustment helped to improve model performance, and the current round can continue adjusting the parameters in the same direction; if the recognition accuracy of the current round is not clearly higher than that of the previous round, the previous round's adjustment strategy may be flawed, or the model performance may already have converged. It should be noted that there are many ways of training the original recognition model based on the training audio and the audio conversion labels, which may include, but are not limited to, the specific embodiments set forth above.
Through the steps S401 to S405, the audio recognition model is pre-trained to train the capability of the audio recognition model to convert the audio into the text, so that the recognition accuracy of the video subtitle generating method of the present application can be further improved.
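The outline below is an illustrative sketch of the pre-training loop of steps S401 to S405; the OriginalRecognitionModel class is only a placeholder for a real ASR network, and the accuracy threshold, the similarity metric and the maximum number of training rounds are assumptions.

```python
# Illustrative outline of steps S401-S405 (not part of the patent).
from difflib import SequenceMatcher


class OriginalRecognitionModel:
    """Stand-in for the preset, not-yet-trained recognition model."""

    def recognize(self, audio) -> str:
        return ""   # a real model would run ASR inference here

    def update(self, training_set) -> None:
        pass        # a real model would adjust its parameters here


ACCURACY_THRESHOLD = 0.95  # assumed value of the preset accuracy threshold


def accuracy(hypothesis: str, label: str) -> float:
    """Similarity between the training recognition text and its conversion label."""
    return SequenceMatcher(None, hypothesis, label).ratio()


def pretrain(model: OriginalRecognitionModel, training_set, max_rounds: int = 100):
    """Iterate until the recognition accuracy is no longer below the threshold."""
    for _ in range(max_rounds):
        scores = [accuracy(model.recognize(audio), tag) for audio, tag in training_set]
        if sum(scores) / len(scores) >= ACCURACY_THRESHOLD:
            break                   # pre-training finished: this is the audio recognition model
        model.update(training_set)  # accuracy still too low: update and train another round
    return model
```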
In step S107 of some embodiments of the present application, the target audio file is subjected to audio recognition processing by the pre-trained audio recognition model based on the audio identification information, so as to obtain the target recognition text. The audio identification information refers to information identifying attributes of the target audio file; in some embodiments it includes audio timing information and audio format information. Since the subtitle generating server needs to convert the target audio file into the video subtitle file, in some embodiments it processes the target audio file based on the audio format information to convert the voice content into the corresponding target recognition text, and the audio timing information is used to mark timing for the target recognition text when the target audio file is converted into text subtitles, so that the target recognition text forms the correct timing correspondence with the voice content. The audio recognition model refers to an artificial intelligence model that converts audio into text, pre-trained so that it has this capability. The audio recognition model may use models from the audio conversion field, such as a connectionist temporal classification (CTC) model or a recurrent neural network transducer (RNN-T) model; such models may be referred to collectively as speech-to-text (ASR) models.
Referring to fig. 5, step S107 may include, but is not limited to, steps S501 to S502 described below, according to some embodiments of the present application.
Step S501, performing audio recognition processing on a target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a primary recognition text;
step S502, correcting the primary recognition text based on a preset business term specification to obtain a target recognition text.
In step S501 of some embodiments of the present application, based on the audio identification information, the audio recognition processing is performed on the target audio file through the pre-trained audio recognition model, so as to obtain the preliminary recognition text. It should be noted that, because the audio recognition model has the capability of converting audio into text after being pre-trained, the pre-trained audio recognition model is used for performing audio recognition processing on the target audio file, so that the preliminary recognition text can be obtained. It is emphasized that the audio identifying information refers to information for identifying the properties of the target audio file, and in some embodiments, the audio identifying information includes audio timing information and audio format information. It should be appreciated that the subtitle generating server needs to convert the target audio file into the video subtitle file, so in some embodiments, the subtitle generating server needs to process the target audio file based on the audio format information to convert the voice content therein into the corresponding target recognition text, and the audio timing information is the timing information marked for the target recognition text when the target audio file is converted into the text subtitle, so that the target recognition text can form a correct timing correspondence with the voice content.
In step S502 of some embodiments of the present application, the preliminary recognition text is corrected based on a preset business term specification to obtain the target recognition text. The preset business term specification refers to the common industry terms and standard expressions of a given business field. When the video subtitle generating method of the present application is applied in certain business fields, the speech provided by the speaker in the target video file may contain proper nouns or standard expressions of those fields. For example, in the insurance field, "serious illness risk" refers to insurance against serious illness, and "standard body underwriting" means that the insured person is healthy enough to apply for insurance on normal terms. In such cases, if the audio recognition processing is performed on the target audio file directly by the pre-trained audio recognition model, the resulting text is prone to recognition errors: "serious illness risk" may be recognized as a similar-sounding but incorrect phrase such as "medium illness risk", and "standard body underwriting" may be recognized as "standard body contract". Therefore, in order to further improve the subtitle accuracy of the video subtitle generating method, in some preferred embodiments of the present application the preliminary recognition text is corrected based on the preset business term specification and converted into text that complies with the business term specification, so as to obtain the target recognition text.
Through the embodiment shown in step S501 to step S502, the preliminary recognition text obtained by performing the audio recognition processing on the target audio file by the pre-trained audio recognition model is corrected by using the preset business term specification, and the target recognition text is formed. The subtitle accuracy of the video subtitle generating method can be further improved.
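As a minimal illustration of step S502 (not part of the patent), the sketch below corrects mis-recognized phrases using a hand-written dictionary of business terms; the dictionary entries are hypothetical examples, and a production system might instead rely on a domain lexicon or a language model.

```python
# Illustrative sketch of step S502; the correction entries are assumed examples.
TERM_CORRECTIONS = {
    # mis-recognized phrase    -> business-term-compliant phrase
    "standard body contract": "standard body underwriting",
    "medium illness risk": "serious illness risk",
}

def correct_business_terms(preliminary_text: str) -> str:
    """Replace mis-recognized phrases with the terms required by the business specification."""
    corrected = preliminary_text
    for wrong, right in TERM_CORRECTIONS.items():
        corrected = corrected.replace(wrong, right)
    return corrected

print(correct_business_terms("the medium illness risk covers major diseases"))
```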
Step S108, integrating the target identification text based on the video identification information to obtain the video subtitle file. It should be understood that the video identification information refers to information for identifying the attribute of the target video file, and in some embodiments, the video identification information includes video timing information and video format information. It should be understood that after the target recognition text is obtained, the video subtitle file corresponding to the target video file needs to be generated according to the target recognition text, so that the target recognition text can be integrated into the subtitle file corresponding to the target video file in time sequence and format, namely, the video subtitle file, through the video time sequence information and the video format information in the video identification information.
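By way of illustration only, the sketch below assembles timestamped recognition segments into an SRT-style subtitle file; the segment structure and the choice of the SRT format are assumptions, since the patent does not specify the subtitle file format.

```python
# Illustrative sketch of step S108: writing the target recognition text, together with
# its timing information, as a numbered, time-coded subtitle file.
def seconds_to_srt(ts: float) -> str:
    """Format a time in seconds as an SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(ts * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def build_subtitle_file(segments: list[dict], path: str) -> None:
    """Write recognized text segments as a subtitle file."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{seconds_to_srt(seg['start'])} --> {seconds_to_srt(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")  # blank line separates subtitle entries
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

build_subtitle_file(
    [{"start": 0.0, "end": 2.5, "text": "Welcome to this short video."}],
    "video_subtitle.srt",
)
```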
In step S109, the video terminal apparatus acquires a video subtitle file from the subtitle generating server. It should be noted that, after the video subtitle file is generated, the video subtitle file needs to be sent from the subtitle generating server to the video terminal device, and in some exemplary embodiments, the subtitle generating server may respond to a query request of the video terminal device and send the video subtitle file to the video terminal device, or may send the video subtitle file to the video terminal device immediately after the video subtitle file is generated. It should be understood that the manner in which the video subtitle file is transmitted from the subtitle generating server to the video terminal device, or the video terminal device obtains the video subtitle file from the subtitle generating server, may include, but is not limited to, the specific embodiments set forth above.
Referring to fig. 6, step S109 includes, but is not limited to, steps S601 to S602 described below, according to some embodiments of the present application.
Step S601, inquiring the generation condition of a video subtitle file from a subtitle generation server based on a preset time interval to obtain result feedback information;
step S602, based on the result feedback information, a video subtitle file is acquired from the subtitle generating server.
It should be made clear that the subtitle generating server sending the video subtitle file to the video terminal device is equivalent to the video terminal device acquiring the video subtitle file from the subtitle generating server.
In steps S601 to S602 of some embodiments of the present application, the subtitle generating server is queried about the generation status of the video subtitle file at a preset time interval to obtain result feedback information, and the video subtitle file is then obtained from the subtitle generating server based on the result feedback information. It should be emphasized that the subtitle generating server may send the video subtitle file in response to a query request from the video terminal device, or may send it to the video terminal device immediately after it is generated. In some more specific embodiments, the subtitle generating server and the application back end of the video terminal device communicate in a micro-service mode, and the application back end cannot notify the application front end that the associated video subtitle file has been generated; in such embodiments the application front end therefore needs to obtain the video subtitle file by starting a timed task.
In some exemplary embodiments, the video terminal device queries the subtitle generating server about the generation of the video subtitle file at the preset time interval, and the subtitle generating server forms result feedback information in response to the query request and sends it to the video terminal device. The result feedback information refers to feedback formed in response to the query request of the video terminal device. It may be positive feedback: for example, if the subtitle generating server finds the video subtitle file corresponding to the query request, it may send the video subtitle file to the video terminal device as the result feedback information. It may also be negative feedback: for example, if the subtitle generating server does not find a video subtitle file corresponding to the query request, it may determine the type of error that has occurred and send that to the video terminal device as the result feedback information. It should be understood that the feedback types corresponding to the result feedback information are various and may include, but are not limited to, the specific embodiments mentioned above.
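The following sketch illustrates the timed query of steps S601 to S602 from the video terminal side; the endpoint, the task identifier, the response fields and the 5-second interval are assumptions for illustration only.

```python
# Illustrative sketch of steps S601-S602: polling the subtitle generating server at a
# preset time interval; all names and fields below are hypothetical.
import time
import requests

SUBTITLE_SERVER = "https://subtitle-server.example.com"  # hypothetical address
POLL_INTERVAL_SECONDS = 5                                 # preset time interval

def wait_for_subtitle_file(task_id: str, max_polls: int = 60) -> dict:
    """Query generation status until the video subtitle file is ready or polling times out."""
    for _ in range(max_polls):
        resp = requests.get(f"{SUBTITLE_SERVER}/tasks/{task_id}", timeout=10)
        resp.raise_for_status()
        feedback = resp.json()                 # result feedback information
        if feedback.get("status") == "done":
            return feedback                    # positive feedback: subtitle file available
        if feedback.get("status") == "failed":
            return feedback                    # negative feedback: contains fault index data
        time.sleep(POLL_INTERVAL_SECONDS)
    raise TimeoutError("subtitle generation did not finish within the polling window")
```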
Referring to fig. 7, step S602 may include, but is not limited to, steps S701 to S702 described below, according to some embodiments of the present application.
Step S701, when the generation of the video subtitle file fails, fault index data is obtained based on result feedback information;
step S702, based on the fault index data, performs a fault feedback action.
In step S701 of some embodiments of the present application, when generation of the video subtitle file fails, fault index data is obtained based on the result feedback information. It should be noted that the fault index data is used to identify the reason why generation of the video subtitle file failed. It should be understood that generation may fail for various reasons in the embodiments of the present application, for example, failure to extract the target audio file, the storage identification information being null, the resource attribute data being null, or failure to upload the target audio file. In some exemplary embodiments of the present application, for each type of subtitle generation failure reason, the subtitle generating server can form corresponding fault index data (for example, an error code) and incorporate it into the result feedback information, so that the video terminal device learns synchronously which type of fault has occurred.
In step S702 of some embodiments of the present application, a fault feedback action is performed based on the fault index data. Because the fault index data identifies the reason why generation of the video subtitle file failed, the video terminal device can perform a fault feedback action based on the fault index data after obtaining it from the result feedback information. For different types of subtitle generation failure reasons, different types of fault feedback actions may be performed: for example, when extraction of the target audio file fails, the current error may be repaired by re-performing the audio extraction operation; for another example, when the resource attribute data is empty, the current error may be repaired by querying the resource server for the resource attribute data again. It should be understood that the fault feedback action may also be to report the error to the user and prompt the user with the current error type so as to help the user formulate a repair strategy, or to send an error log directly to operation and maintenance personnel to help them troubleshoot the problem. It should be appreciated that the types of fault feedback actions are numerous and may include, but are not limited to, the specific embodiments set forth above.
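The mapping from fault index data to fault feedback actions could be sketched as follows. The error codes and handler names below are illustrative assumptions, not values defined by this application; only the failure reasons themselves come from the description above.

```python
import logging

logger = logging.getLogger("subtitle_client")

# Hypothetical error codes for the failure reasons mentioned above.
ERR_AUDIO_EXTRACTION_FAILED = "E001"
ERR_STORAGE_ID_EMPTY = "E002"
ERR_RESOURCE_ATTR_EMPTY = "E003"
ERR_UPLOAD_FAILED = "E004"


def handle_fault(error_code: str) -> None:
    """Execute a fault feedback action matching the fault index data."""
    if error_code == ERR_AUDIO_EXTRACTION_FAILED:
        retry_audio_extraction()        # repair by re-performing the audio extraction operation
    elif error_code == ERR_RESOURCE_ATTR_EMPTY:
        refresh_resource_attributes()   # re-query the resource server for the attribute data
    elif error_code == ERR_UPLOAD_FAILED:
        retry_upload()                  # re-upload the target audio file
    else:
        # Unknown or user-facing error: surface it and log it for operation and maintenance staff.
        logger.error("Subtitle generation failed with code %s", error_code)
        notify_user(error_code)


# Placeholder recovery actions; real implementations would invoke the steps described earlier.
def retry_audio_extraction() -> None: ...
def refresh_resource_attributes() -> None: ...
def retry_upload() -> None: ...
def notify_user(error_code: str) -> None: ...
```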
Fig. 8 shows an electronic device 800 provided by an embodiment of the application. The electronic device 800 includes a processor 801, a memory 802, and a computer program stored on the memory 802 and executable on the processor 801; when executed, the computer program performs the video subtitle generating method described above.
The processor 801 and the memory 802 may be connected by a bus or other means.
The memory 802, as a non-transitory computer readable storage medium, may be used to store a non-transitory software program and a non-transitory computer executable program, such as the video subtitle generating method described in embodiments of the present application. The processor 801 implements the video subtitle generating method described above by running a non-transitory software program and instructions stored in the memory 802.
The memory 802 may include a storage program area and a storage data area. The storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data involved in the video subtitle generating method described above. Further, the memory 802 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 802 may optionally include memory located remotely from the processor 801, and such remote memory may be connected to the electronic device 800 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the video subtitle generating method described above are stored in the memory 802, and when executed by the one or more processors 801, the video subtitle generating method described above is performed, for example, method steps S101 to S109 in fig. 1, method steps S201 to S203 in fig. 2, method steps S301 to S302 in fig. 3, method steps S401 to S405 in fig. 4, method steps S501 to S502 in fig. 5, method steps S601 to S602 in fig. 6, and method steps S701 to S702 in fig. 7.
The embodiment of the application also provides a computer readable storage medium which stores computer executable instructions for executing the video subtitle generating method.
In an embodiment, the computer-readable storage medium stores computer-executable instructions that, when executed by one or more control processors, perform, for example, method steps S101 to S109 in fig. 1, method steps S201 to S203 in fig. 2, method steps S301 to S302 in fig. 3, method steps S401 to S405 in fig. 4, method steps S501 to S502 in fig. 5, method steps S601 to S602 in fig. 6, and method steps S701 to S702 in fig. 7.
The apparatus embodiments described above are merely illustrative, in which the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. It should also be appreciated that the various embodiments provided by the embodiments of the present application may be arbitrarily combined to achieve different technical effects.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit and scope of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A video subtitle generating method applied to a video terminal device, comprising:
acquiring a subtitle generation instruction and a target video file, and extracting audio based on the target video file to obtain a target audio file;
uploading the target audio file to a resource server so that the resource server forms resource attribute data;
when the target audio file is stored in the resource server, acquiring the resource attribute data from the resource server;
based on the caption generating instruction, the resource attribute data is sent to a caption generating server, so that the caption generating server generates a video caption file according to the resource attribute data and a pre-trained audio recognition model;
and acquiring the video subtitle file from the subtitle generating server.
2. The method of claim 1, wherein the target video file comprises a plurality of short video clips, the audio extraction based on the target video file results in a target audio file, comprising:
traversing the short video segments to obtain segment arrangement information corresponding to each short video segment;
performing audio extraction on the short video segments based on the segment arrangement information to obtain audio segments corresponding to the short video segments;
and integrating the plurality of audio clips to obtain the target audio file.
3. The method of claim 1, wherein uploading the target audio file to a resource server to cause the resource server to form resource attribute data comprises:
acquiring storage identification information corresponding to the target audio file, wherein the storage identification information is used for identifying a project deployment environment corresponding to the target audio file;
and uploading the target audio file to the resource server based on the storage identification information so that the resource server forms the resource attribute data.
4. A method according to any one of claims 1 to 3, wherein said obtaining the video subtitle file from the subtitle generating server comprises:
inquiring the generation condition of the video subtitle file from the subtitle generation server based on a preset time interval to obtain result feedback information;
and acquiring the video subtitle file from the subtitle generating server based on the result feedback information.
5. The method of claim 4, wherein the obtaining the video subtitle file from the subtitle generating server based on the result feedback information includes:
when the video subtitle file fails to be generated, fault index data are obtained based on the result feedback information;
and executing fault feedback action based on the fault index data.
6. A video subtitle generating method applied to a subtitle generating server, comprising:
acquiring resource attribute data from video terminal equipment, and analyzing and processing the resource attribute data to obtain a resource positioning link, video identification information and audio identification information;
downloading a target audio file from the resource server based on the resource location link;
performing audio recognition processing on the target audio file through a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text;
integrating the target identification text based on the video identification information to obtain a video subtitle file;
and sending the video subtitle file to the video terminal equipment.
7. The method of claim 6, wherein the performing audio recognition processing on the target audio file through the pre-trained audio recognition model based on the audio identification information to obtain target recognition text comprises:
based on the audio identification information, performing audio identification processing on the target audio file through a pre-trained audio identification model to obtain a primary identification text;
and correcting the primary recognition text based on a preset business term specification to obtain the target recognition text.
8. The method according to claim 6 or 7, wherein before performing an audio recognition process on the target audio file by a pre-trained audio recognition model based on the audio identification information to obtain a target recognition text, the method further comprises pre-training the audio recognition model, specifically comprising:
acquiring a training audio set, wherein the training audio set comprises training audio and an audio conversion label of the training audio;
Identifying the training audio through the original identification model to obtain training identification data;
comparing the training identification data with the audio conversion label to obtain identification accuracy;
when the identification accuracy is lower than a preset accuracy threshold, updating the original identification model based on the identification accuracy;
and carrying out iterative training on the updated original recognition model based on the training audio and the audio conversion label until the recognition accuracy is not lower than the accuracy threshold value, so as to obtain the audio recognition model.
9. An electronic device, comprising: a memory and a processor, the memory storing a computer program, wherein the processor implements the video subtitle generating method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium storing a computer program that is executed by a processor to implement the video subtitle generating method according to any one of claims 1 to 8.
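By way of a non-limiting illustration of the audio extraction and integration recited in claim 2, a minimal Python sketch is given below. The use of the ffmpeg command-line tool, the WAV intermediate format, and all file paths are assumptions made for illustration only; the claims do not prescribe any particular extraction tool.

```python
import subprocess
from pathlib import Path


def extract_and_merge_audio(short_clips: list[Path], out_audio: Path) -> Path:
    """Extract an audio segment from each short video clip in its arranged
    order, then integrate the segments into a single target audio file."""
    segment_paths: list[Path] = []
    # The clips are assumed to be passed already sorted according to their
    # segment arrangement information (e.g. clip index in the original video).
    for index, clip in enumerate(short_clips):
        segment = out_audio.with_name(f"segment_{index:03d}.wav")
        # Assumes ffmpeg is installed; -vn drops the video stream, keeping audio only.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(clip), "-vn", "-acodec", "pcm_s16le", str(segment)],
            check=True,
        )
        segment_paths.append(segment)

    # Build an ffmpeg concat list and merge the audio segments in order.
    concat_list = out_audio.with_suffix(".txt")
    concat_list.write_text("".join(f"file '{p}'\n" for p in segment_paths))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(concat_list), str(out_audio)],
        check=True,
    )
    return out_audio
```

In practice, the segment arrangement information of claim 2 would determine the order in which the clips are passed to this function, and the resulting target audio file would then be uploaded to the resource server as in claim 3.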
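Similarly, the server-side processing recited in claims 6 and 7 (parsing the resource attribute data, downloading the target audio file, recognizing it, correcting business terms, and integrating the result into a subtitle file) could be sketched as follows. The JSON field names, the SRT output layout, the term dictionary, and the recognition placeholder are illustrative assumptions, not details fixed by the claims.

```python
import json
import urllib.request

# Hypothetical business-term corrections applied to the primary recognition text (claim 7).
BUSINESS_TERMS = {"insurence": "insurance", "premiun": "premium"}


def generate_subtitle_file(resource_attribute_data: str) -> str:
    """Parse the resource attribute data, download the target audio file, run
    audio recognition, correct business terms, and build a video subtitle file."""
    attrs = json.loads(resource_attribute_data)
    resource_url = attrs["resource_url"]   # resource positioning link
    video_id = attrs["video_id"]           # video identification information
    audio_id = attrs["audio_id"]           # audio identification information

    audio_path = f"/tmp/{audio_id}.wav"
    urllib.request.urlretrieve(resource_url, audio_path)  # download from the resource server

    segments = recognize_audio(audio_path)  # pre-trained audio recognition model (placeholder)

    lines = []
    for i, (start, end, text) in enumerate(segments, start=1):
        for wrong, right in BUSINESS_TERMS.items():        # correct the primary recognition text
            text = text.replace(wrong, right)
        lines.append(f"{i}\n{start} --> {end}\n{text}\n")  # SRT-style subtitle entry

    subtitle_path = f"/tmp/{video_id}.srt"
    with open(subtitle_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return subtitle_path


def recognize_audio(path: str) -> list[tuple[str, str, str]]:
    """Placeholder for the pre-trained audio recognition model; returns
    (start timestamp, end timestamp, recognized text) triples."""
    return [("00:00:00,000", "00:00:02,500", "example recognized sentence")]
```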
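Finally, the pre-training procedure of claim 8 amounts to an accuracy-gated iterative loop; a minimal sketch under assumed model and update interfaces is shown below. It is not the application's actual training method, and the `recognize` and `update` calls are placeholders.

```python
def train_audio_recognition_model(model, training_audio, audio_labels,
                                  accuracy_threshold: float = 0.95,
                                  max_rounds: int = 100):
    """Iteratively update the original recognition model until its recognition
    accuracy on the training audio is not lower than the accuracy threshold."""
    for _ in range(max_rounds):
        # Training identification data: recognize each training audio sample.
        predictions = [model.recognize(audio) for audio in training_audio]
        # Compare with the audio conversion labels to obtain the recognition accuracy.
        correct = sum(1 for pred, label in zip(predictions, audio_labels) if pred == label)
        accuracy = correct / len(audio_labels)
        if accuracy >= accuracy_threshold:
            break                                   # accuracy no longer below threshold; stop
        model.update(training_audio, audio_labels)  # placeholder parameter update step
    return model
```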
CN202310829756.6A 2023-07-07 2023-07-07 Video subtitle generating method, electronic equipment and storage medium Pending CN116996632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310829756.6A CN116996632A (en) 2023-07-07 2023-07-07 Video subtitle generating method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310829756.6A CN116996632A (en) 2023-07-07 2023-07-07 Video subtitle generating method, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116996632A true CN116996632A (en) 2023-11-03

Family

ID=88527538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310829756.6A Pending CN116996632A (en) 2023-07-07 2023-07-07 Video subtitle generating method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116996632A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117440116A (en) * 2023-12-11 2024-01-23 深圳麦风科技有限公司 Video generation method, device, terminal equipment and readable storage medium
CN117440116B (en) * 2023-12-11 2024-03-22 深圳麦风科技有限公司 Video generation method, device, terminal equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN109086199B (en) Method, terminal and storage medium for automatically generating test script
US11676576B2 (en) Organizational-based language model generation
US20210357587A1 (en) An intelligent response method and device
CN103198828A (en) Method and system of construction of voice corpus
CN107544271A (en) Terminal control method, device and computer-readable recording medium
CN116996632A (en) Video subtitle generating method, electronic equipment and storage medium
CN111261151A (en) Voice processing method and device, electronic equipment and storage medium
CN109766422A (en) Information processing method, apparatus and system, storage medium, terminal
CN114723646A (en) Image data generation method with label, device, storage medium and electronic equipment
CN112672086B (en) Audio and video equipment data acquisition, analysis, early warning system
CN113889092A (en) Training method, processing method and device of post-processing model of voice recognition result
US20220215839A1 (en) Method for determining voice response speed, related device and computer program product
CN114297409A (en) Model training method, information extraction method and device, electronic device and medium
JP2010198111A (en) Metadata extraction server, metadata extraction method and program
CN113849415A (en) Control testing method and device, storage medium and electronic equipment
CN115509485A (en) Filling-in method and device of business form, electronic equipment and storage medium
CN111161707B (en) Method for automatically supplementing quality inspection keyword list, electronic equipment and storage medium
CN115620710A (en) Speech recognition method, speech recognition device, storage medium and electronic device
CN113962213A (en) Multi-turn dialog generation method, terminal and computer readable storage medium
CN105763944B (en) Module TV and its screen end upgrade method
JP6944920B2 (en) Smart interactive processing methods, equipment, equipment and computer storage media
CN106708806B (en) Sample confirmation method, device and system
US20240028606A1 (en) Data catalog and retrieval system
CN110737447B (en) Application updating method and device
CN117744630B (en) Model access method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination