WO2023005193A1

WO2023005193A1 - Subtitle display method and device

Info

Publication number: WO2023005193A1
Application number: PCT/CN2022/076656
Authority: WO
Inventors: 卢家辉
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2021-07-30
Filing date: 2022-02-17
Publication date: 2023-02-02
Also published as: CN113781988A

Abstract

A subtitle display method and device, an electronic apparatus, a computer-readable storage medium and a computer program product, the method comprising: receiving audio content (S21); in response to a subtitle addition operation, performing recognition on the audio content, thus obtaining text content (S22); in response to a melody recognition operation, performing recognition on melody information in the audio content, thus obtaining melody content (S23); and, on the basis of the text content and melody content, generating subtitles and displaying same on a display interface (S24).

Description

Subtitle display method and device

Cross References to Related Applications

This application is based on a Chinese patent application with application number 202110876235.7 and a filing date of July 30, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

Technical Field

The present disclosure relates to the field of computers, and in particular to a subtitle display method, device, electronic equipment, and computer-readable storage medium.

Background

At present, in related technologies, the STT (Speech To Text, i.e., speech recognition) subtitle function is very popular among users. STT subtitles allow users to easily generate subtitle content from audio content. Such subtitles help video works spread widely on the Internet and make it easier for viewers to understand the video creator's content and the textual information of the audio in the video. However, STT subtitles produced by speech recognition often reflect only a single aspect of the audio content.

Summary

The disclosure provides a subtitle display method, device, electronic equipment, and computer-readable storage medium.

According to a first aspect of embodiments of the present disclosure, there is provided a method for displaying subtitles, including: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on a display interface.

In some embodiments, generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters, and recording time information of each independent character in the audio content. Recognizing the melody information of the audio content to obtain the melody content includes: for each independent character, selecting, based on the time information of that character in the audio content, the part of the audio content corresponding to the time information, and recognizing its melody information to obtain the independent melody content corresponding to that character, wherein the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content. Subtitles are then generated based on the independent characters and their corresponding independent melody contents and displayed on the display interface.

In some embodiments, when the time information includes a start time point and a duration of each independent character in the audio content, selecting and recognizing, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to the time information to obtain the independent melody content corresponding to each independent character includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and the duration; processing the part of the audio content to obtain a spectral distribution of the part of the audio content; and obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character.

In some embodiments, obtaining the independent melody content corresponding to each independent character based on the spectral distribution includes: when the audio content is music and the independent melody content is a music melody, determining the highest frequency in the spectral distribution as the main frequency of each independent character; and converting the main frequency into music text information, wherein the music text information represents the music melody of each independent character.

In some embodiments, the music text information includes at least one of the following: numbered musical notation in digital form, and staff notation in symbolic form.

In some embodiments, displaying the subtitles on the display interface includes: displaying the melody content above or below the text content.

According to a second aspect of embodiments of the present disclosure, there is provided a method for displaying subtitles, including: playing a video on a display interface, wherein the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.

According to a third aspect of embodiments of the present disclosure, there is provided a subtitle display device, including: a first receiving module, configured to receive audio content; a first identification module, configured to recognize the audio content in response to a subtitle adding operation to obtain text content; a second identification module, configured to recognize melody information of the audio content in response to a melody recognition operation to obtain melody content; and a processing module, configured to generate subtitles based on the text content and the melody content and display them on a display interface.

In some embodiments, the processing module includes a splitting unit and a first processing unit, wherein the splitting unit is configured to split the text content into independent characters and record time information of each independent character in the audio content; the second identification module is further configured to select, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and recognize its melody information to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content; and the first processing unit is configured to generate subtitles based on the independent characters and their corresponding independent melody contents and display them on the display interface.

In some embodiments, the second identification module includes: a selection unit, configured to, when the time information includes a start time point and a duration of each independent character in the audio content, select, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and the duration; a second processing unit, configured to process the part of the audio content to obtain a spectral distribution of the part of the audio content; and a third processing unit, configured to obtain, based on the spectral distribution, the independent melody content corresponding to each independent character.

In some embodiments, the third processing unit includes: a determining subunit, configured to determine the highest frequency in the spectral distribution as the main frequency of each independent character; and a conversion subunit, configured to convert the main frequency into music text information, wherein the music text information represents the music melody of each independent character.

In some embodiments, the processing module includes: a display unit, configured to display the melody content above or below the text content.

According to a fourth aspect of embodiments of the present disclosure, there is provided a subtitle display device, including: a playing module, configured to play a video on a display interface, wherein the video includes audio content; a second receiving module, configured to receive a subtitle display instruction; and a display module, configured to display subtitles on the display interface in response to the subtitle display instruction, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.

According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement any one of the subtitle display methods described above.

According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform any one of the subtitle display methods described above.

According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product including a computer program, wherein when the computer program is executed by a processor, any one of the subtitle display methods described above is implemented.

In response to the subtitle adding operation and the melody recognition operation, the audio content is recognized to obtain the text content and the melody content, subtitles are generated based on the text content and the melody content, and the subtitles are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text content cannot reflect. This reduces the loss of audio content as much as possible, reflects the audio content more completely, and avoids the problem that subtitles generated from audio reflect only a single aspect of the audio content.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

The accompanying drawings here are incorporated into the specification and constitute a part of the specification, show embodiments consistent with the disclosure, and are used together with the description to explain the principle of the disclosure, and do not constitute an improper limitation of the disclosure.

Fig. 1 is a block diagram showing a hardware structure of a computer terminal for implementing a subtitle display method according to an exemplary embodiment.

Fig. 2 is a flowchart of a first subtitle display method according to an exemplary embodiment.

Fig. 3 is a flowchart of a second subtitle display method according to an exemplary embodiment.

Fig. 4 is a flowchart of a subtitle display method according to an embodiment of the present disclosure.

Fig. 5 is a block diagram of a first subtitle display device according to an exemplary embodiment.

Fig. 6 is a block diagram of a second subtitle display device according to an exemplary embodiment.

Fig. 7 is a device block diagram of a terminal according to an exemplary embodiment.

Fig. 8 is a structural block diagram of a server according to an exemplary embodiment.

Detailed Description

In order to enable ordinary persons in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings.

It should be noted that the terms "first" and "second" in the specification and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present disclosure as recited in the appended claims.

First, some terms that appear in the description of the embodiments of the present application are explained as follows:

STT subtitles: STT is the abbreviation of Speech To Text, that is, "from speech to text". In mobile video editing software, speech recognition technology is used to convert the audio input by the user into text, and then convert the text into subtitles and embed them in the video, which is called STT subtitles.

FFT: FFT is the abbreviation of Fast Fourier Transform. The FFT is a method for quickly computing the discrete Fourier transform (DFT, Discrete Fourier Transform) of a sequence, or its inverse. Fourier analysis converts a signal from its original domain (usually time or space) to a representation in the frequency domain, or vice versa. The FFT computes such transforms quickly by factorizing the DFT matrix into a product of sparse (mostly zero) factors. It thereby reduces the complexity of computing the DFT from O(n^2), which is required when applying the DFT definition directly, to O(n log n), where n is the data size.
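The complexity difference described above can be illustrated with a minimal sketch that places a direct evaluation of the DFT definition next to a recursive radix-2 FFT. The function names `dft` and `fft` are illustrative and do not come from the disclosure, and the radix-2 variant shown here is only one of several FFT algorithms; it assumes the input length is a power of two.

```python
import cmath


def dft(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT, O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

Both functions return the same coefficients up to floating-point rounding; only the amount of work differs.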

Numbered musical notation: Numbered musical notation is a notation that uses digits to represent the melody of music. It is based on movable-do solfege, with 1, 2, 3, 4, 5, 6, and 7 representing the seven basic degrees of the scale, pronounced do, re, mi, fa, sol, la, ti (si in Chinese usage); in the key of C these correspond to the English note names C, D, E, F, G, A, B. A rest is represented by 0. Each plain digit has the time value of a quarter note in staff notation.
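As a small illustration of the digit scheme described above, the mapping can be written as a lookup table. The names `NUMBERED_NOTATION` and `describe` are hypothetical, and the letter names are given for the key of C only, which is an assumption of this sketch.

```python
# Scale-degree digit -> (solfege syllable, letter name in the key of C).
# The digit 0 denotes a rest and has no pitch.
NUMBERED_NOTATION = {
    "0": ("rest", None),
    "1": ("do", "C"),
    "2": ("re", "D"),
    "3": ("mi", "E"),
    "4": ("fa", "F"),
    "5": ("sol", "G"),
    "6": ("la", "A"),
    "7": ("ti", "B"),
}


def describe(digits):
    """Return the solfege syllables for a string of numbered-notation digits."""
    return [NUMBERED_NOTATION[d][0] for d in digits]
```

For example, `describe("1155665")` spells out the opening of a well-known nursery tune as do, do, sol, sol, la, la, sol.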

Music spectrum analysis: Music spectrum analysis is a commonly used technique. Its principle is as follows: according to Fourier analysis, any sound can be decomposed into a number of (possibly infinitely many) sine waves, which often include many harmonic components. Using the FFT (Fast Fourier Transform), a digital signal can be converted from the time domain to the frequency domain to obtain the spectral characteristics of the music.

According to an embodiment of the present disclosure, a method embodiment of a subtitle display method is proposed. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.

The method embodiments provided by the embodiments of the present disclosure may be executed in mobile terminals, computer terminals, or similar computing devices. Fig. 1 is a block diagram showing a hardware structure of a computer terminal (or mobile device) for implementing a subtitle display method according to an exemplary embodiment. As shown in Fig. 1, the computer terminal 10 (or mobile device) may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; the processors 102 may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. Those of ordinary skill in the art can understand that the structure shown in Fig. 1 is only schematic and does not limit the structure of the above-mentioned electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". The data processing circuit may be implemented in whole or in part as software, hardware, firmware or other arbitrary combinations. In addition, the data processing circuit can be a single independent processing module, or be fully or partially integrated into any of the other elements in the computer terminal 10 (or mobile device). As involved in the embodiments of the present disclosure, the data processing circuit serves as a processor control (for example, the selection of the variable resistor terminal path connected to the interface).

The memory 104 can be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the subtitle display method in the embodiments of the present disclosure. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, thereby implementing the subtitle display method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device is used to receive or transmit data via a network. Examples of such a network may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device may be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.

The display may be, for example, a touchscreen liquid crystal display (LCD), which may enable a user to interact with the user interface of the computer terminal 10 (or mobile device).

It should be noted here that, in some embodiments, the computer device (or mobile device) shown in Fig. 1 may include hardware components (including circuits), software components (including computer code stored on computer-readable media), or a combination of both hardware and software components. It should be noted that Fig. 1 is only one specific example, and is intended to illustrate the types of components that may be present in a computer device (or mobile device) as described above.

Under the above operating environment, the present disclosure provides a subtitle display method as shown in Fig. 2. Fig. 2 is a flowchart of a first subtitle display method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the above-mentioned computer terminal and includes the following steps S21 to S24.

In step S21, audio content is received;

In step S22, in response to the subtitle adding operation, the audio content is identified to obtain the text content;

In step S23, in response to the melody recognition operation, the melody information of the audio content is identified to obtain the melody content;

In step S24, based on the text content and the melody content, subtitles are generated and displayed on the display interface.
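Steps S21 to S24 can be sketched as one small function. The recognizers are passed in as callables because the disclosure does not name a specific speech or melody recognition engine; `Subtitle`, `build_subtitle`, and the two boolean flags (standing in for the subtitle adding operation and the melody recognition operation) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtitle:
    text: str    # text content obtained in S22
    melody: str  # melody content obtained in S23


def build_subtitle(
    audio: bytes,
    recognize_text: Callable[[bytes], str],
    recognize_melody: Callable[[bytes], str],
    subtitle_requested: bool = True,
    melody_requested: bool = True,
) -> Subtitle:
    """S21: receive audio; S22: recognize text; S23: recognize melody; S24: combine."""
    text = recognize_text(audio) if subtitle_requested else ""
    melody = recognize_melody(audio) if melody_requested else ""
    return Subtitle(text=text, melody=melody)
```

Displaying the returned `Subtitle` on the display interface is left to the caller, since the rendering layer depends on the terminal.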

With the above processing, in response to the subtitle adding operation and the melody recognition operation, the audio content is recognized to obtain the text content and the melody content, subtitles are generated based on the text content and the melody content, and the subtitles are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text content cannot reflect. This reduces the loss of audio content as much as possible, reflects the audio content more completely, and avoids the problem that subtitles generated from audio reflect only a single aspect of the audio content.

In some embodiments, audio content is received. The audio content can be of various types, for example a recording, a song, or audio from a video. The audio content can also be in various formats, for example MP3 (MPEG Audio Layer III) format or WMA (Windows Media Audio) format.

In some embodiments, in response to the subtitle adding operation, the audio content is recognized to obtain the text content. The subtitle adding operation may be an operation on a predetermined control, or may be configured by default in the system; for example, the operation may be triggered automatically upon receiving audio content. It can therefore be set flexibly according to the needs of different scenarios. The audio content can be recognized in various ways, for example by using any of various intelligent voice processing software packages. In addition, the audio content being recognized may be real-time or non-real-time audio content, depending on requirements.

In some embodiments, in response to the melody recognition operation, the melody information of the audio content is recognized to obtain the melody content. The melody recognition operation can be an operation on a melody selection control, or it can be configured by default in the system. For example, the melody recognition operation and the above subtitle adding operation can be unified into one operation, that is, the melody recognition function is triggered in response to receiving the subtitle adding operation, thereby simplifying the operation process and avoiding a second operation. The melody information of the audio content can be of multiple types. For example, various kinds of melody information can be analyzed from the fundamental frequency and pitch, harmonics and timbre, amplitude and sound intensity, and sound width and frequency band of the audio content. For example, when the audio content is a song, the melody of the song can be determined from the frequency of the audio content, and then staff notation can be generated automatically, or the corresponding numbered notation can be displayed on each subtitle of the song.

In some embodiments, subtitles are generated based on the text content and the melody content and displayed on the display interface. That is, the melody content can be carried in the subtitles. For example, if the audio content is music, the melody of the music can be displayed on the subtitles when they are shown. The melody can be expressed in multiple ways, for example using 1, 2, 3, 4, 5, 6, 7 to represent the seven basic degrees of the scale, or using C, D, E, F, G, A, B. When the audio content is music, the same word in the lyrics may be sung to different melodies in different pieces. For example, the word "Ah" appears in many pieces of music; although the word is the same, it is sung to different melodies, and the melody of the word can be marked above the word "Ah". In addition, many pieces played on musical instruments have no lyrics but do have melodies; in that case the melody information in the subtitles conveys the audio content, so that the subtitles reflect the audio content more completely.

In some embodiments, subtitles are displayed on the display interface, where displaying the subtitles includes displaying the melody content above or below the text content. The subtitles can thus present more information about the audio content, which enriches the user's perception and improves the user experience. When the user is an editor of audio content, the user can easily generate STT subtitles that carry melody information (for example, numbered musical notation) for music, which makes the subtitles more enjoyable to watch and more expressive, and improves both users' enthusiasm for editing videos and the quality of works related to audio content.
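Placing the melody content above or below the text content can be sketched as a two-line text layout. The function name `render_subtitle` is hypothetical, and the sketch assumes a monospaced display in which every character occupies one column; full-width characters (for example Chinese lyrics) would need a width-aware layout instead.

```python
def render_subtitle(units, notes, melody_above=True):
    """Lay out one note per text unit, on a line above or below the text."""
    if len(units) != len(notes):
        raise ValueError("each text unit needs exactly one note")
    # Pad every column to the widest entry so notes line up with their text.
    width = max((len(s) for s in list(units) + list(notes)), default=0)
    text_line = " ".join(u.ljust(width) for u in units).rstrip()
    melody_line = " ".join(n.ljust(width) for n in notes).rstrip()
    lines = [melody_line, text_line] if melody_above else [text_line, melody_line]
    return "\n".join(lines)
```

Calling `render_subtitle(["Twin", "kle", "twin", "kle"], ["1", "1", "5", "5"])` yields a melody line whose digits start in the same columns as the syllables beneath them.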

In some embodiments, subtitles can be generated based on the text content and the melody content and displayed on the display interface in various ways. For example, the text content is split into independent characters, and the time information of each independent character in the audio content is recorded. Recognizing the melody information of the audio content to obtain the melody content then includes: for each independent character, selecting, based on the time information of that character in the audio content, the part of the audio content corresponding to the time information and recognizing its melody information to obtain the independent melody content corresponding to that character, wherein the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content. Subtitles are generated based on each independent character and its corresponding independent melody content and displayed on the display interface. For example, in a song, each recognized character can be regarded as an independent character, and the time information of that character in the audio content can be recorded. Then, according to the time information of each independent character, the melody information of the corresponding part of the audio content is recognized to obtain the independent melody content for that character, and the independent melody contents of all characters in the text content together constitute the melody of the whole song. Because the melody is matched to the characters of the text content one by one, that is, the independent melody content corresponding to each independent character is obtained accurately, the melody content expressed by the entire audio content can be derived more clearly from the independent melody contents, so that the audio content is displayed more comprehensively.
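The per-character bookkeeping described above can be sketched with a small record type. `TimedChar`, `split_text`, and `assemble_melody` are hypothetical names; the timings are assumed to come from the speech recognizer, and `recognize_segment` stands in for whatever per-segment melody recognition is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedChar:
    char: str
    start: float     # seconds from the beginning of the audio content
    duration: float  # seconds


def split_text(text, timings):
    """Pair each non-space character with its (start, duration) tuple."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) != len(timings):
        raise ValueError("one (start, duration) pair is needed per character")
    return [TimedChar(c, s, d) for c, (s, d) in zip(chars, timings)]


def assemble_melody(timed_chars, recognize_segment):
    """Collect the independent melody content of every character, in order."""
    return [recognize_segment(tc.start, tc.duration) for tc in timed_chars]
```

The list returned by `assemble_melody` plays the role of the melody content corresponding to the text content, with one entry per independent character.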

In some embodiments, when the time information includes the start time point and the duration of each independent character in the audio content, the melody information of the part of the audio content corresponding to the time information can be selected and recognized, to obtain the independent melody content corresponding to each independent character, in the following manner: based on the start time point and the duration of each independent character in the audio content, select the part of the audio content corresponding to the start time point and the duration; process the part of the audio content to obtain its spectral distribution; and, based on the spectral distribution, obtain the independent melody content corresponding to each independent character. The start time and the duration of each character may be recorded in seconds or in smaller time units. Because the selected part of the audio content is exactly the part delimited by the start time point and the duration of the independent character, the independent melody content obtained from it corresponds to that character. The part of the audio content can be processed to obtain its spectral distribution in various ways; for example, a fast Fourier transform can be performed on the part of the audio content. Specifically, the following operations can be used: first, the audio signal corresponding to each independent character is determined, for example by taking the start time and the duration of each independent character in the text content and extracting the audio signal of that time period from the original audio file; this audio signal is then used as the input of the fast Fourier transform algorithm, which yields the spectral distribution of the original audio file in that time period. Afterwards, the independent melody content corresponding to each independent character is obtained from the spectral distribution.
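The segment selection and spectrum computation can be sketched as follows. The sketch assumes the audio has already been decoded into a list of mono samples at a known sample rate, which the disclosure does not specify, and it evaluates the DFT definition directly instead of using an FFT so that the example stays self-contained; the resulting spectrum is the same, only slower to compute. `select_segment` and `magnitude_spectrum` are hypothetical names.

```python
import cmath


def select_segment(samples, sample_rate, start, duration):
    """Return the samples covering [start, start + duration) in seconds."""
    first = int(round(start * sample_rate))
    last = int(round((start + duration) * sample_rate))
    return samples[first:last]


def magnitude_spectrum(segment, sample_rate):
    """Return (frequency_hz, magnitude) pairs for the bins from 0 Hz to Nyquist."""
    n = len(segment)
    if n == 0:
        return []
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            segment[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append((k * sample_rate / n, abs(coeff)))
    return spectrum
```

The frequency resolution is `sample_rate / len(segment)`, so very short characters give a coarse spectrum; a production implementation would likely window or zero-pad the segment, which is outside what the text describes.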

In some embodiments, the independent melody content corresponding to each independent character can be obtained from the spectral distribution in various ways. For example, when the audio content is music and the independent melody content is a music melody, the highest frequency in the spectral distribution is determined as the main frequency of each independent character, and the main frequency is converted into music text information, wherein the music text information represents the music melody of each independent character. The music text information can take various forms, for example numbered musical notation in digital form or staff notation in symbolic form. By using the strongest frequency in the spectral distribution as the main frequency at the time point corresponding to each independent character, the audio feature information of the audio content can be reflected more accurately than with other representations.
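One possible conversion from a spectral distribution to a numbered-notation digit is sketched below. Several points are assumptions of this sketch, not statements of the disclosure: the text speaks of both the "highest" and the "strongest" frequency, and the largest-magnitude bin is used here; the tonic defaults to middle C (about 261.63 Hz) in twelve-tone equal temperament; octave marks are ignored; and pitches outside the major scale are reported as "?". The names `main_frequency` and `to_numbered_notation` are hypothetical.

```python
import math

# Semitone offset above the tonic -> degree digit of the major scale.
MAJOR_DEGREES = {0: "1", 2: "2", 4: "3", 5: "4", 7: "5", 9: "6", 11: "7"}


def main_frequency(spectrum):
    """Frequency of the largest-magnitude bin, ignoring the 0 Hz bin."""
    candidates = [(f, m) for f, m in spectrum if f > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])[0]


def to_numbered_notation(freq_hz, tonic_hz=261.63):
    """Map a frequency to a degree digit relative to an assumed tonic."""
    if freq_hz is None or freq_hz <= 0:
        return "0"  # treat missing pitch as a rest
    semitones = round(12 * math.log2(freq_hz / tonic_hz))
    return MAJOR_DEGREES.get(semitones % 12, "?")
```

With the default tonic, a main frequency of 440 Hz lies nine semitones above middle C and therefore maps to the digit 6 (la).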

Fig. 3 is a flowchart of a second subtitle display method according to an exemplary embodiment. As shown in Fig. 3, the method is used in the above-mentioned computer terminal and includes the following steps S31 to S33.

In step S31, a video is played on the display interface, wherein the video includes audio content;

In step S32, a subtitle display instruction is received;

In step S33, in response to the subtitle display instruction, subtitles are displayed on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.

With the above processing, a video including audio content is played on the display interface, a subtitle display instruction is received and responded to, and subtitles generated based on the text content and the melody content are displayed on the display interface. Because both the text content and the melody content are derived from the audio content, the subtitles display not only the text content of the audio but also the melody content that the text content cannot reflect. This reduces the loss of audio content as much as possible and reflects the audio content more completely, avoiding the problem that subtitles displayed based on audio reflect only a single aspect of the audio content.
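
Steps S31 to S33 can be pictured with the toy sketch below. It is an assumption-laden illustration, not the disclosed player: subtitles are assumed to be a list of per-character records with `text`, `melody`, `start_time`, and `duration` keys (only the last three names appear in this document, and `text` is hypothetical), choosing what to show by playback time is one possible design, and `SubtitlePlayer` and `subtitles_at` are invented names.

```python
def subtitles_at(elements, playback_time):
    """Return the elements whose time span contains playback_time (seconds)."""
    return [
        element
        for element in elements
        if element["start_time"] <= playback_time < element["start_time"] + element["duration"]
    ]


class SubtitlePlayer:
    """Toy stand-in for steps S31 to S33; it tracks state and does not render video."""

    def __init__(self, elements):
        self.elements = elements
        self.subtitles_enabled = False

    def on_subtitle_display_instruction(self):
        # S32/S33: receiving the instruction switches subtitle display on.
        self.subtitles_enabled = True

    def frame_subtitles(self, playback_time):
        # During playback (S31), return the (text, melody) pairs to draw for this frame.
        if not self.subtitles_enabled:
            return []
        return [(e["text"], e["melody"]) for e in subtitles_at(self.elements, playback_time)]


# Usage with two illustrative records.
player = SubtitlePlayer([
    {"text": "la", "melody": "1", "start_time": 0.0, "duration": 0.5},
    {"text": "li", "melody": "5", "start_time": 0.5, "duration": 0.5},
])
before = player.frame_subtitles(0.6)
player.on_subtitle_display_instruction()
after = player.frame_subtitles(0.6)
```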

In related technologies, the STT subtitles that mobile video editing software produces through speech recognition do not express as much information as the audio content: information in the audio content other than text is lost during speech recognition. For example, when a user adds a piece of song content, STT subtitles can only express the text content, and information such as the musical melody cannot be expressed. Yet this melody information is itself part of the information content of the audio. Based on this, an embodiment of the present disclosure provides a subtitle display method in which the musical melody information of the audio content is expressed in the subtitles while the STT subtitles are generated.

For example, mobile video editing software identifies the content sung by the user and adds musical notation to the displayed subtitles. This method uses a spectrum recognition algorithm to add the melody in the audio to the STT subtitles in the form of musical notation, which makes the STT subtitles more expressive and more interesting and can improve the spread of video works.

FIG. 4 is a flow chart of a subtitle display method according to an embodiment of the present disclosure. As shown in FIG. 4, taking as an example a scenario in which a user performs video editing on audio content, the method is described in detail as follows:

1) The user imports a piece of audio content using mobile video editing software.

2) When the user chooses to add STT subtitles, ask the user whether to recognize the audio melody and add it to the subtitles. If the user chooses not to use this function, just add STT subtitles directly. If the user selects this function, then proceed to step 3).

3) Through speech recognition technology, the audio content is recognized as text. During recognition, the start time and duration (in seconds) of each character in the audio must be recorded. The text information, the start time of each character, and the duration of each character are saved as JSON text in the following form:

It should be noted that in this JSON, each recognized character is an element of the array, and the start time (start_time) and duration (duration) of each character are also recorded in the element. The melody field represents the melody at the time point of the character; it is computed in the following step.
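
A structure consistent with this description can be sketched as below. Only `start_time`, `duration`, and `melody` are named in the text; the `text` key, the sample values, and the helper `build_stt_json` are assumptions made for illustration and should not be read as the saved form used in the disclosure.

```python
import json


def build_stt_json(recognized):
    """Serialize (character, start_time, duration) triples as a JSON array."""
    elements = [
        {
            "text": character,         # hypothetical key name for the recognized character
            "start_time": start_time,  # seconds from the start of the audio
            "duration": duration,      # seconds
            "melody": "",              # filled in later from the spectrum
        }
        for character, start_time, duration in recognized
    ]
    return json.dumps(elements, ensure_ascii=False, indent=2)


# Illustrative values only; they do not come from a real recognition run.
stt_json = build_stt_json([("la", 0.0, 0.5), ("la", 0.5, 0.5), ("la", 1.0, 1.0)])
```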

4) Traverse the array at the JSON root node and take out the start time and duration of each element (each character) in the array. According to the start time and duration, obtain all the sound signals in that time period from the original audio file and use them as the input of the FFT algorithm, which identifies the frequency spectrum distribution of the original audio file during that period. The strongest frequency in the spectrum distribution of that time period is then taken as the main frequency at that time point, and the main frequency is recorded in the JSON, in the form of numbered musical notation, in the melody field. After step 4), the content of the JSON becomes as follows:
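
The traversal in step 4) can be sketched as a loop over the root array. This sketch assumes the per-character JSON layout described above, with `start_time`, `duration`, and `melody` keys; the function name `annotate_melody` is hypothetical, and the spectrum analysis is left to a caller-supplied `analyze` function so that the example stays focused on the data flow.

```python
import json


def annotate_melody(stt_json, samples, sample_rate, analyze):
    """Fill the melody field of every element of the root JSON array.

    analyze is a caller-supplied function (hypothetical here) that maps a list
    of samples to a notation string, e.g. FFT peak picking plus conversion.
    """
    elements = json.loads(stt_json)
    for element in elements:
        first = int(round(element["start_time"] * sample_rate))
        last = int(round((element["start_time"] + element["duration"]) * sample_rate))
        element["melody"] = analyze(samples[first:last])
    return json.dumps(elements, ensure_ascii=False)


# Usage with a stand-in analyzer that reports only the segment length.
source = json.dumps([
    {"text": "la", "start_time": 0.0, "duration": 0.5, "melody": ""},
    {"text": "la", "start_time": 0.5, "duration": 0.25, "melody": ""},
])
annotated = annotate_melody(source, [0.0] * 1000, 1000, lambda seg: str(len(seg)))
```

Passing the analyzer in as a parameter keeps the traversal independent of how the main frequency is estimated, so the FFT-based method described in the text could be swapped for another pitch estimator without touching the loop.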

5) When the video editing software adds the STT subtitles, it adds the musical notation above them. In this way, STT subtitles with musical notation are generated.
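
Placing the notation above the text, as in step 5), can be illustrated in plain text. The sketch assumes per-character records with a `melody` string and a hypothetical `text` key, and a monospaced display; `render_subtitle` is an invented name, and a real editor would position glyphs graphically instead.

```python
def render_subtitle(elements, gap=" "):
    """Return two text lines: notation on top, recognized text underneath."""
    top, bottom = [], []
    for element in elements:
        # Pad both cells of a column to the same width so the rows stay aligned.
        width = max(len(element["melody"]), len(element["text"]))
        top.append(element["melody"].ljust(width))
        bottom.append(element["text"].ljust(width))
    return gap.join(top).rstrip() + "\n" + gap.join(bottom).rstrip()


# Usage with illustrative records.
lines = render_subtitle([
    {"text": "twin", "melody": "1"},
    {"text": "kle", "melody": "1"},
    {"text": "star", "melody": "5"},
])
```

Note that `len` counts characters, not display columns, so full-width characters such as Chinese text would need a width-aware padding function to line up in a terminal.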

In this disclosed embodiment:

(1) Users of video editing software can easily generate STT subtitles with musical notation information through mobile video editing software, which makes video works more interesting;

(2) The expressive ability of STT subtitles is enhanced, which greatly improves the user's enthusiasm for editing videos and the quality of edited works.

It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described action sequence, because, according to the present disclosure, certain steps may be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all exemplary, and the actions and modules involved are not necessarily required by the present disclosure.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present disclosure, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) execute the method of each embodiment of the present disclosure.

According to an embodiment of the present disclosure, a device for implementing the first subtitle display method above is also provided. FIG. 5 is a block diagram of a first subtitle display device according to an exemplary embodiment. Referring to FIG. 5, the device includes a first receiving module 502, a first recognition module 504, a second recognition module 506, and a processing module 508, which are described below.

The first receiving module 502 is configured to receive audio content. The first recognition module 504 is connected to the first receiving module 502 and is configured to identify the audio content in response to the subtitle adding operation to obtain text content. The second recognition module 506 is connected to the first recognition module 504 and is configured to identify the melody information of the audio content in response to the melody recognition operation to obtain the melody content. The processing module 508 is connected to the second recognition module 506 and is configured to generate subtitles based on the text content and the melody content and display them on the display interface.

In some embodiments, the processing module 508 includes a splitting unit and a first processing unit. The splitting unit is configured to split the text content into independent characters and record the time information of each independent character in the audio content. The second recognition module 506 is further configured to select, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identify its melody information to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content. The first processing unit is configured to generate subtitles based on each independent character and the corresponding independent melody content and display them on the display interface.

In some embodiments, the processing module 508 further includes: a display unit, configured to display the melody content above or below the text content.

In some embodiments, the second recognition module 506 includes: a selection unit, configured to, when the time information includes the start time point and duration of each independent character in the audio content, select, based on the start time point and duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and duration; a second processing unit, configured to process the part of the audio content to obtain the frequency spectrum distribution of the part of the audio content; and a third processing unit, configured to obtain, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character.

In some embodiments, the third processing unit includes: a determining subunit, configured to determine the strongest frequency in the frequency spectrum distribution as the main frequency of each independent character when the audio content is music and the independent melody content is a musical melody; and a converting subunit, configured to convert the main frequency into music text information, where the music text information represents the musical melody of each independent character.

It should be noted here that the first receiving module 502, the first recognition module 504, the second recognition module 506, and the processing module 508 correspond to steps S21 to S24 in the above embodiment. The examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments. It should be noted that, as a part of the device, the above modules can run in the computer terminal 10 provided in the embodiment.
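
The chain of modules in FIG. 5 can be pictured as a small composition of functions. The sketch below is illustrative only: the class name `SubtitleDevice` and its method names are invented, and the two recognizers and the display are injected as plain callables because the disclosure does not tie the modules to a particular recognition engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubtitleDevice:
    """Chains the four modules of FIG. 5; the recognizers are injected callables."""

    recognize_text: Callable[[List[float]], str]    # role of first recognition module 504
    recognize_melody: Callable[[List[float]], str]  # role of second recognition module 506
    display: Callable[[str], None]                  # display interface used by module 508

    def receive(self, audio: List[float]) -> List[float]:
        # Role of first receiving module 502: accept the audio content.
        return list(audio)

    def process(self, audio: List[float], want_melody: bool = True) -> str:
        received = self.receive(audio)
        text = self.recognize_text(received)
        melody = self.recognize_melody(received) if want_melody else ""
        # Role of processing module 508: combine both results into one subtitle.
        subtitle = f"{melody}\n{text}" if melody else text
        self.display(subtitle)
        return subtitle


# Usage with stub recognizers; a list collects what would be drawn on screen.
shown: List[str] = []
device = SubtitleDevice(lambda a: "la la", lambda a: "1 5", shown.append)
with_melody = device.process([0.0, 0.1])
text_only = device.process([0.0, 0.1], want_melody=False)
```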

According to an embodiment of the present disclosure, a device for implementing the second subtitle display method above is also provided. FIG. 6 is a block diagram of a second subtitle display device according to an exemplary embodiment. Referring to FIG. 6, the device includes a playing module 602, a second receiving module 604, and a display module 606, which are described below.

The playing module 602 is configured to play a video on the display interface, where the video includes audio content. The second receiving module 604 is connected to the playing module 602 and is configured to receive a subtitle display instruction. The display module 606 is connected to the second receiving module 604 and is configured to display subtitles on the display interface in response to the subtitle display instruction, where the subtitles include text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.

It should be noted here that the playing module 602, the second receiving module 604, and the display module 606 correspond to steps S31 to S33 in the above embodiment. The examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments. It should be noted that, as a part of the device, the above modules can run in the computer terminal 10 provided in the embodiment.

Regarding the apparatus in the foregoing embodiments, the specific manner in which each module executes operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Embodiments of the present disclosure may provide an electronic device, and the electronic device may be a terminal or a server. For example, in the case that the electronic device is a terminal, the terminal may be any computer terminal device in the group of computer terminals. In some embodiments, the foregoing terminal may also be a terminal device such as a mobile terminal.

In some embodiments, the above-mentioned terminal may be located in at least one network device among multiple network devices of the computer network.

In some embodiments, Fig. 7 is a structural block diagram of a terminal according to an exemplary embodiment. As shown in FIG. 7, the terminal may include: one or more (only one is shown in the figure) processors 71, and a memory 72 for storing processor-executable instructions, where the processors are configured to execute the instructions to implement any one of the subtitle display methods above.

The memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the subtitle display method and device in the embodiments of the present disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the above subtitle display method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call, through the transmission device, the information and the application program stored in the memory to perform the following steps: receiving audio content; in response to a subtitle adding operation, identifying the audio content to obtain text content; in response to a melody recognition operation, identifying the melody information of the audio content to obtain melody content; and generating subtitles based on the text content and the melody content and displaying them on the display interface.

In some embodiments, the above-mentioned processor can also execute program code for the following steps. Generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content, which includes selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content; and generating subtitles based on each independent character and the corresponding independent melody content and displaying them on the display interface.

In some embodiments, the above-mentioned processor can also execute program code for the following steps. When the time information includes the start time point and duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character includes: selecting, based on the start time point and duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and duration; processing the part of the audio content to obtain the frequency spectrum distribution of the part of the audio content; and obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character.

In some embodiments, the above-mentioned processor can also execute program code for the following steps. Obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character includes: when the audio content is music and the independent melody content is a musical melody, determining the strongest frequency in the frequency spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.

In some embodiments, the above-mentioned processor can also execute the program code of the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, and stave notation in symbolic form.

In some embodiments, the above-mentioned processor may also execute the program code for the following steps: displaying subtitles on the display interface, including: displaying melody content above or below the text content.

The processor can call, through the transmission device, the information and the application program stored in the memory to perform the following steps: playing a video on the display interface, where the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, where the subtitles include text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.

As above, the electronic device may also be a server. Embodiments of the present disclosure provide a server. FIG. 8 is a structural block diagram of a server according to an exemplary embodiment. As shown in Figure 8, the server 17 may include: one or more (only one is shown in the figure) processing components 81, a memory 82 for storing instructions executable by the processing components 81, a power supply component 83 for providing power, a network interface 84 for communicating with an external network, and an input/output (I/O) interface 85 for exchanging data with the outside, where the processing component 81 is configured to execute the instructions to implement any one of the subtitle display methods above.

The processing component can call, through the transmission device, the information and the application program stored in the memory to perform the following steps: receiving audio content; in response to a subtitle adding operation, identifying the audio content to obtain text content; in response to a melody recognition operation, identifying the melody information of the audio content to obtain melody content; and generating subtitles based on the text content and the melody content and displaying them on the display interface.

In some embodiments, the above-mentioned processing component can also execute program code for the following steps. Generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content, which includes selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content; and generating subtitles based on each independent character and the corresponding independent melody content and displaying them on the display interface.

In some embodiments, the above-mentioned processing component can also execute program code for the following steps. When the time information includes the start time point and duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character includes: selecting, based on the start time point and duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and duration; processing the part of the audio content to obtain the frequency spectrum distribution of the part of the audio content; and obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character.

In some embodiments, the above-mentioned processing component can also execute program code for the following steps. Obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character includes: when the audio content is music and the independent melody content is a musical melody, determining the strongest frequency in the frequency spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.

In some embodiments, the above-mentioned processing component can also execute the program code of the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, and stave notation in symbolic form.

In some embodiments, the above-mentioned processing component may also execute the program code of the following steps: displaying the subtitles on the display interface includes: displaying the melody content above or below the text content.

The processing component can call, through the transmission device, the information and the application program stored in the memory to perform the following steps: playing a video on the display interface, where the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, where the subtitles include text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.

Those of ordinary skill in the art can understand that the structures shown in Fig. 7 and Fig. 8 are only schematic. For example, the above-mentioned terminal can also be a terminal device such as a smartphone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 7 and FIG. 8 do not limit the structure of the above-mentioned electronic device. For example, the electronic device may include more or fewer components (such as a network interface or a display device) than those shown in FIG. 7 and FIG. 8, or may have a configuration different from that shown in FIG. 7 and FIG. 8.

Those skilled in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing hardware related to the terminal device. The program can be stored in a computer-readable storage medium, and the computer-readable storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

In an exemplary embodiment, a computer-readable storage medium including instructions is also provided, and when the instructions in the computer-readable storage medium are executed by the processor of the terminal, the terminal is able to perform any one of the subtitle display methods above. In some embodiments, the computer-readable storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.

In some embodiments, the above-mentioned computer-readable storage medium may be used to store program codes executed by the subtitle display method provided in the above-mentioned embodiments.

In some embodiments, the above-mentioned computer-readable storage medium may be located in any computer terminal in the group of computer terminals in the computer network, or in any mobile terminal in the group of mobile terminals.

In some embodiments, the computer-readable storage medium is configured to store program codes for performing the following steps: receiving audio content; in response to subtitle addition operations, identifying the audio content to obtain text content; in response to the melody recognition operation, The melody information of the audio content is identified to obtain the melody content; based on the text content and the melody content, subtitles are generated and displayed on the display interface.

In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps. Generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content, which includes selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters together constitute the melody content corresponding to the text content; and generating subtitles based on each independent character and the corresponding independent melody content and displaying them on the display interface.

In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps. When the time information includes the start time point and duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character includes: selecting, based on the start time point and duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and duration; processing the part of the audio content to obtain the frequency spectrum distribution of the part of the audio content; and obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character.

In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps. Obtaining, based on the frequency spectrum distribution, the independent melody content corresponding to each independent character includes: when the audio content is music and the independent melody content is a musical melody, determining the strongest frequency in the frequency spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.

In some embodiments, the computer-readable storage medium is configured to store program codes for performing the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, stave notation in symbolic form.

In some embodiments, the computer-readable storage medium is configured to store program codes for performing the following steps: displaying subtitles on the display interface, including: displaying melody content above or below the text content.

In some embodiments, the computer-readable storage medium is configured to store program codes for performing the following steps: playing a video on a display interface, wherein the video includes audio content; receiving a subtitle display instruction; responding to the subtitle display instruction, The subtitle is displayed on the display interface, wherein the subtitle includes: text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.

In an exemplary embodiment, a computer program product is also provided, and when the computer program in the computer program product is executed by the processor of the terminal, the terminal is enabled to execute any one of the subtitle display methods above.

In the several embodiments provided in this application, it should be understood that the disclosed technical content can be realized in other ways. The device embodiments described above are only illustrative. For example, the division of units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be in electrical or other forms.

A unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

In a case where an integrated unit is realized in the form of a software function unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present disclosure. The aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, all of which fall within the scope of protection claimed by the present disclosure.

Claims

A subtitle display method, comprising:

receiving audio content;

in response to a subtitle addition operation, recognizing the audio content to obtain text content;

in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and

generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface.
The method according to claim 1, wherein generating subtitles based on the text content and the melody content and displaying them on a display interface comprises:

splitting the text content into independent characters, and recording time information of each independent character in the audio content;

wherein recognizing the melody information of the audio content to obtain the melody content comprises: based on the time information of each independent character in the audio content, selecting the part of the audio content corresponding to that time information and recognizing its melody information, to obtain independent melody content corresponding to each independent character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and

generating subtitles based on the independent characters and the corresponding independent melody content, and displaying the subtitles on the display interface.
The method according to claim 2, wherein, in a case where the time information includes a start time point and a duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the part of the audio content corresponding to that time information and recognizing its melody information, to obtain the independent melody content corresponding to each independent character, comprises:

based on the start time point and the duration of each independent character in the audio content, selecting the part of the audio content corresponding to the start time point and the duration;

processing the part of the audio content to obtain a spectrum distribution of the part of the audio content; and

obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
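As a non-limiting illustration of the segment selection and spectrum computation recited above, the following Python sketch may help. It assumes mono audio held as a list of float samples, and it uses a naive discrete Fourier transform as one possible way to obtain a spectrum distribution; the claims do not prescribe a particular transform. The names `select_segment`, `magnitude_spectrum`, `tone`, and `timed_chars`, the 4000 Hz sample rate, and the synthetic tones are all hypothetical.

```python
import cmath
import math


def select_segment(samples, sample_rate, start_s, duration_s):
    """Return the samples covering [start_s, start_s + duration_s)."""
    begin = int(round(start_s * sample_rate))
    end = int(round((start_s + duration_s) * sample_rate))
    return samples[begin:end]


def magnitude_spectrum(segment, sample_rate):
    """Naive DFT: (frequency_hz, magnitude) pairs for bins 0..N/2."""
    n = len(segment)
    if n == 0:
        return []
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        spectrum.append((k * sample_rate / n, abs(acc)))
    return spectrum


RATE = 4000  # assumed sample rate in Hz, kept low so the naive DFT stays fast


def tone(freq_hz, duration_s):
    """Synthetic sine tone standing in for real sung audio."""
    count = int(round(duration_s * RATE))
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(count)]


# Stand-in audio: 0.1 s at 440 Hz followed by 0.1 s at 660 Hz.
audio = tone(440, 0.1) + tone(660, 0.1)

# Hypothetical per-character timing: (character, start_s, duration_s).
timed_chars = [("la", 0.0, 0.1), ("la", 0.1, 0.1)]

# One spectrum distribution per independent character.
spectra = [
    magnitude_spectrum(select_segment(audio, RATE, start, dur), RATE)
    for _, start, dur in timed_chars
]
```

A production implementation would more likely use a fast Fourier transform with windowing; the quadratic-time transform here is chosen only to keep the sketch free of third-party dependencies.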
The method according to claim 3, wherein obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character comprises:

in a case where the audio content is music and the independent melody content is a music melody, determining the highest frequency in the spectrum distribution as the main frequency of each independent character; and

converting the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
The method according to claim 4, wherein the music text information includes at least one of the following:

Numbered musical notation and symbolic musical notation.
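As a non-limiting illustration of converting a main frequency into music text information, the following Python sketch maps a frequency to numbered musical notation. It rests on several assumptions not stated in the claims: twelve-tone equal temperament with A4 = 440 Hz, a key of C major with middle C (MIDI note 60) as degree 1, and plain-text octave marks (`'` for each octave above, `,` for each octave below) in place of the dots used in printed numbered notation. Letter names such as `A4` are offered only as a plain-text stand-in for symbolic notation. All function and constant names are hypothetical.

```python
import math

# Scale degrees for the 12 semitones above the tonic (sharps written as '#').
DEGREES = ["1", "#1", "2", "#2", "3", "4", "#4", "5", "#5", "6", "#6", "7"]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def freq_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number under equal temperament (A4 = MIDI 69)."""
    return int(round(69 + 12 * math.log2(freq_hz / a4_hz)))


def midi_to_numbered(midi, tonic_midi=60):
    """Numbered notation relative to the tonic, with plain-text octave marks."""
    octave, semitone = divmod(midi - tonic_midi, 12)
    mark = "'" * octave if octave > 0 else "," * (-octave)
    return DEGREES[semitone] + mark


def midi_to_name(midi):
    """Letter name with octave number, e.g. MIDI 69 gives 'A4'."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)


# Hypothetical main frequencies for three characters, in Hz.
for freq in (261.63, 440.0, 523.25):
    note = freq_to_midi(freq)
    print(freq, midi_to_numbered(note), midi_to_name(note))
```

Rounding to the nearest semitone means slightly off-pitch singing still maps to a note; a different tonic can be supplied through `tonic_midi` when the piece is not in C.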
The method according to any one of claims 1 to 5, wherein displaying the subtitles on a display interface comprises:

displaying the melody content above or below the text content.
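As a non-limiting illustration of placing melody content above or below text content, the following Python sketch lays out a two-line subtitle as plain text. It assumes fixed-width rendering of single-width characters and a list of (character, note) pairs as input; an actual display interface would position glyphs graphically. The function name `render_subtitle` and the sample pairs are hypothetical.

```python
def render_subtitle(pairs, melody_position="above"):
    """Return a two-line string with each note aligned to its character.

    pairs: list of (text, note) tuples, one per independent character.
    melody_position: "above" or "below" the text line.
    """
    if melody_position not in ("above", "below"):
        raise ValueError("melody_position must be 'above' or 'below'")
    # Each column is as wide as the wider of the text and its note.
    widths = [max(len(text), len(note)) for text, note in pairs]
    text_line = " ".join(
        text.ljust(w) for (text, _), w in zip(pairs, widths)).rstrip()
    melody_line = " ".join(
        note.ljust(w) for (_, note), w in zip(pairs, widths)).rstrip()
    if melody_position == "above":
        return melody_line + "\n" + text_line
    return text_line + "\n" + melody_line


# Hypothetical syllables paired with numbered-notation degrees.
pairs = [("Twin", "1"), ("kle", "1"), ("twin", "5"), ("kle", "5")]
print(render_subtitle(pairs, "above"))
```

Full-width characters such as Chinese text occupy two columns in most terminals, so a real layout would measure display width rather than string length.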
A subtitle display method, comprising:

playing a video on a display interface, wherein the video includes audio content;

receiving a subtitle display instruction; and

in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
A subtitle display device, comprising:

a first receiving module, configured to receive audio content;

a first recognition module, configured to recognize the audio content to obtain text content in response to a subtitle addition operation;

a second recognition module, configured to recognize melody information of the audio content to obtain melody content in response to a melody recognition operation; and

a processing module, configured to generate subtitles based on the text content and the melody content and to display the subtitles on a display interface.
The device according to claim 8, wherein the processing module comprises a splitting unit and a first processing unit, wherein:

the splitting unit is configured to split the text content into independent characters and to record time information of each independent character in the audio content;

the second recognition module is further configured to select, based on the time information of each independent character in the audio content, the part of the audio content corresponding to that time information and to recognize its melody information, to obtain independent melody content corresponding to each independent character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and

the first processing unit is configured to generate subtitles based on the independent characters and the corresponding independent melody content and to display the subtitles on the display interface.
The device according to claim 9, wherein the second recognition module comprises:

a selection unit, configured to, in a case where the time information includes a start time point and a duration of each independent character in the audio content, select, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and the duration;

a second processing unit, configured to process the part of the audio content to obtain a spectrum distribution of the part of the audio content; and

a third processing unit, configured to obtain, based on the spectrum distribution, the independent melody content corresponding to each independent character.
The device according to claim 10, wherein the third processing unit comprises:

a determination subunit, configured to determine the highest frequency in the spectrum distribution as the main frequency of each independent character in a case where the audio content is music and the independent melody content is a music melody; and

a conversion subunit, configured to convert the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
The device according to claim 11, wherein the music text information includes at least one of the following:

Numbered musical notation and symbolic musical notation.
The device according to any one of claims 8 to 12, wherein the processing module comprises:

a display unit, configured to display the melody content above or below the text content.
A subtitle display device, comprising:

a playback module, configured to play a video on a display interface, wherein the video includes audio content;

a second receiving module, configured to receive a subtitle display instruction; and

a display module, configured to display subtitles on the display interface in response to the subtitle display instruction, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
An electronic device comprising:

a processor; and

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the instructions to implement the following steps:

receiving audio content;

in response to a subtitle addition operation, recognizing the audio content to obtain text content;

in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and

generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;

or to implement the following steps:

playing a video on a display interface, wherein the video includes audio content;

receiving a subtitle display instruction; and

in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
A computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:

receiving audio content;

in response to a subtitle addition operation, recognizing the audio content to obtain text content;

in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and

generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;

or to perform the following steps:

playing a video on a display interface, wherein the video includes audio content;

receiving a subtitle display instruction; and

in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
A computer program product, comprising a computer program, wherein the computer program implements the following steps when executed by a processor:

receiving audio content;

in response to a subtitle addition operation, recognizing the audio content to obtain text content;

in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and

generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;

or implements the following steps:

playing a video on a display interface, wherein the video includes audio content;

receiving a subtitle display instruction; and

in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.