WO2023005193A1 - Subtitle display method and device - Google Patents

Subtitle display method and device

Info

Publication number
WO2023005193A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
melody
text
independent
audio content
Prior art date
Application number
PCT/CN2022/076656
Other languages
French (fr)
Chinese (zh)
Inventor
卢家辉
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2023005193A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres

Definitions

  • the present disclosure relates to the field of computers, and in particular to a subtitle display method, device, electronic equipment, and computer-readable storage medium.
  • STT (Speech To Text) subtitles are subtitles generated by speech recognition.
  • STT subtitles recognized by a speech recognition function often reflect only a single aspect of the audio content, namely its text.
  • the disclosure provides a subtitle display method, device, electronic equipment, and computer-readable storage medium.
  • A method for displaying subtitles, including: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on a display interface.
  • Generating subtitles and displaying them on the display interface includes: splitting the text content into independent characters, and recording time information of each independent character in the audio content. Recognizing the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting and recognizing the melody information of the part of the audio content corresponding to that time information, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content. Subtitles are then generated based on the independent characters and their corresponding independent melody content, and displayed on the display interface.
  • the time information includes the start time point and duration of the independent characters in the audio content
  • Selecting and recognizing the melody information of the part of the audio content corresponding to the time information, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; processing the part of the audio content to obtain its spectral distribution; and obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character.
  • Obtaining the independent melody content corresponding to each independent character based on the spectral distribution includes: when the audio content is music and the independent melody content is a music melody, determining the highest frequency in the spectral distribution as the main frequency of each independent character; and converting the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
  • the music text information includes at least one of the following: numbered musical notation in digital form, and staff notation in symbolic form.
  • the displaying the subtitle on the display interface includes: displaying the melody content above or below the text content.
  • A method for displaying subtitles, including: playing a video on a display interface, wherein the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing the melody information of the audio content.
  • A subtitle display device, including: a first receiving module, configured to receive audio content; a first identification module, configured to identify the audio content in response to a subtitle adding operation to obtain text content; a second identification module, configured to identify melody information of the audio content in response to a melody recognition operation to obtain melody content; and a processing module, configured to generate subtitles based on the text content and the melody content and display them on a display interface.
  • The processing module includes a splitting unit and a first processing unit, wherein the splitting unit is configured to split the text content into independent characters and record time information of each independent character in the audio content; the second identification module is further configured to select and identify, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and the first processing unit is configured to generate subtitles based on the independent characters and the corresponding independent melody content and display them on the display interface.
  • The second identification module includes: a selection unit, configured to, when the time information includes the start time point and duration of each independent character in the audio content, select, based on that start time point and duration, the part of the audio content corresponding to the start time point and the duration; a second processing unit, configured to process the part of the audio content to obtain its spectral distribution; and a third processing unit, configured to obtain the independent melody content corresponding to each independent character based on the spectral distribution.
  • The third processing unit includes: a determining subunit, configured to determine that the highest frequency in the spectral distribution is the main frequency of each independent character; and a conversion subunit, configured to convert the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
  • the music text information includes at least one of the following: numbered musical notation in digital form, and staff notation in symbolic form.
  • the processing module includes: a display unit, configured to display the melody content above or below the text content.
  • A subtitle display device, including: a playing module, configured to play a video on a display interface, wherein the video includes audio content; a second receiving module, configured to receive a subtitle display instruction; and a display module, configured to display subtitles on the display interface in response to the subtitle display instruction, wherein the subtitles include text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.
  • An electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement any one of the subtitle display methods described above.
  • A computer-readable storage medium, wherein, when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform any one of the subtitle display methods described above.
  • A computer program product, including a computer program, wherein, when the computer program is executed by a processor, any one of the subtitle display methods described above is implemented.
  • The audio content is recognized to obtain the text content and the melody content, subtitles are generated based on the text content and the melody content, and the subtitles are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text cannot reflect. This reduces the loss of audio content as much as possible, reflects the audio content more fully, and avoids the problem that subtitles displayed from audio reflect only a single aspect of the audio content.
  • Fig. 1 is a block diagram showing a hardware structure of a computer terminal for implementing a subtitle display method according to an exemplary embodiment.
  • Fig. 2 is a flow chart of a first subtitle display method according to an exemplary embodiment.
  • Fig. 3 is a flow chart of a second subtitle display method according to an exemplary embodiment.
  • Fig. 4 is a flow chart of a subtitle display method according to an embodiment of the present disclosure.
  • Fig. 5 is a device block diagram of a first subtitle display device according to an exemplary embodiment.
  • Fig. 6 is a device block diagram of a second subtitle display device according to an exemplary embodiment.
  • Fig. 7 is a device block diagram of a terminal according to an exemplary embodiment.
  • Fig. 8 is a structural block diagram of a server according to an exemplary embodiment.
  • STT subtitles: STT is the abbreviation of Speech To Text, that is, "from speech to text".
  • speech recognition technology is used to convert the audio input by the user into text, and then convert the text into subtitles and embed them in the video, which is called STT subtitles.
  • FFT: FFT is the abbreviation of Fast Fourier Transform.
  • The FFT is a method for quickly calculating the discrete Fourier transform (DFT, Discrete Fourier Transform) of a sequence, or its inverse transform.
  • Fourier analysis transforms a signal from the original domain (usually time or space) to a representation in the frequency domain or vice versa.
  • The FFT computes such transformations quickly by decomposing the DFT matrix into a product of sparse (mostly zero) factors. It therefore reduces the complexity of computing the DFT from O(n^2), which is required when computing directly from the DFT definition, to O(n log n), where n is the data size.
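  • As an illustration of the complexity difference described above, the following minimal sketch compares a direct evaluation of the DFT definition with a recursive radix-2 Cooley-Tukey FFT. It is not taken from the disclosure: the function names `naive_dft` and `fft` and the random test signal are assumptions made for this example, and the FFT shown only handles input lengths that are powers of two.

```python
import cmath
import random


def naive_dft(x):
    """Evaluate the DFT definition directly: O(n^2) multiplications."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT: O(n log n).

    Illustrative only; the length of x must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(64)]
max_error = max(abs(a - b) for a, b in zip(naive_dft(signal), fft(signal)))
print(f"largest difference between DFT and FFT: {max_error:.2e}")
```

Both functions compute the same transform; only the number of operations differs, which matters once segments contain thousands of samples.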
  • Numbered musical notation uses numbers to represent the melody of music. It is based on movable solfa: the numbers 1, 2, 3, 4, 5, 6, and 7 represent the seven basic degrees of the scale, sung as do, re, mi, fa, sol, la, ti (si in Chinese usage), and a rest is represented by 0. In English letter names, with 1 = C, the degrees are written C, D, E, F, G, A, B. The time value of each plain number is equivalent to a quarter note in staff notation.
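  • The mapping described above can be written as a small lookup table. This is only a sketch: the names `NUMBERED_NOTATION` and `melody_to_letters` are hypothetical, and the letter names assume that 1 is C.

```python
# Scale degree -> (solfege syllable, letter name when 1 = C); 0 is a rest.
NUMBERED_NOTATION = {
    0: ("rest", None),
    1: ("do", "C"),
    2: ("re", "D"),
    3: ("mi", "E"),
    4: ("fa", "F"),
    5: ("sol", "G"),
    6: ("la", "A"),
    7: ("ti", "B"),
}


def melody_to_letters(digits):
    """Translate a sequence of numbered-notation digits to letter names."""
    return [NUMBERED_NOTATION[d][1] for d in digits]


print(melody_to_letters([1, 1, 5, 5, 6, 6, 5]))
```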
  • Music spectrum analysis is a very commonly used algorithm. Spectrum principle: According to Fourier analysis, any sound can be decomposed into several or even infinite sine waves, and they often contain countless harmonic components. Using FFT (Fast Fourier Transform), digital signals can be converted from time-domain signals to frequency-domain signals to obtain the spectral characteristics of music.
  • a method embodiment of a subtitle display method is proposed. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases, The steps shown or described may be performed in an order different than here.
  • FIG. 1 is a block diagram showing a hardware structure of a computer terminal (or mobile device) for realizing a subtitle display method according to an exemplary embodiment.
  • The computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; the processors 102 may include, but are not limited to, processing devices such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)), a memory 104 for storing data, and a transmission device for communication functions.
  • FIG. 1 is only a schematic diagram, and it does not limit the structure of the above-mentioned electronic device.
  • computer terminal 10 may also include more or fewer components than shown in FIG. 1 , or have a different configuration than that shown in FIG. 1 .
  • the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits".
  • the data processing circuit may be implemented in whole or in part as software, hardware, firmware or other arbitrary combinations.
  • the data processing circuit can be a single independent processing module, or be fully or partially integrated into any of the other elements in the computer terminal 10 (or mobile device).
  • the data processing circuit serves as a processor control (for example, the selection of the variable resistor terminal path connected to the interface).
  • The memory 104 can be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the subtitle display method in the embodiments of the present disclosure. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, it implements the subtitle display method described above.
  • the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory that is remotely located relative to the processor 102 , and these remote memories may be connected to the computer terminal 10 through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission means are used to receive or transmit data via a network.
  • the aforementioned network may include a wireless network provided by a communication provider of the computer terminal 10 .
  • the transmission device includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device may be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet in a wireless manner.
  • the display may be, for example, a touchscreen liquid crystal display (LCD), which may enable a user to interact with the user interface of the computer terminal 10 (or mobile device).
  • The computer device (or mobile device) shown in FIG. 1 may include hardware components (including circuits), software components (including computer code stored on computer-readable media), or a combination of both hardware and software components. It should be noted that FIG. 1 is only one specific example, and is intended to illustrate the types of components that may be present in a computer device (or mobile device) as described above.
  • Fig. 2 is a flow chart of a first subtitle display method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the above-mentioned computer terminal and includes the following steps S21 to S24.
  • In step S21, audio content is received;
  • In step S22, in response to the subtitle adding operation, the audio content is recognized to obtain the text content;
  • In step S23, in response to the melody recognition operation, the melody information of the audio content is recognized to obtain the melody content;
  • In step S24, based on the text content and the melody content, subtitles are generated and displayed on the display interface.
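  • Steps S21 to S24 can be sketched as a small pipeline. The sketch below is an illustration under stated assumptions, not the disclosed implementation: the `Subtitle` type, the `generate_subtitle` function, and the two stub recognizers are hypothetical names, and real speech and melody recognition would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]


@dataclass
class Subtitle:
    text: str = ""
    melody: str = ""


def generate_subtitle(
    audio: Audio,                                # S21: received audio content
    recognize_text: Callable[[Audio], str],
    recognize_melody: Callable[[Audio], str],
    subtitle_adding: bool = True,                # was the subtitle adding operation triggered?
    melody_recognition: bool = True,             # was the melody recognition operation triggered?
) -> Subtitle:
    subtitle = Subtitle()
    if subtitle_adding:                          # S22: audio -> text content
        subtitle.text = recognize_text(audio)
    if melody_recognition:                       # S23: audio -> melody content
        subtitle.melody = recognize_melody(audio)
    return subtitle                              # S24: subtitle ready for display


# Hypothetical stand-ins for real recognizers.
def fake_text(audio: Audio) -> str:
    return "la la"


def fake_melody(audio: Audio) -> str:
    return "1 5"


result = generate_subtitle([0.0] * 8, fake_text, fake_melody)
print(result.melody)
print(result.text)
```

Passing the recognizers in as callables keeps the flow independent of any particular speech or melody recognition library.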
  • Through the above steps, the audio content is recognized to obtain the text content and the melody content, subtitles are generated based on the text content and the melody content, and the subtitles are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text cannot reflect. This reduces the loss of audio content as much as possible, reflects the audio content more fully, and avoids the problem that subtitles displayed from audio reflect only a single aspect of the audio content.
  • audio content is received, wherein the audio content can be various types of audio, for example, it can be a recording, a song, a video, and so on.
  • the format of audio content can also be multiple, for example MP3 (Moving Picture Experts Group Audio Layer 3) format, WMA (Windows Media Audio) format, etc.
  • the audio content in response to the subtitle adding operation, is identified to obtain the text content.
  • the subtitle adding operation may be based on an operation on a predetermined control, or may be configured by default in the system, for example, the operation is automatically triggered upon receiving audio content. Therefore, it can be flexibly set based on the needs of different scenarios.
  • To recognize the audio content, various methods can be adopted; for example, recognition can be performed with various intelligent speech processing software.
  • The audio content may be recognized in real time or not in real time, depending on requirements.
  • the melody information of the audio content is recognized to obtain the melody content.
  • the melody recognition operation can be based on the operation of the melody selection control, or it can be configured by default by the system.
  • The melody recognition operation and the above subtitle adding operation can be unified into one operation, that is, the melody recognition function is triggered in response to receiving the subtitle adding operation, thereby simplifying the operation process and avoiding a second operation.
  • the melody information of the audio content includes multiple types, for example, in the audio content, there are multiple types of melody information that can be expressed.
  • a variety of melody information can be analyzed according to the fundamental frequency and pitch, harmonics and timbre, amplitude and sound intensity, sound width and frequency band of the audio content.
  • the melody of the song can be judged according to the frequency of the audio content, and then a staff notation can be automatically generated or a corresponding numbered notation can be displayed on each subtitle of the song, etc.
  • subtitles are generated and displayed on the display interface based on the textual content and the melodic content. That is, the melody content can be carried on the subtitles.
  • For example, when the audio content is music, the melody of the music can be displayed on the subtitles.
  • The melody can be expressed in several ways, for example by using 1, 2, 3, 4, 5, 6, 7 to represent the seven basic degrees of the scale, or by using C, D, E, F, G, A, B.
  • In a song, the same word in the lyrics may be sung to different melodies in different pieces of music. For example, the word "Ah" appears in many pieces of music; although the word is the same, it expresses different melodies, and the melody of the word can be marked above the word "Ah".
  • In addition, many pieces played on musical instruments have no lyrics but do have melodies. In such cases, the audio content can be conveyed by the melody information in the subtitles, so that the subtitles reflect the audio content more completely.
  • subtitles are displayed on the display interface.
  • Displaying the subtitles on the display interface includes displaying the melody content above or below the text content.
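  • One simple way to place the melody content above or below the text content is to pad each text unit and its note to a common column width. The sketch below assumes a monospaced display and a hypothetical `render_subtitle` helper; it is not the rendering used by the disclosure.

```python
def render_subtitle(units, notes, position="above"):
    """Return a two-line subtitle with each note aligned to its text unit.

    Assumes a monospaced font; len() does not account for full-width
    characters, so CJK text would need a width-aware measure instead.
    """
    if len(units) != len(notes):
        raise ValueError("each text unit needs exactly one note")
    if position not in ("above", "below"):
        raise ValueError("position must be 'above' or 'below'")
    widths = [max(len(u), len(n)) for u, n in zip(units, notes)]
    text_line = " ".join(u.ljust(w) for u, w in zip(units, widths)).rstrip()
    melody_line = " ".join(n.ljust(w) for n, w in zip(notes, widths)).rstrip()
    lines = [melody_line, text_line] if position == "above" else [text_line, melody_line]
    return "\n".join(lines)


print(render_subtitle(["Twin", "kle", "twin", "kle"], ["1", "1", "5", "5"]))
```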
  • the subtitles can display more information about the audio content, which enriches the user's perception and enhances the user's experience.
  • When the user is an editor of audio content, the user can easily generate STT subtitles that carry melody information (for example, numbered musical notation) for music. This makes the subtitles more fun to watch and more expressive, and can improve users' enthusiasm for editing videos and the quality of works related to audio content.
  • The text content is split into independent characters, and the time information of each independent character in the audio content is recorded. Recognizing the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting and recognizing the melody information of the part of the audio content corresponding to that time information, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content. Subtitles are then generated based on each independent character and its corresponding independent melody content, and displayed on the display interface.
  • Each recognized character can be regarded as an independent character, and the time information of the independent character in the audio content can be recorded. Afterwards, according to the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information is selected and recognized to obtain the independent melody content corresponding to each independent character. The independent melody contents corresponding to the independent characters included in the text content together constitute the melody of the whole song.
  • When the time information includes the start time point and duration of each independent character in the audio content, the melody information of the part of the audio content corresponding to the time information can be selected and recognized as follows: based on the start time point and duration of each independent character in the audio content, select the part of the audio content corresponding to that start time point and duration; process the part of the audio content to obtain its spectral distribution; and, based on the spectral distribution, obtain the independent melody content corresponding to each independent character.
  • the start time of each character and the duration of each character may be recorded in seconds or smaller time units.
  • the obtained independent melody content can be made to correspond to the independent text.
  • various methods can also be used. For example, a fast Fourier transform can be performed on the part of the audio content to obtain the spectral distribution of the part of the audio content .
  • For example, the following operations can be used. First, the audio signal corresponding to each independent character is determined: the start time and duration of each independent character in the text content are read, and the audio signal within that time period is taken from the original audio file. This audio signal is used as the input of the fast Fourier transform algorithm, which yields the spectral distribution of the original audio file within that time period. The independent melody content corresponding to each independent character is then obtained from the spectral distribution.
  • When the independent melody content corresponding to each independent character is obtained based on the spectral distribution, various methods can also be adopted. For example, when the audio content is music and the independent melody content is a music melody, the highest frequency in the spectral distribution is determined as the main frequency of each independent character, and the main frequency is converted into music text information, wherein the music text information represents the music melody of each independent character.
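  • The disclosure does not spell out how a main frequency is converted into music text information. One plausible conversion, shown here as a sketch under assumptions, rounds the frequency to the nearest equal-tempered pitch (A4 = 440 Hz) and reads off its scale degree relative to a tonic. The function name `frequency_to_numbered`, the default tonic of middle C (1 = C), the "#" prefix for raised degrees, and the omission of octave dots are all choices made for this example.

```python
import math

# Semitones above the tonic -> numbered-notation degree (major scale).
DEGREE_BY_SEMITONE = {0: "1", 2: "2", 4: "3", 5: "4", 7: "5", 9: "6", 11: "7"}


def frequency_to_numbered(freq_hz, tonic_midi=60):
    """Map a main frequency to a numbered-notation symbol.

    Assumes equal temperament with A4 = 440 Hz and a movable tonic given as
    a MIDI note number (60 = middle C). Octave marks are not produced.
    """
    if freq_hz <= 0:
        return "0"  # treat silence as a rest
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    semitone = (midi - tonic_midi) % 12
    if semitone in DEGREE_BY_SEMITONE:
        return DEGREE_BY_SEMITONE[semitone]
    # Notes between scale degrees are written as a raised lower degree.
    return "#" + DEGREE_BY_SEMITONE[semitone - 1]


for freq in (261.63, 392.0, 440.0):
    print(freq, "->", frequency_to_numbered(freq))
```

Changing `tonic_midi` moves "1" to another key, which matches the movable-solfa basis of numbered notation.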
  • the music text information can be in various forms, for example, it can be musical notation in digital form, or stave notation in symbolic form, and so on.
  • Fig. 3 is a flow chart of a second subtitle display method according to an exemplary embodiment. As shown in Fig. 3, the method is used in the above-mentioned computer terminal and includes the following steps S31 to S33.
  • In step S31, a video is played on the display interface, wherein the video includes audio content;
  • In step S32, a subtitle display instruction is received;
  • In step S33, in response to the subtitle display instruction, subtitles are displayed on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing the melody information of the audio content.
  • The subtitles display not only the text content of the audio but also the melody content that the text cannot reflect, which reduces the loss of audio content as much as possible, reflects the audio content more completely, and avoids the problem that subtitles displayed from audio reflect only a single aspect of the audio content.
  • the information expressed by the STT subtitles recognized by the speech recognition function through the mobile video editing software is not as rich as the audio content, and the information in the audio content other than text will be lost in the speech recognition process.
  • STT subtitles can only express text content, and information such as music melodies cannot be expressed in STT subtitles. And these music melody information itself is also one of the information content of this audio.
  • a method for displaying subtitles is provided, in which method, while generating STT subtitles, the music melody information of the audio content is expressed on the subtitles.
  • For example, the mobile video editing software recognizes the content that the user sings and adds numbered musical notation to the displayed subtitles.
  • This method uses the spectrum recognition algorithm to add the melody in the audio to the STT subtitles in the form of musical notation, which makes the STT subtitles more expressive, more interesting, and can improve the spread of video works.
  • FIG. 4 is a flow chart of a subtitle display method according to an embodiment of the present disclosure. As shown in FIG. 4 , based on a scene where a user performs video clipping on audio content, the following details are introduced:
  • the user uses a mobile video editing software to import a piece of audio content.
  • The audio content is recognized as text. During recognition, the start time and duration (in seconds) of each character in the audio are recorded, and the text, start time, and duration of each character are saved as JSON text.
  • each recognized text is used as an element in the array, and the start time (start_time) and duration (duration) of each text are also recorded in the element.
  • the melody field represents the melody at the time point of the word, and the melody will be processed and obtained below.
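  • The JSON example from the original publication is not reproduced in this text. The following sketch shows one structure consistent with the description: an array at the root whose elements carry `start_time`, `duration`, and `melody`. The `text` field name and all values are assumptions made for illustration.

```python
import json

# Hypothetical entries: one element per recognized character.
# "start_time" and "duration" are in seconds; "melody" is still empty
# because it is filled in by a later processing step.
subtitle_entries = [
    {"text": "la", "start_time": 0.0, "duration": 0.5, "melody": ""},
    {"text": "la", "start_time": 0.5, "duration": 0.5, "melody": ""},
]

encoded = json.dumps(subtitle_entries, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)
print(encoded)
```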
  • In step 4, the array at the JSON root node is traversed. For each element (each character), the start time and duration are read, and all sound samples within that time period are taken from the original audio file. These samples are used as the input of the FFT algorithm, which yields the spectral distribution of the original audio within that period. The frequency with the strongest component in the spectral distribution is then taken as the main frequency at that time point, and the main frequency is recorded in the JSON, in numbered musical notation, in the melody field.
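  • Step 4 can be sketched end to end on synthetic audio. This is a simplified illustration, not the disclosed implementation: a direct DFT stands in for the FFT to keep the example short, the tones, sample rate, and `text` field are invented, and the conversion to numbered notation assumes equal temperament with 1 = C and no octave marks.

```python
import cmath
import json
import math

DEGREES = {0: "1", 2: "2", 4: "3", 5: "4", 7: "5", 9: "6", 11: "7"}


def main_frequency(segment, sample_rate):
    """Return the frequency (Hz) of the strongest spectral component.

    Uses a direct DFT for brevity; a real implementation would use an FFT.
    """
    n = len(segment)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stay below the Nyquist frequency
        coeff = sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, s in enumerate(segment))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def to_numbered(freq_hz):
    """Nearest equal-tempered pitch -> scale degree, assuming 1 = C."""
    if freq_hz <= 0:
        return "0"
    semitone = round(69 + 12 * math.log2(freq_hz / 440.0)) % 12
    return DEGREES.get(semitone) or "#" + DEGREES[semitone - 1]


def annotate(entries, samples, sample_rate):
    """Fill the melody field of every element in the JSON root array."""
    for entry in entries:
        start = round(entry["start_time"] * sample_rate)
        count = round(entry["duration"] * sample_rate)
        segment = samples[start:start + count]
        entry["melody"] = to_numbered(main_frequency(segment, sample_rate))
    return entries


# Synthetic audio: 0.1 s near C4 (260 Hz) followed by 0.1 s near G4 (390 Hz).
RATE = 2000
samples = [math.sin(2 * math.pi * 260 * t / RATE) for t in range(200)]
samples += [math.sin(2 * math.pi * 390 * t / RATE) for t in range(200)]

entries = json.loads(
    '[{"text": "la", "start_time": 0.0, "duration": 0.1, "melody": ""},'
    ' {"text": "la", "start_time": 0.1, "duration": 0.1, "melody": ""}]'
)
print(json.dumps(annotate(entries, samples, RATE), indent=2))
```

With these synthetic tones the two elements are annotated with degrees 1 and 5 respectively.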
  • Video editing software users can easily generate STT subtitles with musical notation information through mobile video editing software, which improves the interest of video works;
  • the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk or an optical disk) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) execute the methods of the embodiments of the present disclosure.
  • FIG. 5 is a device block diagram of a subtitle display device 1 according to an exemplary embodiment.
  • the device includes: a first receiving module 502 , a first identification module 504 , a second identification module 506 and a processing module 508 , and the device will be described below.
  • the first receiving module 502 is used to receive audio content;
  • the first identification module 504 is connected to the above-mentioned first receiving module 502, and is used to identify the audio content in response to the subtitle adding operation to obtain text content;
  • the second identification module 506 is connected to the above-mentioned first identification module 504 and is used to identify the melody information of the audio content in response to the melody recognition operation to obtain the melody content;
  • the processing module 508 is connected to the above-mentioned second identification module 506 and is used to generate subtitles based on the text content and the melody content and display them on the display interface.
  • the processing module 508 includes: a splitting unit and a first processing unit, wherein the splitting unit is configured to split the text content into independent characters, and record the time information of each independent character in the audio content;
  • the second identification module is also used to select, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information for identification, and to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content;
  • the first processing unit is configured to generate subtitles based on each independent text and the corresponding independent melody content and display them on the display interface.
  • the processing module 508 further includes: a display unit, configured to display the melody content above or below the text content.
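As a rough illustration of displaying the melody content above or below the text content, the sketch below aligns each word with its melody symbol on two plain-text lines. The function name `render_subtitle` and the `text` and `melody` keys are assumptions for illustration; a real display interface would lay out glyphs graphically, and full-width characters would need width-aware padding.

```python
def render_subtitle(words, melody_above=True):
    """Lay out melody symbols and text on two column-aligned lines.

    `words` is a list of dicts with hypothetical "text" and "melody" keys.
    """
    text_cells, melody_cells = [], []
    for word in words:
        # Pad both cells to the wider of the two so the columns line up.
        width = max(len(word["text"]), len(word["melody"]))
        text_cells.append(word["text"].ljust(width))
        melody_cells.append(word["melody"].ljust(width))
    text_line = " ".join(text_cells).rstrip()
    melody_line = " ".join(melody_cells).rstrip()
    if melody_above:
        lines = [melody_line, text_line]
    else:
        lines = [text_line, melody_line]
    return "\n".join(lines)
```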
  • the second identification module 506 includes: a selection unit, configured to, when the time information includes the start time point and the duration of each independent character in the audio content, select, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; a second processing unit, configured to process that part of the audio content to obtain its spectrum distribution; and a third processing unit, configured to obtain, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  • the third processing unit includes: a determining subunit, configured to determine the highest frequency in the spectrum distribution as the main frequency of each independent character when the audio content is music and the independent melody content is a music melody;
  • and a further subunit, configured to convert the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
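The text does not say how a main frequency is converted into music text information. One possible conversion to numbered musical notation is sketched below, assuming twelve-tone equal temperament with A4 = 440 Hz and the key of C (so C is written as 1); these assumptions, the table `JIANPU_DEGREES` and the function name `frequency_to_jianpu` are illustrative and not taken from the source.

```python
import math

# Numbered-notation symbol for each semitone above C, assuming the key of C.
JIANPU_DEGREES = ["1", "#1", "2", "#2", "3", "4", "#4", "5", "#5", "6", "#6", "7"]


def frequency_to_jianpu(freq_hz, a4=440.0):
    """Map a frequency to (degree, octave_shift) in numbered notation.

    octave_shift is 0 for the octave starting at middle C, positive for
    higher octaves (dots above the digit) and negative for lower ones.
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # Nearest MIDI note number in equal temperament (A4 is note 69).
    midi = round(69 + 12 * math.log2(freq_hz / a4))
    degree = JIANPU_DEGREES[midi % 12]
    octave_shift = midi // 12 - 5  # MIDI note 60 (middle C) maps to 0
    return degree, octave_shift
```

For example, a main frequency of 440 Hz maps to degree 6 in the middle octave, and 523.25 Hz maps to degree 1 one octave up.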
  • the first receiving module 502, the first identification module 504, the second identification module 506 and the processing module 508 correspond to steps S21 to S24 in the above embodiment. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments.
  • the above modules can run in the computer terminal 10 provided in the embodiment.
  • FIG. 6 is a device block diagram of a subtitle display device 2 according to an exemplary embodiment.
  • the device includes: a playback module 602 , a second receiving module 604 and a display module 606 , and the device will be described below.
  • the playing module 602 is used to play the video on the display interface, wherein the video includes audio content;
  • the second receiving module 604 is connected to the above-mentioned playing module 602 and is used to receive subtitle display instructions;
  • the display module 606 is connected to the above-mentioned second receiving module 604 and is used to display the subtitle on the display interface in response to the subtitle display instruction, wherein the subtitle includes text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing the melody information of the audio content.
  • the playing module 602, the second receiving module 604 and the display module 606 correspond to steps S31 to S33 in the above embodiment, and the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but It is not limited to the content disclosed in the above embodiments. It should be noted that, as a part of the device, the above modules can run in the computer terminal 10 provided in the embodiment.
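The text does not describe how the displayed subtitle is kept in step with playback. Assuming the per-word records carry `start_time` and `duration` in seconds and are sorted by start time, one way to find the word to show at a given playback position is a binary search, sketched below; the function name `active_word` is hypothetical.

```python
from bisect import bisect_right


def active_word(words, playback_time):
    """Return the record being sung at playback_time, or None in a gap.

    `words` must be sorted by "start_time"; times are in seconds.
    """
    starts = [word["start_time"] for word in words]
    # Index of the last word that starts at or before playback_time.
    index = bisect_right(starts, playback_time) - 1
    if index < 0:
        return None
    word = words[index]
    if playback_time < word["start_time"] + word["duration"]:
        return word
    return None
```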
  • Embodiments of the present disclosure may provide an electronic device, and the electronic device may be a terminal or a server.
  • the terminal may be any computer terminal device in the group of computer terminals.
  • the foregoing terminal may also be a terminal device such as a mobile terminal.
  • the above-mentioned terminal may be located in at least one network device among multiple network devices of the computer network.
  • Fig. 7 is a structural block diagram of a terminal according to an exemplary embodiment.
  • the terminal may include: one or more processors 71 (only one is shown in the figure), and a memory 72 for storing processor-executable instructions, wherein the processor is configured to execute the instructions to implement any one of the subtitle display methods above.
  • the memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the subtitle display method and device in the embodiments of the present disclosure. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the above subtitle display method.
  • the memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory may further include a memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the processor can call the information and the application program stored in the memory through the transmission device to perform the following steps: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing the melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on the display interface.
  • the above-mentioned processor can also execute the program code of the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and, based on each independent character and its corresponding independent melody content, generating subtitles and displaying them on the display interface.
  • the above-mentioned processor can also execute the program code of the following steps: when the time information includes the start time point and the duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; processing that part of the audio content to obtain its spectrum distribution; and obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  • the above-mentioned processor can also execute the program code of the following steps: based on the frequency spectrum distribution, obtain the independent melody content corresponding to each independent text, including: when the audio content is music, and the independent melody content is music melody, The highest frequency in the frequency spectrum distribution is determined as the main frequency of each independent text; the main frequency is converted into music text information, wherein the music text information represents the music melody of each independent text.
  • the above-mentioned processor can also execute the program code of the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, and stave notation in symbolic form.
  • the above-mentioned processor may also execute the program code for the following steps: displaying subtitles on the display interface, including: displaying melody content above or below the text content.
  • the processor can call the information stored in the memory and the application program through the transmission device to perform the following steps: play a video on the display interface, wherein the video includes audio content; receive a subtitle display instruction; respond to the subtitle display instruction, and display on the display interface Displaying subtitles, wherein the subtitles include: text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.
  • FIG. 8 is a structural block diagram of a server according to an exemplary embodiment.
  • the server 17 may include: one or more (only one is shown in the figure) processing components 81, a memory 82 for storing executable instructions of the processing components 81, a power supply component 83 for providing power, and realizing the same
  • the memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the subtitle display method and device in the embodiments of the present disclosure. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the above subtitle display method.
  • the memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory may further include a memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the processing component can call the information and the application program stored in the memory through the transmission device to perform the following steps: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing the melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on the display interface.
  • the above-mentioned processing component can also execute the program code of the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and, based on each independent character and its corresponding independent melody content, generating subtitles and displaying them on the display interface.
  • the above-mentioned processing component can also execute the program code of the following steps: when the time information includes the start time point and the duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; processing that part of the audio content to obtain its spectrum distribution; and obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  • the above-mentioned processing component can also execute the program code of the following steps: based on the frequency spectrum distribution, obtain the independent melody content corresponding to each independent text, including: when the audio content is music, and the independent melody content is music melody, The highest frequency in the frequency spectrum distribution is determined as the main frequency of each independent text; the main frequency is converted into music text information, wherein the music text information represents the music melody of each independent text.
  • the above-mentioned processing component can also execute the program code of the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, and stave notation in symbolic form.
  • the above-mentioned processing component may also execute the program code of the following steps: displaying the subtitles on the display interface includes: displaying the melody content above or below the text content.
  • the processing component can call the information stored in the memory and the application program through the transmission device to perform the following steps: play the video on the display interface, wherein the video includes audio content; receive a subtitle display instruction; respond to the subtitle display instruction, and display on the display interface Displaying subtitles, wherein the subtitles include: text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.
  • the structures shown in Fig. 7 and Fig. 8 are only schematic.
  • the above-mentioned terminal can also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Device, MID), a PAD or other terminal equipment.
  • FIGS. 7 and 8 do not limit the structure of the above-mentioned electronic device.
  • for example, the electronic device may include more or fewer components (such as a network interface or a display device) than shown in FIGS. 7 and 8, or may have a configuration different from that shown in FIGS. 7 and 8.
  • the computer-readable storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, and the like.
  • a computer-readable storage medium including instructions is also provided, and when the instructions in the computer-readable storage medium are executed by the processor of the terminal, the terminal is able to perform any one of the subtitle display methods above .
  • the computer-readable storage medium may be a non-transitory computer-readable storage medium, for example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk and optical data storage devices, etc.
  • the above-mentioned computer-readable storage medium may be used to store program codes executed by the subtitle display method provided in the above-mentioned embodiments.
  • the above-mentioned computer-readable storage medium may be located in any computer terminal in the group of computer terminals in the computer network, or in any mobile terminal in the group of mobile terminals.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: receiving audio content; in response to subtitle addition operations, identifying the audio content to obtain text content; in response to the melody recognition operation, The melody information of the audio content is identified to obtain the melody content; based on the text content and the melody content, subtitles are generated and displayed on the display interface.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and, based on each independent character and its corresponding independent melody content, generating subtitles and displaying them on the display interface.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: when the time information includes the start time point and the duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information for identification, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; processing that part of the audio content to obtain its spectrum distribution; and obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: obtaining the independent melody content corresponding to each independent character based on the spectrum distribution includes: when the audio content is music and the independent melody content is a music melody, determining the highest frequency in the spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: the music text information includes at least one of the following: numbered musical notation in digital form, stave notation in symbolic form.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: displaying subtitles on the display interface, including: displaying melody content above or below the text content.
  • the computer-readable storage medium is configured to store program codes for performing the following steps: playing a video on a display interface, wherein the video includes audio content; receiving a subtitle display instruction; responding to the subtitle display instruction, The subtitle is displayed on the display interface, wherein the subtitle includes: text content and melody content, the text content is obtained by identifying the audio content, and the melody content is obtained by identifying the melody information of the audio content.
  • a computer program product is also provided, and when the computer program in the computer program product is executed by the processor of the terminal, the terminal is enabled to execute any one of the subtitle display methods above.
  • the disclosed technical content can be realized in other ways.
  • the device embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be in electrical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if an integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present disclosure.
  • the aforementioned computer-readable storage medium includes a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.

Abstract

A subtitle display method and device, an electronic apparatus, a computer-readable storage medium and a computer program product, the method comprising: receiving audio content (S21); in response to a subtitle addition operation, performing recognition on the audio content, thus obtaining text content (S22); in response to a melody recognition operation, performing recognition on melody information in the audio content, thus obtaining melody content (S23); and, on the basis of the text content and melody content, generating subtitles and displaying same on a display interface (S24).

Description

Subtitle display method and device
Cross-Reference to Related Applications
This application is based on Chinese patent application No. 202110876235.7, filed on July 30, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present disclosure relates to the field of computers, and in particular to a subtitle display method, a device, an electronic device, and a computer-readable storage medium.
Background
At present, in the related art, the STT (Speech To Text) subtitle function is very popular among users. STT subtitles make it easy for users to generate subtitle content from audio content. Such subtitles help video works spread widely on the Internet and make it easier for viewers to understand clearly what the video creator has produced and what is said in the audio of the video. However, STT subtitles produced by speech recognition tend to reflect only a single aspect of the audio content.
Summary
The present disclosure provides a subtitle display method, a device, an electronic device, and a computer-readable storage medium.
According to a first aspect of the embodiments of the present disclosure, a subtitle display method is provided, including: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on a display interface.
In some embodiments, generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters and recording the time information of each independent character in the audio content; recognizing the melody information of the audio content to obtain the melody content includes: based on the time information of each independent character in the audio content, selecting the melody information of the part of the audio content corresponding to that time information for recognition, to obtain the independent melody content corresponding to each independent character, wherein the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and, based on each independent character and its corresponding independent melody content, generating subtitles and displaying them on the display interface.
In some embodiments, when the time information includes the start time point and the duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the part of the audio content corresponding to that time information for recognition, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to that start time point and duration; processing that part of the audio content to obtain its spectrum distribution; and obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
In some embodiments, obtaining the independent melody content corresponding to each independent character based on the spectrum distribution includes: when the audio content is music and the independent melody content is a music melody, determining the highest frequency in the spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, wherein the music text information represents the music melody of each independent character.
In some embodiments, the music text information includes at least one of the following: numbered musical notation in numeric form, and staff notation in symbolic form.
In some embodiments, displaying the subtitles on the display interface includes: displaying the melody content above or below the text content.
According to a second aspect of the embodiments of the present disclosure, a subtitle display method is provided, including: playing a video on a display interface, wherein the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
According to a third aspect of the embodiments of the present disclosure, a subtitle display device is provided, including: a first receiving module, configured to receive audio content; a first recognition module, configured to recognize the audio content in response to a subtitle adding operation to obtain text content; a second recognition module, configured to identify melody information of the audio content in response to a melody recognition operation to obtain melody content; and a processing module, configured to generate subtitles based on the text content and the melody content and display them on a display interface.
In some embodiments, the processing module includes a splitting unit and a first processing unit, wherein the splitting unit is configured to split the text content into independent characters and record time information of each independent character in the audio content; the second recognition module is further configured to, based on the time information of each independent character in the audio content, select the part of the audio content corresponding to that time information and identify its melody information, to obtain independent melody content corresponding to each independent character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and the first processing unit is configured to generate subtitles based on each independent character and its corresponding independent melody content and display them on the display interface.
In some embodiments, the second recognition module includes: a selection unit, configured to, where the time information includes a start time point and a duration of each independent character in the audio content, select, based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and the duration; a second processing unit, configured to process the part of the audio content to obtain a spectral distribution of the part of the audio content; and a third processing unit, configured to obtain, based on the spectral distribution, the independent melody content corresponding to each independent character.
In some embodiments, the third processing unit includes: a determining subunit, configured to, where the audio content is music and the independent melody content is a musical melody, determine the highest frequency in the spectral distribution as the main frequency of each independent character; and a conversion subunit, configured to convert the main frequency into music text information, wherein the music text information represents the musical melody of each independent character.
In some embodiments, the music text information includes at least one of the following: numbered musical notation in numeric form, and staff notation in symbolic form.
In some embodiments, the processing module includes: a display unit, configured to display the melody content above or below the text content.
According to a fourth aspect of the embodiments of the present disclosure, a subtitle display device is provided, including: a playing module, configured to play a video on a display interface, wherein the video includes audio content; a second receiving module, configured to receive a subtitle display instruction; and a display module, configured to display subtitles on the display interface in response to the subtitle display instruction, wherein the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
According to a fifth aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement any one of the subtitle display methods described above.
According to a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform any one of the subtitle display methods described above.
According to a seventh aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program which, when executed by a processor, implements any one of the subtitle display methods described above.
In response to the subtitle adding operation and the melody recognition operation, the audio content is recognized to obtain text content and melody content; subtitles are generated based on the text content and the melody content and are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text content cannot convey. This reduces the loss of audio content as far as possible, reflects the audio content more completely, and avoids the situation in which subtitles displayed from audio reflect only a single aspect of the audio content.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification. They show embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure; they do not constitute an improper limitation of the present disclosure.
Fig. 1 is a block diagram of the hardware structure of a computer terminal for implementing a subtitle display method according to an exemplary embodiment.
Fig. 2 is a flowchart of a first subtitle display method according to an exemplary embodiment.
Fig. 3 is a flowchart of a second subtitle display method according to an exemplary embodiment.
Fig. 4 is a flowchart of a subtitle display method according to an implementation of the present disclosure.
Fig. 5 is a block diagram of a first subtitle display device according to an exemplary embodiment.
Fig. 6 is a block diagram of a second subtitle display device according to an exemplary embodiment.
Fig. 7 is a block diagram of a terminal according to an exemplary embodiment.
Fig. 8 is a structural block diagram of a server according to an exemplary embodiment.
Detailed Description
In order to enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second" and the like in the specification and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the present disclosure described herein can be implemented in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
First, some nouns or terms that appear in the description of the embodiments of the present application are explained as follows:
STT subtitles: STT is the abbreviation of Speech To Text. In mobile video editing software, speech recognition technology is used to convert the audio input by the user into text, and the text is then converted into subtitle content and embedded in the video; such subtitles are called STT subtitles.
FFT: FFT is the abbreviation of Fast Fourier Transform. The FFT is a method for quickly computing the discrete Fourier transform (DFT, Discrete Fourier Transform) of a sequence, or its inverse. Fourier analysis converts a signal from its original domain (usually time or space) to a representation in the frequency domain, or vice versa. An FFT computes such transforms quickly by factorizing the DFT matrix into a product of sparse (mostly zero) factors. It thereby reduces the complexity of computing the DFT from O(n^2), which is what evaluating the DFT definition directly requires, to O(n log n), where n is the data size.
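The complexity difference described above can be illustrated with a minimal sketch that is not part of the disclosure. It assumes a power-of-two input length and uses the textbook recursive radix-2 Cooley-Tukey scheme; the function names dft and fft are illustrative only.

```python
import cmath


def dft(x):
    """Evaluate the DFT definition directly: n outputs, each a sum of n terms, so O(n^2)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT, O(n log n). Assumes len(x) is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Two full cycles of a sine wave sampled at 8 points.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
direct = dft(signal)
fast = fft(signal)
```

Both functions return the same coefficients up to rounding error; only the amount of work differs.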
Numbered musical notation: What is generally called simplified notation refers to numbered musical notation, which uses digits to represent the melody of music. Numbered notation is based on the movable-do system: 1, 2, 3, 4, 5, 6 and 7 represent the seven basic degrees of the scale, sung as do, re, mi, fa, sol, la, ti (si in China) and, in the key of C major, corresponding to the letter names C, D, E, F, G, A, B; a rest is represented by 0. The duration of each digit is equivalent to a quarter note in staff notation.
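The digit-to-name correspondence described above can be written as a small lookup table. This is an illustrative sketch only; the table name NUMBERED_NOTATION and the helper letter_names are assumptions, and the letter names hold only when the tonic is C.

```python
# Scale degree -> (solfege syllable, letter name in C major); 0 denotes a rest.
NUMBERED_NOTATION = {
    "0": ("rest", None),
    "1": ("do", "C"),
    "2": ("re", "D"),
    "3": ("mi", "E"),
    "4": ("fa", "F"),
    "5": ("sol", "G"),
    "6": ("la", "A"),
    "7": ("ti", "B"),
}


def letter_names(digits):
    """Translate a string of numbered-notation digits into C-major letter names."""
    return [NUMBERED_NOTATION[d][1] for d in digits]


def solfege(digits):
    """Translate a string of numbered-notation digits into solfege syllables."""
    return [NUMBERED_NOTATION[d][0] for d in digits]


opening = letter_names("1155665")
```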
Music spectrum analysis: Music spectrum analysis is a commonly used algorithm. Principle of the spectrum: according to Fourier analysis, any sound can be decomposed into several, or even infinitely many, sine waves, which in turn often contain numerous harmonic components. Using the FFT (Fast Fourier Transform), a digital signal can be converted from a time-domain signal into a frequency-domain signal, thereby obtaining the spectral characteristics of the music.
According to an embodiment of the present disclosure, a method embodiment of a subtitle display method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
The method embodiments provided by the embodiments of the present disclosure may be executed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 is a block diagram of the hardware structure of a computer terminal (or mobile device) for implementing a subtitle display method according to an exemplary embodiment. As shown in Fig. 1, the computer terminal 10 (or mobile device) may include one or more processors 102 (shown in the figure as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device for communication functions. In addition, it may include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply and/or a camera. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". A data processing circuit may be embodied in whole or in part as software, hardware, firmware or any combination thereof. In addition, the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other elements of the computer terminal 10 (or mobile device). As involved in the embodiments of the present disclosure, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the subtitle display method in the embodiments of the present disclosure. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the subtitle display method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used to receive or send data via a network. Examples of the network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The display may be, for example, a touchscreen liquid crystal display (LCD), which enables a user to interact with the user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some embodiments, the computer device (or mobile device) shown in Fig. 1 may include hardware elements (including circuits), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be pointed out that Fig. 1 is only one specific example, and is intended to show the types of components that may be present in the above computer device (or mobile device).
In the above operating environment, the present disclosure provides the subtitle display method shown in Fig. 2. Fig. 2 is a flowchart of a first subtitle display method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the above computer terminal and includes the following steps S21 to S24.
In step S21, audio content is received;
In step S22, in response to a subtitle adding operation, the audio content is recognized to obtain text content;
In step S23, in response to a melody recognition operation, melody information of the audio content is identified to obtain melody content;
In step S24, subtitles are generated based on the text content and the melody content and displayed on a display interface.
With the above processing, the audio content is recognized in response to the subtitle adding operation and the melody recognition operation to obtain text content and melody content; subtitles are generated based on the text content and the melody content and are then displayed on the display interface. Because the displayed subtitles carry the melody content, they show not only the text content of the audio but also the melody content that the text content cannot convey. This reduces the loss of audio content as far as possible, reflects the audio content more completely, and avoids the situation in which subtitles displayed from audio reflect only a single aspect of the audio content.
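Steps S21 to S24 can be sketched as a small pipeline. The sketch below is illustrative only: the disclosure does not name any recognizer, so the two recognizers are passed in as callables, and the stand-ins fake_text and fake_melody return fixed values purely for demonstration. Display on the interface is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Subtitle:
    text: str    # text content recognized from the audio
    melody: str  # melody content recognized from the same audio


def generate_subtitle(
    audio: bytes,                              # S21: the received audio content
    recognize_text: Callable[[bytes], str],    # hypothetical speech recognizer
    recognize_melody: Callable[[bytes], str],  # hypothetical melody recognizer
) -> Subtitle:
    text = recognize_text(audio)      # S22: audio -> text content
    melody = recognize_melody(audio)  # S23: audio -> melody content
    return Subtitle(text, melody)     # S24: subtitle combining both


def fake_text(audio: bytes) -> str:
    """Stand-in recognizer used only for this demonstration."""
    return "la la la"


def fake_melody(audio: bytes) -> str:
    """Stand-in recognizer used only for this demonstration."""
    return "1 2 3"


subtitle = generate_subtitle(b"\x00\x01", fake_text, fake_melody)
```

Passing the recognizers as parameters keeps the sketch independent of any particular speech or melody recognition software.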
In some embodiments, audio content is received. The audio content may be audio of various types, for example a recording, a song, a video, and so on. The audio content may also be in various formats, for example the MP3 (Moving Picture Experts Group Audio Layer 3) format, the WMA (Windows Media Audio) format, and so on.
In some embodiments, in response to a subtitle adding operation, the audio content is recognized to obtain text content. The subtitle adding operation may be based on an operation on a predetermined control, or may be a system default; for example, the operation may be triggered automatically as soon as audio content is received. It can therefore be set flexibly according to the needs of different scenarios. The audio content may be recognized in various ways; for example, recognition may be implemented with various intelligent speech processing software. In addition, the audio content being recognized may be real-time or non-real-time, depending on requirements.
In some embodiments, in response to a melody recognition operation, melody information of the audio content is identified to obtain melody content. The melody recognition operation may be based on an operation on a melody selection control, or may be a system default. For example, the melody recognition operation and the above subtitle adding operation may be unified into one operation; that is, receiving the subtitle adding operation triggers the melody recognition function, which simplifies the operation flow and avoids a second operation. The audio content may express several kinds of melody information. For example, various kinds of melody information can be analyzed from features of the audio content such as fundamental frequency and pitch, harmonics and timbre, amplitude and loudness, and sound width and frequency band. For instance, where the audio content is a song, the melody of the song can be determined from the frequencies of the audio content, after which staff notation can be generated automatically, or the corresponding numbered notation can be displayed on each subtitle of the song, and so on.
In some embodiments, subtitles are generated based on the text content and the melody content and displayed on the display interface; that is, the subtitles may carry melody content. For example, where the audio content is music, the melody of the music can be shown on the displayed subtitles. The melody can be represented in several ways, for example with 1, 2, 3, 4, 5, 6 and 7 representing the seven basic degrees of the scale, or with C, D, E, F, G, A and B. Where the audio content is music, a single character of the lyrics may be sung to different melodies, which together make up a piece of music. For example, the character "Ah" is used in many pieces of music; although the character is the same, it expresses different melodies, and in this case the melody of the character can be marked above it. Moreover, many pieces played by musical instruments have no lyrics but do have a melody; in this case the audio content can be conveyed by the melody information in the subtitles, so that the subtitles reflect the audio content more completely.
In some embodiments, subtitles are displayed on the display interface, and displaying the subtitles includes displaying the melody content above or below the text content. This allows the subtitles to present more information about the audio content, enriching what the user sees and improving the user experience. Where the user is an editor of the audio content, the user can easily generate STT subtitles that carry the melody information of the music (for example, numbered notation), which makes the subtitles more interesting to watch and more expressive, and greatly improves the user's enthusiasm for editing videos and the quality of works related to the audio content.
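The placement of melody content above or below the text content can be sketched as a two-line layout. This is an illustrative sketch only; the function name render_subtitle and the column-alignment scheme are assumptions, and the width calculation uses character counts, which would need adjusting for full-width characters in a real display.

```python
def render_subtitle(chars, notes, melody_above=True):
    """Lay out each note in the same column as its character, above or below the text."""
    if len(chars) != len(notes):
        raise ValueError("each character needs exactly one note")
    # Each column is as wide as the wider of the character and its note.
    widths = [max(len(c), len(n)) for c, n in zip(chars, notes)]
    text_line = " ".join(c.ljust(w) for c, w in zip(chars, widths)).rstrip()
    note_line = " ".join(n.ljust(w) for n, w in zip(notes, widths)).rstrip()
    lines = [note_line, text_line] if melody_above else [text_line, note_line]
    return "\n".join(lines)


syllables = ["Twin", "kle", "twin", "kle", "lit", "tle", "star"]
degrees = ["1", "1", "5", "5", "6", "6", "5"]
above = render_subtitle(syllables, degrees, melody_above=True)
below = render_subtitle(syllables, degrees, melody_above=False)
```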
In some embodiments, subtitles may be generated based on the text content and the melody content and displayed on the display interface in various ways. For example: the text content is split into independent characters, and time information of each independent character in the audio content is recorded; identifying the melody information of the audio content to obtain the melody content includes, based on the time information of each independent character in the audio content, selecting the part of the audio content corresponding to that time information and identifying its melody information, to obtain independent melody content corresponding to each independent character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and subtitles are generated based on each independent character and its corresponding independent melody content and displayed on the display interface. For example, in a song, each recognized character can be treated as an independent character, and the time information of that independent character in the audio content is recorded. Then, according to the time information of each independent character in the audio content, the part of the audio content corresponding to the time information is selected and its melody information is identified, yielding the independent melody content corresponding to each independent character; the independent melody content of all the independent characters in the text content makes up the melody of the whole song. Matching a melody to each character of the text content one by one, that is, accurately obtaining the independent melody content corresponding to each independent character, makes the melody content of the whole audio content derived from it clearer, and thus presents the audio content more comprehensively.
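The per-character bookkeeping described above can be sketched as follows. This is illustrative only: the input format of (character, start, end) tuples is an assumed recognizer output, and the names TimedChar, split_text, melody_for_text and fake_segment_melody are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class TimedChar:
    char: str        # one independent character of the text content
    start: float     # start time point in the audio, in seconds
    duration: float  # how long the character lasts, in seconds


def split_text(aligned: Iterable[Tuple[str, float, float]]) -> List[TimedChar]:
    """Turn (char, start_s, end_s) tuples into per-character time records."""
    return [TimedChar(c, start, end - start) for c, start, end in aligned]


def melody_for_text(
    chars: List[TimedChar],
    recognize_segment: Callable[[float, float], str],
) -> List[str]:
    """Recognize one independent melody per character; together they form the melody."""
    return [recognize_segment(tc.start, tc.duration) for tc in chars]


def fake_segment_melody(start: float, duration: float) -> str:
    """Stand-in for real melody recognition on the selected audio segment."""
    return "1" if start < 0.5 else "5"


chars = split_text([("Ah", 0.0, 0.5), ("ha", 0.5, 1.25)])
melody = melody_for_text(chars, fake_segment_melody)
pairs = list(zip((tc.char for tc in chars), melody))
```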
In some embodiments, where the time information includes a start time point and a duration of each independent character in the audio content, selecting the part of the audio content corresponding to the time information and identifying its melody information to obtain the independent melody content corresponding to each independent character may be done as follows: based on the start time point and the duration of each independent character in the audio content, the part of the audio content corresponding to the start time point and the duration is selected; the part of the audio content is processed to obtain its spectral distribution; and the independent melody content corresponding to each independent character is obtained based on the spectral distribution. The start time and duration of each character may be recorded in seconds or in a smaller unit of time. Because the selected part of the audio content is determined by the start time point and duration of an independent character, the independent melody content obtained from it corresponds to that independent character. The part of the audio content may be processed to obtain its spectral distribution in various ways; for example, a fast Fourier transform may be performed on it. Specifically, the audio signal corresponding to each independent character is determined first: the start time and duration of each independent character in the text content are taken, the audio signal of the original audio file within that time period is obtained from them, and that audio signal is used as the input of the fast Fourier transform algorithm, which identifies the spectral distribution of the original audio file in that time period. The independent melody content corresponding to each independent character is then obtained from the spectral distribution.
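Selecting a segment by start time and duration and computing its spectrum can be sketched as below. This is an illustrative sketch under stated assumptions: the audio is a list of mono samples at a known sample rate, and the spectrum is computed by evaluating the DFT definition directly so that the example needs no third-party library; an FFT would give the same spectrum faster. The names select_segment and magnitude_spectrum are hypothetical.

```python
import cmath
import math


def select_segment(samples, sample_rate, start_s, duration_s):
    """Return the samples covering [start_s, start_s + duration_s)."""
    first = round(start_s * sample_rate)
    last = first + round(duration_s * sample_rate)
    return samples[first:last]


def magnitude_spectrum(segment, sample_rate):
    """Return (frequency_hz, magnitude) pairs for bins 0 .. n/2 of the segment's DFT."""
    n = len(segment)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            segment[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append((k * sample_rate / n, abs(coeff)))
    return spectrum


SAMPLE_RATE = 2000  # Hz, kept small so the direct DFT stays fast
# Synthetic test audio: 0.5 s at 220 Hz followed by 0.5 s at 440 Hz.
samples = [
    math.sin(2 * math.pi * (220 if t < SAMPLE_RATE // 2 else 440) * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]

# A character that starts at 0.6 s and lasts 0.1 s falls in the 440 Hz half.
segment = select_segment(samples, SAMPLE_RATE, start_s=0.6, duration_s=0.1)
spectrum = magnitude_spectrum(segment, SAMPLE_RATE)
peak_hz = max(spectrum, key=lambda pair: pair[1])[0]
```

With a 0.1 s segment the frequency resolution is 10 Hz, so shorter characters give a coarser pitch estimate.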
In some embodiments, the independent melody content corresponding to each independent character may be obtained from the spectral distribution in various ways. For example, where the audio content is music and the independent melody content is a musical melody, the highest frequency in the spectral distribution is determined to be the main frequency of each independent character, and the main frequency is converted into music text information, where the music text information represents the musical melody of each independent character. The music text information may take various forms, for example numbered musical notation in numeric form or staff notation in symbolic form. Taking the strongest frequency in the spectral distribution as the main frequency at the time point corresponding to each independent character reflects the audio characteristics of the audio content more accurately than other representations.
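The conversion from a spectral distribution to numbered notation might look like the sketch below. It reads the main frequency as the strongest (largest-magnitude) bin, and it makes several assumptions that the text does not fix: equal temperament with A4 = 440 Hz, a major key whose tonic is given as a MIDI note number (60 = C4 by default), `'` and `,` as stand-ins for the octave dots above and below a digit, and `0` for a rest. The names `dominant_frequency` and `to_numbered_notation` are hypothetical.

```python
import math

# Semitone offset from the tonic -> numbered-notation degree (assumes a major key).
DEGREES = ["1", "#1", "2", "#2", "3", "4", "#4", "5", "#5", "6", "#6", "7"]


def dominant_frequency(spectrum):
    """Return the frequency of the strongest bin, ignoring the 0 Hz (DC) bin."""
    candidates = [(mag, freq) for freq, mag in spectrum if freq > 0]
    if not candidates:
        return None
    return max(candidates)[1]


def to_numbered_notation(freq_hz, tonic_midi=60):
    """Map a frequency to a numbered-notation token relative to tonic_midi."""
    if freq_hz is None or freq_hz <= 0:
        return "0"  # treated as a rest in this sketch
    # Nearest equal-tempered note, expressed as a MIDI note number (A4 = 69).
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    octave, semitone = divmod(midi - tonic_midi, 12)
    mark = "'" * octave if octave > 0 else "," * (-octave)
    return DEGREES[semitone] + mark
```

With the default tonic, 440 Hz maps to `6` and 523.25 Hz maps to `1'`. A real recording often has harmonics stronger than the fundamental, so the strongest bin is only a rough estimate of the sung pitch.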
Fig. 3 is a flowchart of a second subtitle display method according to an exemplary embodiment. As shown in Fig. 3, the method is used in the above computer terminal and includes the following steps S31 to S33.
In step S31, a video is played on a display interface, where the video includes audio content;
In step S32, a subtitle display instruction is received;
In step S33, in response to the subtitle display instruction, subtitles are displayed on the display interface, where the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
With the above processing, video content that includes audio content is displayed on the display interface, a subtitle display instruction is received and responded to, and subtitles generated from the text content and the melody content are displayed on the display interface, where both the text content and the melody content are obtained by recognizing the audio content. The subtitles therefore show not only the text content of the audio but also the melody content that the text content cannot convey. This reduces the loss of audio content as far as possible, presents the audio content more completely, and avoids the problem that subtitles derived from audio convey only a single aspect of the audio content.
In the related art, the STT subtitles produced by the speech recognition function of mobile video editing software carry less information than the audio content itself: information in the audio other than the words is lost during speech recognition. For example, when a user adds a piece of song content, the STT subtitles can express only the words, and information such as the musical melody cannot be expressed in them, even though the melody is itself part of the information in the audio. In view of this, an embodiment of the present disclosure provides a subtitle display method in which the musical melody information of the audio content is expressed in the subtitles at the same time as the STT subtitles are generated.
For example, mobile video editing software recognizes the content a user sings and adds numbered musical notation to the displayed subtitles. The method uses a spectrum recognition algorithm to add the melody in the audio to the STT subtitles in the form of numbered musical notation, which makes the STT subtitles more expressive and more engaging and can help video works spread more widely.
Fig. 4 is a flowchart of a subtitle display method according to an embodiment of the present disclosure. As shown in Fig. 4, the method is described in detail below for a scenario in which a user edits a video that involves audio content:
1) The user imports a piece of audio content with mobile video editing software.
2) When the user chooses to add STT subtitles, the user is asked whether the audio melody should be recognized and added to the subtitles. If the user chooses not to use this function, the STT subtitles are added directly. If the user selects this function, proceed to step 3).
3) The audio content is recognized as text by speech recognition. During recognition, the start time and the duration (in seconds) of each character in the audio are recorded. The text, the start time of each character, and the duration of each character are saved as JSON text in the following form:
Figure PCTCN2022076656-appb-000001
Figure PCTCN2022076656-appb-000002
It should be noted that, in this JSON, each recognized character is an element of an array, and each element also records the start time (start_time) and the duration (duration) of that character. The melody field represents the melody at the time point of the character; it is computed in the processing described below.
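The JSON listing itself is only available as a figure, so the following is a hypothetical reconstruction of its shape from the description. The field names `start_time`, `duration`, and `melody` come from the text; the field name `text`, the choice of an array as the root node, and the empty string as the initial `melody` value are assumptions of this sketch.

```python
import json


def build_word_records(words):
    """Serialize (character, start_time_s, duration_s) tuples as a JSON array.

    The "melody" field is left empty here; it is filled in by the later
    spectrum-analysis step.
    """
    records = [
        {"text": text, "start_time": start, "duration": duration, "melody": ""}
        for text, start, duration in words
    ]
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)
```

For example, `build_word_records([("你", 0.0, 0.4), ("好", 0.4, 0.6)])` produces a two-element array with one object per character.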
4) Traverse the array at the root node of the JSON and take out the start time and duration of each element (each character). Using the start time and duration, obtain all sound signals in that time period from the original audio file and use them as the input of the FFT algorithm, which identifies the spectral distribution of the original audio file over that time period. The strongest frequency in the spectral distribution of that time period is then taken as the main frequency at that time point, and the main frequency is recorded in the JSON in numbered musical notation, in the melody field. After step 4), the content of the JSON becomes the following:
Figure PCTCN2022076656-appb-000003
Figure PCTCN2022076656-appb-000004
Figure PCTCN2022076656-appb-000005
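The traversal in step 4) can be sketched as below, assuming the JSON root is an array of per-character objects with `start_time`, `duration`, and `melody` fields. The spectrum analysis is passed in as a callable, `melody_for_span`, which is a hypothetical hook standing for "slice the audio, run the FFT, convert the main frequency to numbered notation"; it is not an interface defined by the disclosure.

```python
import json


def annotate_melody(stt_json, melody_for_span):
    """Fill the "melody" field of every record in an STT JSON array.

    melody_for_span(start_time, duration) must return the notation token
    for that time period of the original audio.
    """
    records = json.loads(stt_json)
    for record in records:
        record["melody"] = melody_for_span(record["start_time"], record["duration"])
    return json.dumps(records, ensure_ascii=False)
```

Keeping the analysis behind a callable lets the traversal be exercised with a stub and leaves the choice of FFT library open.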
5) When the video editing software adds the STT subtitles, it adds a line of numbered musical notation above them. In this way, STT subtitles with numbered musical notation are generated.
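As a rough illustration of step 5), the sketch below lays out one notation token above each character in a two-line, monospaced text rendering. An actual editor would position the notation graphically; the text layout, the record field names `text` and `melody`, and the helper names are assumptions of this sketch. East Asian wide characters are counted as two columns so that the two lines stay aligned.

```python
import unicodedata


def display_width(s):
    """Column width of s in a monospaced font, counting wide characters as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s)


def pad(s, width):
    """Right-pad s with spaces up to the given display width."""
    return s + " " * (width - display_width(s))


def render_subtitle(records):
    """Return two lines: notation tokens on top, the characters underneath."""
    widths = [max(display_width(r["text"]), display_width(r["melody"])) for r in records]
    melody_line = " ".join(pad(r["melody"], w) for r, w in zip(records, widths))
    text_line = " ".join(pad(r["text"], w) for r, w in zip(records, widths))
    return melody_line.rstrip() + "\n" + text_line.rstrip()
```

Swapping the two lines in the return value gives the variant in which the melody content is displayed below the text content.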
In the embodiments of the present disclosure:
(1) users of video editing software can easily generate STT subtitles carrying numbered musical notation with mobile video editing software, which makes video works more engaging;
(2) the STT subtitles become more expressive, which encourages users to edit videos and improves the quality of the edited works.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are exemplary, and that the actions and modules involved are not necessarily required by the present disclosure.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present disclosure, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present disclosure.
According to an embodiment of the present disclosure, an apparatus for implementing the above first subtitle display method is also provided. Fig. 5 is a block diagram of a first subtitle display apparatus according to an exemplary embodiment. Referring to Fig. 5, the apparatus includes a first receiving module 502, a first recognition module 504, a second recognition module 506, and a processing module 508, which are described below.
The first receiving module 502 is configured to receive audio content. The first recognition module 504 is connected to the first receiving module 502 and is configured to recognize the audio content in response to a subtitle adding operation to obtain text content. The second recognition module 506 is connected to the first recognition module 504 and is configured to identify melody information of the audio content in response to a melody recognition operation to obtain melody content. The processing module 508 is connected to the second recognition module 506 and is configured to generate subtitles based on the text content and the melody content and display them on a display interface.
In some embodiments, the processing module 508 includes a splitting unit and a first processing unit. The splitting unit is configured to split the text content into independent characters and record time information of each independent character in the audio content. The second recognition module is further configured to select, based on the time information of each independent character in the audio content, the melody information of the portion of the audio content corresponding to the time information for identification, to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content. The first processing unit is configured to generate subtitles based on each independent character and its corresponding independent melody content and display them on the display interface.
In some embodiments, the processing module 508 further includes a display unit configured to display the melody content above or below the text content.
In some embodiments, the second recognition module 506 includes: a selection unit configured to, where the time information includes the start time point and the duration of each independent character in the audio content, select the portion of the audio content corresponding to the start time point and duration of each independent character; a second processing unit configured to process the portion of the audio content to obtain its spectral distribution; and a third processing unit configured to obtain, based on the spectral distribution, the independent melody content corresponding to each independent character.
In some embodiments, the third processing unit includes: a determining subunit configured to, where the audio content is music and the independent melody content is a musical melody, determine the highest frequency in the spectral distribution to be the main frequency of each independent character; and a conversion subunit configured to convert the main frequency into music text information, where the music text information represents the musical melody of each independent character.
It should be noted here that the first receiving module 502, the first recognition module 504, the second recognition module 506, and the processing module 508 correspond to steps S21 to S24 in the above embodiment. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiment. It should also be noted that these modules, as part of the apparatus, may run in the computer terminal 10 provided in the embodiment.
According to an embodiment of the present disclosure, an apparatus for implementing the above second subtitle display method is also provided. Fig. 6 is a block diagram of a second subtitle display apparatus according to an exemplary embodiment. Referring to Fig. 6, the apparatus includes a playing module 602, a second receiving module 604, and a display module 606, which are described below.
The playing module 602 is configured to play a video on a display interface, where the video includes audio content. The second receiving module 604 is connected to the playing module 602 and is configured to receive a subtitle display instruction. The display module 606 is connected to the second receiving module 604 and is configured to display subtitles on the display interface in response to the subtitle display instruction, where the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
It should be noted here that the playing module 602, the second receiving module 604, and the display module 606 correspond to steps S31 to S33 in the above embodiment. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiment. It should also be noted that these modules, as part of the apparatus, may run in the computer terminal 10 provided in the embodiment.
As for the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Embodiments of the present disclosure may provide an electronic device, which may be a terminal or a server. For example, where the electronic device is a terminal, the terminal may be any computer terminal device in a group of computer terminals. In some embodiments, the terminal may also be a terminal device such as a mobile terminal.
In some embodiments, the terminal may be located in at least one of a plurality of network devices of a computer network.
In some embodiments, Fig. 7 is a structural block diagram of a terminal according to an exemplary embodiment. As shown in Fig. 7, the terminal may include one or more processors 71 (only one is shown in the figure) and a memory 72 for storing instructions executable by the processor, where the processor is configured to execute the instructions to implement any one of the above subtitle display methods.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the subtitle display method and apparatus in the embodiments of the present disclosure. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the above subtitle display method. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call, through a transmission device, the information and application programs stored in the memory to perform the following steps: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, identifying melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on a display interface.
In some embodiments, the processor may further execute program code for the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes splitting the text content into independent characters and recording time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content includes selecting, based on the time information of each independent character in the audio content, the melody information of the portion of the audio content corresponding to the time information for identification, to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and generating subtitles based on each independent character and its corresponding independent melody content and displaying them on the display interface.
In some embodiments, the processor may further execute program code for the following steps: where the time information includes the start time point and the duration of each independent character in the audio content, selecting the melody information of the portion of the audio content corresponding to the time information for identification, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the portion of the audio content corresponding to that start time point and duration; processing the portion of the audio content to obtain its spectral distribution; and obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character.
In some embodiments, the processor may further execute program code for the following steps: obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character includes: where the audio content is music and the independent melody content is a musical melody, determining the highest frequency in the spectral distribution to be the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.
In some embodiments, the processor may further execute program code in which the music text information includes at least one of the following: numbered musical notation in numeric form, and staff notation in symbolic form.
In some embodiments, the processor may further execute program code for the following step: displaying the subtitles on the display interface includes displaying the melody content above or below the text content.
The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: playing a video on a display interface, where the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, where the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
As noted above, the electronic device may also be a server, and an embodiment of the present disclosure provides a server. Fig. 8 is a structural block diagram of a server according to an exemplary embodiment. As shown in Fig. 8, the server 17 may include one or more processing components 81 (only one is shown in the figure), a memory 82 for storing instructions executable by the processing component 81, a power supply component 83 that supplies power, a network interface 84 for communicating with an external network, and an I/O interface 85 for exchanging data with external devices, where the processing component 81 is configured to execute the instructions to implement any one of the above subtitle display methods.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the subtitle display method and apparatus in the embodiments of the present disclosure. By running the software programs and modules stored in the memory, the processing component executes various functional applications and data processing, that is, implements the above subtitle display method. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processing component, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processing component may call, through a transmission device, the information and application programs stored in the memory to perform the following steps: receiving audio content; in response to a subtitle adding operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, identifying melody information of the audio content to obtain melody content; and, based on the text content and the melody content, generating subtitles and displaying them on a display interface.
In some embodiments, the processing component may further execute program code for the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes splitting the text content into independent characters and recording time information of each independent character in the audio content; identifying the melody information of the audio content to obtain the melody content includes selecting, based on the time information of each independent character in the audio content, the melody information of the portion of the audio content corresponding to the time information for identification, to obtain the independent melody content corresponding to each independent character, where the independent melody contents corresponding to the independent characters constitute the melody content corresponding to the text content; and generating subtitles based on each independent character and its corresponding independent melody content and displaying them on the display interface.
In some embodiments, the processing component may further execute program code for the following steps: where the time information includes the start time point and the duration of each independent character in the audio content, selecting the melody information of the portion of the audio content corresponding to the time information for identification, to obtain the independent melody content corresponding to each independent character, includes: selecting, based on the start time point and the duration of each independent character in the audio content, the portion of the audio content corresponding to that start time point and duration; processing the portion of the audio content to obtain its spectral distribution; and obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character.
In some embodiments, the processing component may further execute program code for the following steps: obtaining, based on the spectral distribution, the independent melody content corresponding to each independent character includes: where the audio content is music and the independent melody content is a musical melody, determining the highest frequency in the spectral distribution to be the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.
In some embodiments, the processing component may further execute program code in which the music text information includes at least one of the following: numbered musical notation in numeric form, and staff notation in symbolic form.
In some embodiments, the processing component may further execute program code for the following step: displaying the subtitles on the display interface includes displaying the melody content above or below the text content.
The processing component may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: playing a video on a display interface, where the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, where the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by identifying melody information of the audio content.
Those of ordinary skill in the art will understand that the structures shown in FIG. 7 and FIG. 8 are merely illustrative. For example, the terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 7 and FIG. 8 do not limit the structure of the electronic device. For example, the device may include more or fewer components than shown in FIG. 7 and FIG. 8 (such as a network interface or a display device), or may have a configuration different from that shown in FIG. 7 and FIG. 8.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, which may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
In an exemplary embodiment, a computer-readable storage medium including instructions is further provided. When the instructions in the computer-readable storage medium are executed by a processor of a terminal, the terminal is enabled to perform any of the subtitle display methods described above. In some embodiments, the computer-readable storage medium may be a non-transitory computer-readable storage medium, for example a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
In some embodiments, the computer-readable storage medium may be used to store the program code executed by the subtitle display method provided in the above embodiments.
In some embodiments, the computer-readable storage medium may be located in any computer terminal of a group of computer terminals in a computer network, or in any mobile terminal of a group of mobile terminals.
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps: receiving audio content; in response to a subtitle addition operation, recognizing the audio content to obtain text content; in response to a melody recognition operation, recognizing the melody information of the audio content to obtain melody content; and generating subtitles based on the text content and the melody content and displaying them on a display interface.
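The four steps above can be sketched as a small pipeline. The source does not name any recognition engine, so the two recognizers are passed in as callables; `Subtitle`, `build_subtitle`, and the boolean flags that stand in for the subtitle addition operation and the melody recognition operation are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Subtitle:
    text: str    # text content recognized from the audio
    melody: str  # melody content recognized from the same audio


def build_subtitle(
    audio: Sequence[float],
    recognize_text: Callable[[Sequence[float]], str],
    recognize_melody: Callable[[Sequence[float]], str],
    add_subtitles: bool = True,
    add_melody: bool = True,
) -> Subtitle:
    """Run only the recognizers the user asked for and bundle the results.

    Assumption: the two user operations are modeled as boolean flags; the
    recognizers are placeholders supplied by the caller.
    """
    text = recognize_text(audio) if add_subtitles else ""
    melody = recognize_melody(audio) if add_melody else ""
    return Subtitle(text=text, melody=melody)
```

Keeping the recognizers injectable means the same flow works whether recognition runs on the terminal or on a server, a choice the source leaves open.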
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps: generating subtitles based on the text content and the melody content and displaying them on the display interface includes: splitting the text content into independent characters, and recording the time information of each independent character in the audio content; recognizing the melody information of the audio content to obtain the melody content includes: for each independent character, selecting, based on the time information of that character in the audio content, the melody information of the portion of the audio content corresponding to the time information, and recognizing it to obtain the independent melody content corresponding to that character, where the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and generating subtitles based on each independent character and its corresponding independent melody content and displaying them on the display interface.
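The splitting step amounts to pairing every character with its position in the audio. A minimal sketch follows, assuming the per-character timings are already available from the text recognizer (the source does not say how they are produced); `TimedChar` and `split_with_timing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TimedChar:
    char: str
    start: float     # seconds from the beginning of the audio content
    duration: float  # seconds


def split_with_timing(
    text: str, timings: Sequence[Tuple[float, float]]
) -> List[TimedChar]:
    """Pair each non-space character of `text` with its (start, duration).

    Assumption: `timings` holds one entry per non-space character, in order.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) != len(timings):
        raise ValueError("need one (start, duration) pair per character")
    return [TimedChar(c, s, d) for c, (s, d) in zip(chars, timings)]
```

Each `TimedChar` then carries exactly the information needed to cut out the matching portion of the audio for melody recognition.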
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps: when the time information includes the start time point and the duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the portion of the audio content corresponding to the time information and recognizing it to obtain the independent melody content corresponding to each independent character includes: selecting, based on the start time point and the duration of each independent character in the audio content, the portion of the audio content corresponding to the start time point and the duration; processing the portion of the audio content to obtain the spectrum distribution of the portion of the audio content; and obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
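The selection and spectrum steps can be illustrated with a short standard-library sketch. Several things here are assumptions rather than details from the source: the audio is taken to be a mono list of PCM samples, the spectrum is computed with a naive discrete Fourier transform (a production system would more likely use an FFT), and `peak_frequency` picks the bin with the largest magnitude, which is only one possible reading of how a main frequency is derived from the spectrum distribution. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def select_segment(samples: Sequence[float], sample_rate: int,
                   start: float, duration: float) -> Sequence[float]:
    """Return the samples covering [start, start + duration) in seconds."""
    first = int(round(start * sample_rate))
    last = int(round((start + duration) * sample_rate))
    return samples[first:last]


def spectrum(segment: Sequence[float],
             sample_rate: int) -> List[Tuple[float, float]]:
    """Naive O(n^2) DFT: (frequency_hz, magnitude) for bins 0..n/2."""
    n = len(segment)
    if n == 0:
        raise ValueError("segment is empty")
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        bins.append((k * sample_rate / n, abs(acc)))
    return bins


def peak_frequency(bins: List[Tuple[float, float]]) -> float:
    """Frequency of the strongest bin (an assumed selection rule)."""
    return max(bins, key=lambda b: b[1])[0]
```

The frequency resolution is `sample_rate / n`, so a short character duration yields coarse bins; that trade-off is inherent to per-character analysis and is not addressed in the source.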
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps: obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character includes: when the audio content is music and the independent melody content is a musical melody, determining the highest frequency in the spectrum distribution as the main frequency of each independent character; and converting the main frequency into music text information, where the music text information represents the musical melody of each independent character.
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following step: the music text information includes at least one of the following: numbered musical notation (jianpu) in numeric form, and staff notation in symbolic form.
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following step: displaying the subtitles on the display interface includes: displaying the melody content above or below the text content.
In some embodiments, the computer-readable storage medium is configured to store program code for performing the following steps: playing a video on the display interface, where the video includes audio content; receiving a subtitle display instruction; and, in response to the subtitle display instruction, displaying subtitles on the display interface, where the subtitles include text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing the melody information of the audio content.
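On the playback side, the behavior can be pictured as a player that starts drawing cues only after it receives the subtitle display instruction. The sketch below assumes the text and melody content have already been recognized and stored as timed cues; `Cue`, `SubtitlePlayer`, and their methods are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str     # text content for this span of audio
    melody: str   # melody content for the same span


@dataclass
class SubtitlePlayer:
    cues: List[Cue] = field(default_factory=list)
    subtitles_on: bool = False  # set by the subtitle display instruction

    def on_display_instruction(self) -> None:
        """Handle the subtitle display instruction."""
        self.subtitles_on = True

    def cue_at(self, position: float) -> Optional[Cue]:
        """Cue to draw at the given playback position, or None."""
        if not self.subtitles_on:
            return None
        for cue in self.cues:
            if cue.start <= position < cue.end:
                return cue
        return None
```

Before the instruction arrives `cue_at` returns `None`, so the video plays without subtitles; afterwards each returned cue carries both the text and its melody for display.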
In an exemplary embodiment, a computer program product is further provided. When the computer program in the computer program product is executed by a processor of a terminal, the terminal is enabled to perform any of the subtitle display methods described above.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present disclosure. The aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
All embodiments of the present disclosure may be carried out alone or in combination with other embodiments, and all such cases are regarded as falling within the scope of protection claimed by the present disclosure.

Claims (17)

  1. A subtitle display method, comprising:
    receiving audio content;
    in response to a subtitle addition operation, recognizing the audio content to obtain text content;
    in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and
    generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface.
  2. The method according to claim 1, wherein generating the subtitles based on the text content and the melody content and displaying the subtitles on the display interface comprises:
    splitting the text content into independent characters, and recording time information of each independent character in the audio content;
    wherein recognizing the melody information of the audio content to obtain the melody content comprises: for each independent character, selecting, based on the time information of that character in the audio content, the melody information of the portion of the audio content corresponding to the time information, and recognizing it to obtain independent melody content corresponding to that character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and
    generating the subtitles based on each independent character and its corresponding independent melody content, and displaying the subtitles on the display interface.
  3. The method according to claim 2, wherein, when the time information comprises a start time point and a duration of each independent character in the audio content, selecting, based on the time information of each independent character in the audio content, the melody information of the portion of the audio content corresponding to the time information and recognizing it to obtain the independent melody content corresponding to each independent character comprises:
    selecting, based on the start time point and the duration of each independent character in the audio content, the portion of the audio content corresponding to the start time point and the duration;
    processing the portion of the audio content to obtain a spectrum distribution of the portion of the audio content; and
    obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  4. The method according to claim 3, wherein obtaining, based on the spectrum distribution, the independent melody content corresponding to each independent character comprises:
    when the audio content is music and the independent melody content is a musical melody, determining the highest frequency in the spectrum distribution as a main frequency of each independent character; and
    converting the main frequency into music text information, wherein the music text information represents the musical melody of each independent character.
  5. The method according to claim 4, wherein the music text information comprises at least one of the following:
    numbered musical notation in numeric form, and staff notation in symbolic form.
  6. The method according to any one of claims 1 to 5, wherein displaying the subtitles on the display interface comprises:
    displaying the melody content above or below the text content.
  7. A subtitle display method, comprising:
    playing a video on a display interface, wherein the video includes audio content;
    receiving a subtitle display instruction; and
    in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles comprise text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
  8. A subtitle display device, comprising:
    a first receiving module, configured to receive audio content;
    a first recognition module, configured to recognize the audio content in response to a subtitle addition operation to obtain text content;
    a second recognition module, configured to recognize melody information of the audio content in response to a melody recognition operation to obtain melody content; and
    a processing module, configured to generate subtitles based on the text content and the melody content and display the subtitles on a display interface.
  9. The device according to claim 8, wherein the processing module comprises a splitting unit and a first processing unit, wherein
    the splitting unit is configured to split the text content into independent characters and record time information of each independent character in the audio content;
    the second recognition module is further configured to, for each independent character, select, based on the time information of that character in the audio content, the melody information of the portion of the audio content corresponding to the time information, and recognize it to obtain independent melody content corresponding to that character, wherein the independent melody content corresponding to the independent characters constitutes the melody content corresponding to the text content; and
    the first processing unit is configured to generate the subtitles based on each independent character and its corresponding independent melody content and display the subtitles on the display interface.
  10. The device according to claim 9, wherein the second recognition module comprises:
    a selection unit, configured to, when the time information comprises a start time point and a duration of each independent character in the audio content, select, based on the start time point and the duration of each independent character in the audio content, the portion of the audio content corresponding to the start time point and the duration;
    a second processing unit, configured to process the portion of the audio content to obtain a spectrum distribution of the portion of the audio content; and
    a third processing unit, configured to obtain, based on the spectrum distribution, the independent melody content corresponding to each independent character.
  11. The device according to claim 10, wherein the third processing unit comprises:
    a determining subunit, configured to, when the audio content is music and the independent melody content is a musical melody, determine the highest frequency in the spectrum distribution as a main frequency of each independent character; and
    a conversion subunit, configured to convert the main frequency into music text information, wherein the music text information represents the musical melody of each independent character.
  12. The device according to claim 11, wherein the music text information comprises at least one of the following:
    numbered musical notation in numeric form, and staff notation in symbolic form.
  13. The device according to any one of claims 8 to 12, wherein the processing module comprises:
    a display unit, configured to display the melody content above or below the text content.
  14. A subtitle display device, comprising:
    a playback module, configured to play a video on a display interface, wherein the video includes audio content;
    a second receiving module, configured to receive a subtitle display instruction; and
    a display module, configured to display subtitles on the display interface in response to the subtitle display instruction, wherein the subtitles comprise text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
  15. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the following steps:
    receiving audio content;
    in response to a subtitle addition operation, recognizing the audio content to obtain text content;
    in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and
    generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;
    or to implement the following steps:
    playing a video on a display interface, wherein the video includes audio content;
    receiving a subtitle display instruction; and
    in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles comprise text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
  16. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:
    receiving audio content;
    in response to a subtitle addition operation, recognizing the audio content to obtain text content;
    in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and
    generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;
    or to perform the following steps:
    playing a video on a display interface, wherein the video includes audio content;
    receiving a subtitle display instruction; and
    in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles comprise text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
  17. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    receiving audio content;
    in response to a subtitle addition operation, recognizing the audio content to obtain text content;
    in response to a melody recognition operation, recognizing melody information of the audio content to obtain melody content; and
    generating subtitles based on the text content and the melody content, and displaying the subtitles on a display interface;
    or implements the following steps:
    playing a video on a display interface, wherein the video includes audio content;
    receiving a subtitle display instruction; and
    in response to the subtitle display instruction, displaying subtitles on the display interface, wherein the subtitles comprise text content and melody content, the text content is obtained by recognizing the audio content, and the melody content is obtained by recognizing melody information of the audio content.
PCT/CN2022/076656 2021-07-30 2022-02-17 Subtitle display method and device WO2023005193A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110876235.7A CN113781988A (en) 2021-07-30 2021-07-30 Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium
CN202110876235.7 2021-07-30

Publications (1)

Publication Number Publication Date
WO2023005193A1 true WO2023005193A1 (en) 2023-02-02

Family

ID=78836292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076656 WO2023005193A1 (en) 2021-07-30 2022-02-17 Subtitle display method and device

Country Status (2)

Country Link
CN (1) CN113781988A (en)
WO (1) WO2023005193A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781988A (en) * 2021-07-30 2021-12-10 北京达佳互联信息技术有限公司 Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719366A (en) * 2009-12-16 2010-06-02 德恩资讯股份有限公司 Method for editing and displaying musical notes and music marks and accompanying video system
CN102820027A (en) * 2012-06-21 2012-12-12 福建星网视易信息系统有限公司 Accompaniment subtitle display system and method
CN105609106A (en) * 2015-12-16 2016-05-25 魅族科技(中国)有限公司 Event recording document generation method and apparatus
US20170092274A1 (en) * 2015-09-24 2017-03-30 Otojoy LLC Captioning system and/or method
CN108289244A (en) * 2017-12-28 2018-07-17 努比亚技术有限公司 Video caption processing method, mobile terminal and computer readable storage medium
CN113781988A (en) * 2021-07-30 2021-12-10 北京达佳互联信息技术有限公司 Subtitle display method, subtitle display device, electronic equipment and computer-readable storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5334716B2 (en) * 2009-07-03 2013-11-06 日本放送協会 Character information presentation control device and program
KR101325722B1 (en) * 2012-02-16 2013-11-08 김태민 Apparatus for generating musical note fit in user's song and method for the same
WO2014088036A1 (en) * 2012-12-04 2014-06-12 独立行政法人産業技術総合研究所 Singing voice synthesizing system and singing voice synthesizing method
US20180366097A1 (en) * 2017-06-14 2018-12-20 Kent E. Lovelace Method and system for automatically generating lyrics of a song
CN107316642A (en) * 2017-06-30 2017-11-03 联想(北京)有限公司 Video file method for recording, audio file method for recording and mobile terminal
KR102523135B1 (en) * 2018-01-09 2023-04-21 삼성전자주식회사 Electronic Device and the Method for Editing Caption by the Device
CN111326164B (en) * 2020-01-21 2023-03-21 大连海事大学 Semi-supervised music theme extraction method
CN112669811B (en) * 2020-12-23 2024-02-23 腾讯音乐娱乐科技(深圳)有限公司 Song processing method and device, electronic equipment and readable storage medium
CN113112969B (en) * 2021-03-23 2024-04-05 平安科技(深圳)有限公司 Buddhism music notation method, device, equipment and medium based on neural network

Also Published As

Publication number Publication date
CN113781988A (en) 2021-12-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22847813

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE