CN112684967A - Method for displaying subtitles and electronic equipment


Info

Publication number
CN112684967A
CN112684967A
Authority
CN
China
Prior art keywords
language
user
subtitle
terminal
voice
Prior art date
Legal status
Pending
Application number
CN202110264894.5A
Other languages
Chinese (zh)
Inventor
谭泳发 (Tan Yongfa)
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202110264894.5A
Publication of CN112684967A
Legal status: Pending

Landscapes

  • User Interface Of Digital Computer (AREA)
  • Telephone Function (AREA)

Abstract

The application provides a method for displaying subtitles and an electronic device, which can simplify the operations a user performs to use a subtitle function and make enabling and using that function more efficient. The method can be applied to scenes such as watching videos, recording videos, playing audio and recording audio, and relates to speech recognition technology in the field of artificial intelligence. The method is applied to a first terminal and comprises the following steps: detecting whether the first terminal has sound input or sound output; in the case that sound input or sound output is detected, detecting whether the current sound includes a human voice; in the case that the current sound is detected to include a human voice, detecting a first language to which the current human voice belongs; displaying a first interface, wherein the first interface is used for prompting a user to enable the subtitle function; receiving a first operation, input by the user, for enabling the subtitle function; and displaying a subtitle after receiving the first operation, wherein the subtitle is obtained by converting the human voice in the first language.

Description

Method for displaying subtitles and electronic equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a method and an electronic device for displaying subtitles.
Background
At present, a user can play videos and listen to audio using a smart terminal such as a mobile phone. Taking the scene of a user watching a video on a mobile phone as an example, the user may be unable to hear clearly, or to understand, what the people in the video are saying, because of environmental factors or personal reasons.
In some schemes, when a user needs to use a real-time subtitle function, the user must find the corresponding menu or option and then locate and turn on a real-time subtitle switch within it. Typically, these menus or options are hidden, and the user has to search repeatedly to locate the real-time subtitle switch. For instance, in one example, the real-time subtitle switch is provided in a pull-down menu. When the user needs the real-time subtitle function, the user first has to call up the pull-down menu with a slide-down operation. There may be multiple pull-down menus, each containing many selection items, so the user must locate the real-time subtitle switch among the many items of the various menus and then turn it on; searching for, finding and turning on the switch takes a long time. Moreover, the user needs to switch repeatedly among multiple pull-down menus, which makes the operation cumbersome.
Disclosure of Invention
The embodiment of the application provides a method for displaying subtitles and an electronic device, which can improve the human-computer interaction efficiency of the electronic device.
To achieve the above purpose, the following technical solutions are adopted:
In a first aspect, a method for subtitle display is provided. The method is applied to a first terminal, or to a component or device capable of supporting the functions of the first terminal (such as a chip of the first terminal, a server, etc.). The method comprises the following steps: detecting whether the first terminal has sound input or sound output; in the case that sound input or sound output is detected, detecting whether the current sound includes a human voice; in the case that the current sound includes a human voice, detecting a first language to which the current human voice belongs; displaying a first interface, wherein the first interface is used for prompting a user to enable a subtitle function; receiving a first operation, input by the user, for enabling the subtitle function; and displaying a subtitle after receiving the first operation, wherein the subtitle is obtained by converting the human voice in the first language.
Therefore, according to the technical solution of the embodiment of the application, when a condition for displaying subtitles is detected, for example when the current audio includes a human voice, the user can be prompted to enable the subtitle function, and the subtitle can be displayed once the first operation for enabling the subtitle function, input by the user, is detected. That is to say, the user neither needs to locate a real-time subtitle switch among the many selection items of multiple pull-down menus, nor needs to switch repeatedly among those menus. The operations required to enable the subtitle function are thus simplified, the time needed to enable the real-time subtitle function is shortened, and the efficiency with which the user enables and uses the subtitle function is improved.
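As a minimal sketch of this first-aspect flow, the following Kotlin code wires the detection and display steps together. All interface names (SoundDetector, VoiceClassifier, LanguageDetector, AsrEngine, and the prompt and display callbacks) are hypothetical placeholders for the terminal's internal modules, not APIs named in the application.

```kotlin
// Hypothetical detector interfaces; the application does not name concrete APIs.
interface SoundDetector { fun hasSound(frame: ShortArray): Boolean }
interface VoiceClassifier { fun containsHumanVoice(frame: ShortArray): Boolean }
interface LanguageDetector { fun detect(frame: ShortArray): String }   // e.g. "en", "zh"
interface AsrEngine { fun transcribe(frame: ShortArray, language: String): String }

class SubtitlePipeline(
    private val sound: SoundDetector,
    private val voice: VoiceClassifier,
    private val lang: LanguageDetector,
    private val asr: AsrEngine,
    private val promptUser: (onConfirmed: () -> Unit) -> Unit, // shows the first interface
    private val showSubtitle: (String) -> Unit,
) {
    fun onAudioFrame(frame: ShortArray) {
        if (!sound.hasSound(frame)) return           // sound input or output present?
        if (!voice.containsHumanVoice(frame)) return // does the sound include a human voice?
        val firstLanguage = lang.detect(frame)       // first language of the voice
        promptUser {                                 // first operation: user enables subtitles
            showSubtitle(asr.transcribe(frame, firstLanguage))
        }
    }
}
```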
In one possible design, prior to the displaying of the first interface, the method further includes: judging whether the first language belongs to one of the preset type languages. The preset type languages are the languages for which the subtitle function of the first terminal can support speech recognition; or they are those languages excluding the language currently used by the first terminal; or they are those languages excluding the native language of the first terminal's user.
In one possible design, the displaying of the first interface includes: displaying the first interface in the case that the first language is judged to belong to one of the preset type languages.
In one possible design, prior to the displaying of the subtitles, the method further comprises: after receiving the first operation, confirming whether the language currently used by the first terminal is the same as the first language, and, if the two differ, prompting the user whether to select the language currently used by the first terminal as the subtitle language;
the displaying of the subtitles includes: displaying the subtitle after receiving the operation, input by the user, for selecting the language currently used by the first terminal as the subtitle language, wherein the language of the subtitle is the language currently used by the first terminal.
In one design, the displaying of the first interface includes: displaying the first interface when the first language belongs to one of the preset type languages and the language currently used by the first terminal differs from the first language, wherein the first interface is used to prompt the user whether to enable the subtitle function and whether to select subtitles in the language currently used by the first terminal. (Multiple languages may be offered on the first interface for the user to choose from, including the first language and the language currently used by the first terminal; the user may select one or more of them. Further, the language currently used by the first terminal may be highlighted or displayed preferentially on the first interface.) The receiving of the first operation input by the user then includes receiving a first operation that both enables the subtitle function and selects subtitles in the language currently used by the first terminal, and the displayed subtitles include (and may consist only of) subtitles in the language currently used by the first terminal. The preset type languages are as described elsewhere herein.
In one possible design, the subtitle obtained by converting the human voice in the first language is one of the following:
the subtitle is a subtitle in the first language;
or, the subtitle is a subtitle in a second language, the second language being different from the first language;
or, the subtitle is a multi-language subtitle that includes at least the first language and the second language.
In one possible design, the second language includes a language set by the user or the first terminal.
In one possible design, the method further includes:
and displaying a second interface, wherein the second interface is used for prompting a user to set a subtitle language.
In one possible design, prior to detecting the first language to which the current human voice belongs, the method further includes:
detecting whether a first preset condition is met, wherein the first preset condition comprises any one of, or a combination of more than one of, the following items: a video-type application is started; a web page whose address matches the URL format of a video website is opened; the started application calls a video interface; the current display frame rate matches a preset frame rate; the interaction between the started application and the user meets a preset rule; the screen is in a full-screen or landscape display state; network traffic increases suddenly; or an audio-type application is started (a sketch of this condition check is given after the related designs below).
In one possible design, detecting the first language to which the current human voice belongs includes: triggering detection of the first language to which the current human voice belongs in the case that the first preset condition is met.
In one possible design, detecting whether the current sound includes a human voice includes: triggering detection of whether the current sound includes a human voice in the case that the first preset condition is met.
In one possible design, detecting whether the first terminal has sound input or sound output includes: triggering detection of whether the first terminal has sound input or sound output in the case that the first preset condition is met.
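A sketch of the first preset condition as a disjunction of the listed signals, assuming the terminal can expose them as a plain state snapshot. The field names, the regular expression for a video-website URL, and the frame-rate and traffic thresholds are illustrative assumptions, not values given in the application.

```kotlin
data class TerminalState(
    val foregroundAppType: String,    // e.g. "video", "audio", "browser", "other"
    val openedUrl: String?,           // currently opened web page, if any
    val usesVideoInterface: Boolean,  // the started application called the video interface
    val displayFrameRate: Int,        // current display frame rate
    val fullScreenOrLandscape: Boolean,
    val networkBytesPerSec: Long,
)

fun firstPresetConditionMet(s: TerminalState): Boolean {
    // Assumed URL pattern for "a web page matching a video website format".
    val videoSiteUrl = Regex("""https?://.*(video|tv|stream).*""")
    return s.foregroundAppType == "video" ||
        s.foregroundAppType == "audio" ||
        (s.openedUrl != null && videoSiteUrl.matches(s.openedUrl)) ||
        s.usesVideoInterface ||
        s.displayFrameRate >= 24 ||           // assumed "preset frame rate"
        s.fullScreenOrLandscape ||
        s.networkBytesPerSec > 1_000_000      // assumed "sudden traffic increase" threshold
}
```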
In one possible design, the method further includes: and sending an instruction to a second terminal, wherein the instruction is used for instructing the second terminal to display the subtitles.
In one possible design, the method further includes:
displaying a third interface, wherein the third interface is used for prompting a user whether to display the subtitles on a second terminal;
the sending the instruction to the second terminal includes: and sending the instruction to the second terminal under the condition that a second operation input by the user is detected, wherein the second operation is used for indicating that the subtitles are displayed on the second terminal.
In one possible design, in a multiparty communication scenario, the subtitle display area of the second terminal is divided into one or more subtitle windows, different subtitle windows corresponding to different communication objects, the different communication objects including the user, and the method further includes at least one of:
different subtitle windows have different user interface UI effects;
the positions of different subtitle windows differ, and each subtitle window is located near the avatar or picture of its corresponding communication object.
In one possible design, in a multiparty communication scenario, a caption display area of the first terminal is divided into one or more caption windows, different caption windows corresponding to different communication objects, the different communication objects including the user, and the method further includes at least one of:
different subtitle windows have different user interface UI effects;
the positions of different subtitle windows differ, and each subtitle window is located near the avatar or picture of its corresponding communication object.
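A minimal sketch of the per-speaker window assignment described in these two designs; SubtitleWindow and its fields are hypothetical names, and anchoring each window near its speaker's avatar is an assumed layout policy.

```kotlin
data class SubtitleWindow(
    val speakerId: String,     // the communication object this window belongs to
    val uiStyle: Int,          // distinct UI effect per window
    val anchorNearAvatar: Boolean,
)

fun layoutSubtitleWindows(speakerIds: List<String>): List<SubtitleWindow> =
    speakerIds.mapIndexed { index, id ->
        // Give each communication object its own window with its own UI style,
        // positioned near that object's avatar or picture.
        SubtitleWindow(speakerId = id, uiStyle = index, anchorNearAvatar = true)
    }
```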
In one possible design, the second terminal is a large screen device.
In one possible design, the method further includes: closing the subtitle function and stopping the display of subtitles if a second preset condition is met;
the second preset condition comprises any one of, or a combination of more of, the following items: no sound is detected within a first time period; an operation input by the user instructing to close the subtitle function is detected; the video-type application is closed; the web page matching the URL format of a video website is closed; the display frame rate within a second time period does not match the preset frame rate; or the audio-type application is closed.
Some of the above implementations may be combined with each other without conflict between the schemes.
In a second aspect, an electronic device is provided. The electronic device includes: a processor, a memory and a display screen, the memory and the display screen being coupled to the processor, the memory being configured to store computer program code, the computer program code comprising computer instructions that, when read from the memory by the processor, cause the electronic device to perform the method for subtitle display according to any one of the possible implementations of the first aspect.
In a third aspect, a computer-readable storage medium is provided, on which a computer program or instructions are stored, which, when run on a computer, cause the computer to perform the method for subtitle display according to any one of the possible implementations of the first aspect.
In a fourth aspect, there is provided a computer program product comprising: computer program or instructions for causing a computer to perform the method for subtitle display according to any one of the possible implementations of the first aspect, when the computer program or instructions is run on a computer.
In a fifth aspect, an embodiment of the present application provides a chip system, which includes at least one processor and at least one interface circuit, wherein the at least one interface circuit is configured to perform a transceiving function and send instructions to the at least one processor; when the at least one processor executes the instructions, it performs the method for displaying subtitles according to the first aspect and any one of its possible implementations.
Drawings
Fig. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a block diagram of a software structure of an electronic device according to an embodiment of the present disclosure;
fig. 3 is an interface diagram related to a subtitle display method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a subtitle display method according to an embodiment of the present application;
fig. 5 is a scene schematic diagram of character sound detection and language detection performed by a model according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a subtitle display method according to an embodiment of the present application;
fig. 7-21 are schematic interface diagrams related to a subtitle display method according to an embodiment of the present application;
fig. 22 is a second schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 23 is a schematic structural diagram of a chip system according to an embodiment of the present application.
Detailed Description
The following describes a subtitle display method and an electronic device provided by an embodiment of the present application in detail with reference to the accompanying drawings.
The terms "comprising" and "having," and any variations thereof, as referred to in the description of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the description of the present application, the meaning of "a plurality" means two or more unless otherwise specified. "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone.
In the embodiments of the present application, a subscript such as W₁ may sometimes be written in a non-subscript form such as W1; where the distinction is not emphasized, the intended meaning is the same.
First, for the sake of understanding, the following description is made of terms and concepts related to the embodiments of the present application.
(1) ASR (automatic speech recognition)
The user can use a device to perform audio- and video-related operations. Audio operations include, but are not limited to, playing audio, recording audio, and the like. A scene in which a user performs an audio operation with a device is referred to as an audio scene (or a voice scene). Video operations include, but are not limited to, playing a video, recording a video, live video streaming, and the like. A scene in which the user performs a video operation with the device is referred to as a video scene. In a video scene, the played video data may include both voice data (or audio data) and image data; in this case, the video scene can also be subsumed under the audio scene.
When a user uses the real-time subtitle function, the speech in the audio or video needs to be recognized. In some schemes, the system may capture audio data within the device and employ ASR techniques to convert the speech into text, which is output to the display screen in the form of subtitles for the user to read. Ideally, the converted text corresponds one-to-one with the original speech content, without semantic rewriting.
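As a minimal sketch of such in-device capture feeding an ASR engine, the following Kotlin code reads microphone PCM frames with Android's AudioRecord (the RECORD_AUDIO permission is assumed to be granted); transcribe is a hypothetical stand-in for whatever ASR model the device ships, and the bounded loop merely keeps the sketch finite.

```kotlin
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

fun captureAndTranscribe(
    transcribe: (ShortArray) -> String,   // hypothetical ASR call
    onSubtitle: (String) -> Unit,         // hands text to the subtitle display
) {
    val sampleRate = 16_000               // a common ASR input rate
    val bufSize = AudioRecord.getMinBufferSize(
        sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC, sampleRate,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufSize)
    val frame = ShortArray(bufSize / 2)
    recorder.startRecording()
    try {
        repeat(100) {                     // bounded loop for the sketch
            val n = recorder.read(frame, 0, frame.size)
            if (n > 0) onSubtitle(transcribe(frame.copyOf(n)))
        }
    } finally {
        recorder.stop()
        recorder.release()
    }
}
```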
In some current real-time subtitle schemes, when a user wants real-time subtitles, the user can manually turn on a real-time subtitle switch provided by the system, and the mobile phone converts the speech content into text and displays it on the screen as subtitles. When the user is no longer watching the video, the real-time subtitles may block the user's view, hindering normal use of the mobile phone and degrading the user experience. Therefore, when the user no longer uses the real-time subtitle function, the real-time subtitle switch has to be turned off manually.
That is to say, both when a user starts watching a video and when the user finishes watching it, the user must actively turn the system's real-time subtitle function on or off in time, which makes the operation process cumbersome.
To solve the above technical problem, embodiments of the present application provide a subtitle display method and an electronic device, which can be applied to scenes such as video playing, video calls, video recording, live video broadcasting, audio playing and audio recording. In a scene using the real-time subtitle function, the operation complexity for the user can be reduced and the user experience improved. Video calls include, but are not limited to, carrier-network-based video calls (such as voice over long-term evolution (VoLTE) video calls) and internet-based video calls (such as video calls made through a calling application).
The subtitle display method provided by the embodiment of the application can be applied to the electronic device 100 or a system comprising the electronic device 100.
Optionally, the electronic device 100 may specifically be a mobile phone, a tablet computer, an in-vehicle device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an artificial intelligence (AI) device, a wearable device, or another terminal device having a speech recognition function; the wearable device may be a smart watch, a smart bracelet, wireless earphones, smart glasses, a smart headset, a blood glucose meter, a blood pressure meter, and the like. The embodiment of the present application does not limit the specific type of the electronic device 100.
Fig. 1 shows a schematic structural diagram of an electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a sensor module 190, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be, among other things, a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments of the present application, the electronic device 100 may utilize the processor 110 to identify a voice stream.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger.
The power management module 141 is used for connecting the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, with N being a positive integer greater than 1. In some embodiments of the present application, the display screen 194 may be used to display subtitles translated from speech.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The camera 193 is used to capture still images or video.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like. The processor 110 performs the various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor. In some embodiments of the present application, the internal memory 121 may be used to store machine learning models for implementing automatic speech recognition (ASR).
The audio module 170 includes a speaker, a receiver, a microphone, an earphone interface, and the like.
The audio module 170 serves to convert digital audio data into an analog audio electrical signal for output, and to convert an analog audio electrical signal input into digital audio data; the audio module 170 may include an analog-to-digital converter and a digital-to-analog converter. In some embodiments of the present application, the audio module 170 may be used to capture voice streams, audio signals, voice signals, and the like.
In some embodiments, electronic device 100 may implement audio functionality through audio module 170, as well as an application processor and the like. Such as music playing, recording, etc.
The sensor module 190 may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention uses an Android system with a layered architecture as an example to exemplarily illustrate a software structure of the electronic device 100.
Fig. 2 is a block diagram of a software configuration of the electronic apparatus 100 according to the embodiment of the present invention.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 2, the application package may include calendar, map, WLAN, short message, gallery, navigation, first application, and other applications.
In some embodiments of the present application, the first application includes a voice-related application. A voice-related application refers to an application through which voice can be output (e.g., played through the electronic device) or input to the electronic device (e.g., in a video scene, the camera application can call the microphone through a corresponding driver to collect the user's voice information). The first application may be, for example and without limitation, a video, camera, music, or call application.
The first application may be a pre-installed application or an application downloaded through a third party application store. The embodiment of the present application does not limit the specific implementation of the first application.
In some embodiments of the present application, audio may be output or input through some of these applications. When the electronic device detects the audio, ASR technology may be used to convert part of its content, such as the speech of a person in the audio ("human voice" for short), into subtitles, and the subtitles may be displayed on the display screen.
Taking the example of a user watching a network video, the user can watch it through a browser, through a video player (such as Youku, iQIYI, and the like), or through other applications (such as watching videos pushed in Weibo or WeChat).
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
In some embodiments of the present application, the framework layer further comprises a video interface. A video interface is also referred to herein as a video service, a video module, or by another name. For certain video applications, the video interface can be used to perform operations such as decoding by calling the system's decoder through the corresponding driver, so as to play the decoded video.
Not all video applications call the system's decoder through the system's video interface. Some video applications integrate their own decoders and call those to perform video decoding, so as to play the corresponding videos.
In some embodiments of the present application, the framework layer further includes a sound interface (or sound module) for detecting sound input into the electronic device or output from the electronic device.
In some embodiments of the present application, the framework layer further includes a frame rate interface (or frame rate module, not shown in fig. 2) for querying the display frame rate.
In some embodiments of the present application, the framework layer further includes a traffic interface (or traffic module, not shown in fig. 2) for querying real-time network traffic.
Optionally, the framework layer may further include other interfaces or modules required for implementing the technical solution of the embodiment of the present application.
The video interface, the sound interface, the frame rate interface, the traffic interface, and the like may be collectively referred to as a first module. The first module may integrate the functions of the video interface, the sound interface, the frame rate interface and the traffic interface, or it may be split into multiple sub-modules that respectively implement those functions.
In some embodiments, the sub-modules of the first module may be located at different layers. For example, some sub-modules are located at the framework layer and some at other layers. The embodiment of the application does not limit the specific implementation layer or the specific implementation details of each sub-module.
For example, on an Honor mobile phone, a video application pre-installed in the phone may call the system's decoder through the system's video interface, while a third-party application in the phone, such as Weibo, may call its own decoder to complete video decoding, so as to play the video.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar; it can be used to convey notification-type messages that disappear automatically after a short stay without requiring user interaction, such as notifications of download completion or message alerts. The notification manager may also present notifications in the system's top status bar in the form of a chart or scrolling text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is presented in the status bar, an alert tone is sounded, or an indicator light blinks.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), Media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files, among others. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, and a sensor driver.
The subtitle display method provided by the embodiment of the present application will be described below by taking an electronic device as a terminal having a structure shown in fig. 1 and 2, and specifically taking an electronic device as a mobile phone as an example. The technical scheme of the embodiment of the application can be used for various voice recognition related scenes, such as scenes of watching local or network videos, playing local or network audios, live videos, video calls and the like.
For example, taking a user watching a network video, as shown in (1) of fig. 3, when the user is watching a network video and the mobile phone detects that the currently played video includes a human voice, an interface 309 (i.e., a first interface) is displayed to prompt the user whether to turn on real-time subtitles (also called artificial intelligence (AI) subtitles). Interface 309 includes the player interface 301. Illustratively, as shown in (1) of fig. 3, when an operation input by the user instructing to turn on real-time subtitles is detected, such as the user clicking the "open" button in the control 302, the mobile phone displays a subtitle display control 303 as shown in (2) of fig. 3; the subtitle display control 303 is used to display the subtitles converted from the human voice.
Optionally, the operation for turning on real-time subtitles is not limited to clicking the control 302; it may also be a slide operation or another operation on the control 302.
In some embodiments, the control 302 may be implemented as shown in fig. 3. In other embodiments, the control 302 may be implemented in other manners; for example, the control 302 may be a floating control. Unlike the control 302 shown in fig. 3, where a button at a fixed position (e.g., the "open" button) must be clicked to trigger real-time subtitles, the user can turn on real-time subtitles by, for example, clicking any position in the floating control.
Optionally, if the mobile phone does not detect any user operation on the control 302 within a period of time after the control 302 is displayed, it stops displaying the control 302, so as to prevent the control 302 from obstructing the user's view. In that case, real-time subtitles may not be displayed.
Optionally, in order to make the user notice the control 302 and turn on the real-time subtitle function in time, the control 302 may be displayed with a preset icon effect. For example, the control 302 may flash, be displayed in a vivid color, or be displayed with a dynamic animation effect; the specific presentation manner of the control 302 is not limited.
Optionally, the subtitle display control 303 may be a floating control. The user can control the position and size of the subtitle display control 303 by operation, to prevent the subtitle display control 303 from obstructing the user's view.
It can be seen that, in the embodiment of the present application, the language of the voice can be detected, and the voice can be converted into the corresponding subtitle and output to the display screen based on the detection result (i.e., the specific language, such as Chinese or English) and a preset subtitle rule.
The preset subtitle rule includes, but is not limited to, any one or a combination of the following rules: the subtitle language is the same as the language of the sound source; the subtitle language is a combination of the sound-source language and another preset language; the subtitle language is a combination of the sound-source language and the user's preferred language; or the subtitle language is the user's preferred language. In other words, the subtitle rules cover one or more of: whether the current language is displayed, whether languages other than the current language are displayed, and which those other languages are. In general, single-language or multi-language subtitles may be displayed.
The single-language subtitle case is divided into the following two cases:
Case 1: if the subtitle rule is to display the current language, the language of the real-time subtitles displayed by the electronic device is the current language. The current language may refer to the language of the human voice detected by the electronic device, or to the language currently used by the electronic device, for example the language set by the "language and input method" setting item.
Case 2: if the subtitle rule is to display a language other than the current language (such as Chinese), and that other language is English, the real-time subtitles displayed by the electronic device are English subtitles.
Multi-language subtitles, such as the bilingual case: if the subtitle rule is to display the current language (such as Chinese) while also displaying another language, and that other language is English, the real-time subtitles displayed by the electronic device are Chinese-English bilingual subtitles.
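A minimal sketch of these subtitle rules as data, assuming language tags such as "zh" and "en"; the type and function names are illustrative, not from the application.

```kotlin
sealed class SubtitleRule {
    object SameAsSource : SubtitleRule()                         // case 1
    data class OtherLanguage(val lang: String) : SubtitleRule()  // case 2
    data class Bilingual(val otherLang: String) : SubtitleRule() // multi-language case
}

// Returns the list of languages the real-time subtitles should be rendered in.
fun subtitleLanguages(sourceLang: String, rule: SubtitleRule): List<String> =
    when (rule) {
        is SubtitleRule.SameAsSource -> listOf(sourceLang)
        is SubtitleRule.OtherLanguage -> listOf(rule.lang)
        is SubtitleRule.Bilingual -> listOf(sourceLang, rule.otherLang)
    }
```

For example, subtitleLanguages("en", SubtitleRule.Bilingual("zh")) would yield English-Chinese bilingual subtitles.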
In some embodiments, the technical solution of the embodiments of the present application can be represented by fig. 4. The following describes a specific implementation process of the technical solution of the embodiment of the present application with reference to fig. 4. As shown in fig. 4, the technical solution provided by the embodiment of the present application includes the following steps:
s101, the electronic equipment detects sound.
The electronic device is a terminal such as a mobile phone. The electronic device may be the structure of fig. 1 or fig. 2.
As a possible implementation, the electronic device detects the sound through the system interface. Optionally, the system interface is located at the framework layer, or at other layers. Or, the electronic device detects the sound in other ways, and the embodiment of the present application does not limit specific implementation of detecting the sound.
The sound detected by the electronic device may be sound input into the electronic device, such as sound input by a microphone in a video recording or live scene, and may also be sound output by the electronic device, such as sound output by a speaker in a movie video scene.
S102, the electronic device judges whether sound is detected; if so, step S103 is executed, and if not, it continues to detect sound.
Optionally, the judging by the electronic device of whether sound is detected includes: the electronic device determines whether sound input or sound output is detected.
S103, the electronic device judges whether the sound is a human voice. If yes, step S104 is executed; if the sound is not a human voice, sound detection continues.
As a possible implementation, the electronic device determines, according to a preset algorithm, whether the detected sound is, or includes, a human voice. For example, the electronic device stores a machine learning model for detecting the type of a sound and uses the model to detect whether the sound is a human voice. The model can be obtained by training on data samples with a preset algorithm. The data samples may be sound samples, including positive samples and negative samples: positive samples are samples of human voices, and negative samples are samples of non-human sounds, such as natural wind and rain.
Illustratively, as shown in fig. 5, the electronic device inputs the detected sound into model 1 and obtains an output indicating that the detected sound is human speech.
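A toy stand-in for the "model 1" call site, assuming 16-bit PCM frames: a real implementation would run the trained classifier described above, whereas this RMS-energy heuristic only illustrates where such a check plugs in.

```kotlin
import kotlin.math.sqrt

// Assumed energy threshold; a trained model ("model 1") would replace this heuristic.
fun looksLikeHumanVoice(frame: ShortArray, threshold: Double = 500.0): Boolean {
    if (frame.isEmpty()) return false
    val rms = sqrt(frame.sumOf { it.toDouble() * it } / frame.size)
    return rms > threshold
}
```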
S104, the electronic device detects and judges whether the language of the human voice is a preset type language; if so, step S105 is executed, and if not, sound detection may continue.
Specifically, the electronic device detects the first language to which the current human voice belongs and determines whether the first language belongs to one of the preset type languages. The preset type languages are the languages for which the subtitle function of the first terminal can support speech recognition; or those languages excluding the language currently used by the first terminal; or those languages excluding the native language of the first terminal's user. Each variant is described below.
As one possible design, the preset type languages may be the languages for which the subtitle function of the electronic device can support speech recognition (some operations corresponding to the subtitle function may be implemented by the electronic device and/or a server). For example, if the electronic device can support speech recognition of English, Chinese, Russian and Japanese audio, the languages for which its subtitle function can support speech recognition are English, Chinese, Russian and Japanese.
In another design, the preset type languages may be the languages for which the subtitle function of the electronic device can support speech recognition, excluding the language currently used by the electronic device. The language currently used by the electronic device may be the language set by the "language and input method" setting, such as Chinese (simplified). For example, if the electronic device detects that the first language corresponding to the current human voice is Chinese (Chinese being one of the languages whose speech recognition the subtitle function supports), and the language currently used by the electronic device is also Chinese, the user can understand the voice, so the probability that the user would turn on the subtitle function is low. In this case the electronic device may refrain from actively prompting the user to turn on the subtitle function, which improves the effectiveness of human-computer interaction and reduces ineffective interaction. Conversely, if the electronic device detects that the first language corresponding to the current human voice is English (English being one of the supported languages) while the language currently used by the electronic device is Chinese, the user likely understands Chinese but not English, so the user is likely to want the subtitle function. In this case, the electronic device may actively prompt the user to turn it on, and the language of the displayed subtitles may be the language currently used by the electronic device.
In another design, the preset type languages may be the languages for which the subtitle function of the electronic device can support speech recognition, excluding the native language of the electronic device's user; this can be understood by analogy with the example in the previous paragraph. The electronic device may determine the user's native language from information such as a user profile, or from collected recording files and the like.
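The three designs can be sketched as one predicate deciding whether to show the first interface; the parameter names and the design selector are illustrative assumptions.

```kotlin
fun shouldPromptForSubtitles(
    firstLanguage: String,
    supported: Set<String>,  // languages the subtitle function can recognize
    deviceLang: String,      // language currently used by the terminal
    nativeLang: String?,     // user's native language, e.g. inferred from a user profile
    design: Int,             // which of the three designs is in force (1, 2 or 3)
): Boolean = when (design) {
    1 -> firstLanguage in supported
    2 -> firstLanguage in supported && firstLanguage != deviceLang
    else -> firstLanguage in supported && firstLanguage != nativeLang
}
```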
As a possible implementation, the electronic device recognizes the language of the detected human voice according to a preset algorithm and can determine whether that language is one for which the electronic device can support speech recognition. For example, a machine learning model for detecting the language of a voice is stored in the electronic device, and the electronic device uses this model to recognize the language of the human voice. The model can be obtained by training on data samples with a preset algorithm. The data samples may be samples of multiple languages, such as, but not limited to, English sound samples, Chinese sound samples and French sound samples.
Illustratively, as shown in fig. 5, the electronic device inputs the detected human voice data into model 2 and obtains, through the operation of model 2, an output result indicating that the detected human voice is in a language for which speech recognition can be supported.
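A sketch of a wrapper around such a "model 2" classifier, assuming the model emits per-language scores; LangIdModel, the score map, and the confidence threshold are hypothetical.

```kotlin
class LangIdModel(private val supported: Set<String>) {
    // Returns the supported language with the highest score, or null when the
    // classifier is not confident enough to treat the speech as recognizable.
    fun detect(scores: Map<String, Float>, minConfidence: Float = 0.6f): String? =
        scores.filterKeys { it in supported }
            .maxByOrNull { it.value }
            ?.takeIf { it.value >= minConfidence }
            ?.key
}
```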
S105, the electronic device displays a first interface.
The first interface is used to prompt the user whether to enable the real-time subtitle function.
For example, when it is detected that the language corresponding to the human voice (i.e., the first language) is one of the preset type languages, the electronic device may display the first interface to suggest that the user turn on the real-time subtitle function; otherwise, the electronic device may refrain from doing so, which reduces the probability of displaying erroneous subtitles.
In addition to detecting whether the language corresponding to the human voice (i.e., the first language) is one of the preset type languages, the first terminal may optionally also detect whether the first language is the same as the language currently used by the first terminal.
Optionally, when it is detected that the first language belongs to one of the preset type languages and the language currently used by the first terminal differs from the first language, the first terminal displays the first interface, which prompts the user whether to enable the subtitle function and whether to select subtitles in the language the first terminal currently uses. Illustratively, as shown in (1) of fig. 16, the mobile phone detects a video including a human voice; if the language of the voice is English and the language currently used by the mobile phone is Chinese, the mobile phone displays a first interface 1005 that includes a control 1004. When a user operation such as clicking the "on" button is detected, the mobile phone displays subtitles in the currently used language (i.e., Chinese).
In some existing real-time subtitle schemes, even if the language of the voice is one for which the mobile phone's subtitle function does not support speech recognition, the phone still recognizes the voice and displays subtitles in a language it does support; the displayed subtitles therefore have a high error rate, and these erroneous subtitles block the user's view and make it inconvenient to use the screen. With the subtitle display method of the embodiment of the present application, for human audio that cannot be successfully recognized, the electronic device neither prompts the user to enable the real-time subtitle function nor enables it itself, thereby reducing the probability of displaying erroneous subtitles; the user's view is not blocked and normal use is not affected.
Moreover, in some embodiments, because language detection and recognition are performed on the human voice, only human voices in languages that the electronic device's speech recognition can support are converted into text to form subtitles, which improves subtitle correctness.
S106, in the case that an instruction to enable the real-time subtitle function is detected, the electronic device enables the real-time subtitle function and displays real-time subtitles.
The real-time subtitles may be in the first language, obtained by recognizing the human voice (i.e., the speaker's speech) in the first language.
Optionally, the real-time subtitles are subtitles in a second language, the second language being the same as or different from the first language. In other words, the embodiment of the present application may render the speaker's spoken language as subtitles in that same language, or convert (or translate) the speaker's speech into subtitles in another language.
Or, optionally, the real-time subtitles are multi-language subtitles, the multiple languages including at least the first language and a second language different from the first language. Illustratively, the speaker's speech may be converted into, for example, bilingual subtitles that include the speaker's spoken language.
The specific language and setting method of the real-time subtitles are described in detail in the following embodiments.
Optionally, in the case of executing S105, that is, in the case that the electronic device prompts the user whether to start the real-time subtitle function, the electronic device needs to wait for the user to instruct to start the real-time subtitle function, and in the case that an instruction (that is, a first operation) that the user instructs to start the real-time subtitle function is detected, for example, the user clicks a start button in the control 302 shown in fig. 3, the electronic device starts the real-time subtitle function, and displays the real-time subtitle. Or after detecting the video including the voice, the electronic device does not prompt the user whether to start the real-time caption function, but directly starts the real-time caption function, and displays the real-time caption according to the set caption rule.
It should be noted that, in the embodiment of the present application, the setting items related to real-time subtitles may be set by the user or may take default values; the embodiment of the present application does not limit the specific setting manner.
The real-time subtitles are subtitles converted from human voices only; they do not include subtitles converted from other, non-human sounds. In addition, the real-time subtitles are formed according to the subtitle rules.
Optionally, S106 may also be replaced by: when an instruction from the user to start the real-time caption function is detected, further prompting the user to select the caption language to be displayed. For example, suppose the language currently used by the electronic device is Chinese, the voice recognized by the electronic device is English, and English is one of the languages that the subtitle function on the electronic device supports for speech recognition but is not the language currently used by the electronic device. The electronic device may first prompt the user whether to start the real-time subtitle function. After the user confirms, the electronic device may further check whether the language it currently uses is the same as the language of the recognized voice. If the two languages differ, the electronic device may further prompt the user whether to display the subtitles in the language currently used by the electronic device (in other words, whether to translate the recognized speech into the language currently used by the terminal). If the user chooses to do so, the language of the displayed subtitles is the language currently used by the first terminal. Illustratively, as shown in (1) of fig. 15, the mobile phone detects a video including a character sound. Assuming that the language of the character sound is English and the language currently used by the mobile phone is Chinese, the mobile phone displays the interface 312 shown in (2) of fig. 15, where the interface 312 includes a control 1006. When an operation such as the user clicking the "yes" button is detected, the mobile phone displays the subtitles shown in (3) of fig. 15 in the currently used language (i.e., Chinese).
If the currently used language is detected to be the same as the language of the recognized voice, the electronic device does not need to further prompt the user to select the subtitle language; it can directly display the subtitles obtained from speech recognition.
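As an illustrative, non-limiting sketch of the prompt logic just described, the following Kotlin fragment decides whether a translation prompt is needed once the user enables the real-time caption function. All helper names here (showSubtitles, promptUser, onRealtimeCaptionEnabled) are hypothetical stand-ins, not APIs of any real terminal.

```kotlin
// Hypothetical helpers standing in for the terminal's real dialog and caption APIs.
fun showSubtitles(targetLanguage: String) = println("showing subtitles in $targetLanguage")

fun promptUser(message: String, onResult: (Boolean) -> Unit) {
    println(message)   // a real UI would show a dialog here
    onResult(true)     // assume the user taps "Yes" in this sketch
}

// Decide whether a further language prompt is needed after captions are enabled.
fun onRealtimeCaptionEnabled(deviceLanguage: String, spokenLanguage: String) {
    if (deviceLanguage == spokenLanguage) {
        showSubtitles(spokenLanguage)  // same language: display recognized subtitles directly
    } else {
        promptUser("Display subtitles in $deviceLanguage?") { accepted ->
            showSubtitles(if (accepted) deviceLanguage else spokenLanguage)
        }
    }
}
```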
In other embodiments, for the case that the first interface prompts the user both whether to turn on the subtitle function and whether to display the subtitles in the language currently used by the first terminal, receiving the first operation input by the user for turning on the subtitle function includes: receiving a first operation, input by the user, for starting the subtitle function and selecting subtitles in the language currently used by the first terminal. Accordingly, the displayed subtitles include subtitles in the language currently used by the first terminal. Illustratively, as shown in (1) of fig. 16, the mobile phone detects an operation (i.e., the first operation) of the user such as clicking an "on" button; then, as shown in (2) of fig. 16, the mobile phone displays subtitles in the currently used language (i.e., Chinese).
The interfaces above are all exemplary; the embodiments of the present application do not limit their specific forms. For example, the "open" and "cancel" buttons in the control 1004 shown in (1) of fig. 16 may be replaced by a combination of buttons such as "open subtitle function and display subtitles in the currently used language (e.g., Chinese)", "open subtitle function and set subtitle language", and "cancel". Other interface implementations are also possible.
Generally, as shown in fig. 5, the process of determining through a model whether the audio is human speech, and determining through a model the language of that speech, involves various algorithms and models, and therefore has a large amount of computation and high power consumption.
S107, when a condition for triggering the closing of the real-time caption function is met, the electronic device closes the real-time caption function.
It can be understood that when the condition for triggering the closing of the real-time caption function is satisfied, the electronic device closes the real-time caption function and stops displaying the real-time caption.
The condition for triggering the closing of the real-time caption function may be: within a period of time (i.e., a first period of time), the electronic device detects no sound. The length of this period may be set.
Alternatively, the condition for triggering the closing of the real-time subtitle function may be: the electronic device detects a user-input operation indicating to turn off the real-time subtitle function. For example, as shown in (1) or (2) of fig. 3, when the user's operation, such as clicking the option with the cross sign "X" in the control 303, is detected, the mobile phone stops displaying the real-time subtitles. The operation for turning off the real-time subtitle function may also be another operation; the embodiment of the present application is not limited in this respect.
Alternatively, in a video scene, the conditions triggering the closing of the real-time caption function include, but are not limited to, one or more of the following: the user closes the video-type application; the user closes the relevant web page of the video website; or the display frame rate within a period of time (i.e., a second time period) no longer conforms to the video frame rate or the animation frame rate.
Alternatively, in an audio scene, the condition triggering the closing of the real-time subtitle function may be that the user closes the audio application.
The condition for triggering the closing of the real-time subtitle function may also be another condition; the embodiment of the present application is not limited in this respect.
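The close triggers enumerated above can be summarized in a small sketch. This is a minimal illustration under assumed state fields and an assumed 10-second silence threshold, not a definitive implementation.

```kotlin
// Hypothetical snapshot of the caption-related state the device tracks.
data class CaptionState(
    val millisSinceLastSound: Long,       // duration of silence so far
    val userRequestedClose: Boolean,      // e.g. the user tapped the "X" option
    val videoAppClosed: Boolean,          // video app or video web page was closed
    val frameRateLooksLikeVideo: Boolean  // frame rate still matches a video/animation rate
)

// Any one trigger is enough to close the real-time caption function.
// The silence threshold (the "first time period") is an arbitrary example value.
fun shouldCloseRealtimeCaptions(s: CaptionState, silenceThresholdMs: Long = 10_000): Boolean =
    s.millisSinceLastSound >= silenceThresholdMs ||
        s.userRequestedClose ||
        s.videoAppClosed ||
        !s.frameRateLooksLikeVideo
```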
Therefore, the embodiment of the application designs multiple ways of triggering the closing of the real-time captions, making the closing of real-time captions more flexible.
The prior art provides a real-time caption (Live Caption) display scheme for accessibility scenarios, designed mainly for people with disabilities (e.g., deaf or mute users). It can convert almost all audio into captions, which requires the mobile phone to detect and recognize sound at all times; once sound is detected, it is immediately converted into text and displayed to the user as captions. In this scheme, whether a sound is human speech is detected continuously, the detection runs for a long time, and the power consumption is severe. The time this scheme takes to detect, recognize, and display captions is shown in table 1.
TABLE 1
(Table 1, provided as an image in the original publication, lists the detection, recognition, and subtitle display times used for comparison.)
In order to reduce the power consumption of the electronic device during subtitle display, the embodiment of the application may optionally first determine whether the user is in an audio or video scene in which subtitle display may be needed, and decide, based on the scene determination result, whether to trigger steps such as voice detection and language detection. This reduces the power consumption caused by one or more of unnecessary voice detection, language detection, and the like in scenes that are neither a preset video scene nor a preset audio scene.
The preset video scenes include, but are not limited to, playing a video, recording a video, live video streaming, and the like. The preset audio scenes include, but are not limited to, playing a recording, listening to an audio program using an audio application (e.g., Himalaya, audiobook software, etc.), and the like.
First, take a video scene as an example. As a possible implementation, a preset condition may be set, and the electronic device determines that the user is about to enter, or is already in, a video scene when the preset condition is met.
The preset condition includes, but is not limited to, one or more of the following:
a1) The application (App) started by the user is in the video-class application white list.
The system may be provided with different types of application white lists according to different policies. For example, white lists may be organized by application type, with video applications classified into one application white list. Other methods of classifying application white lists are possible; the embodiment of the present application is not limited in this respect.
b1) The data in the App conforms to the system's definition of a video website.
The data in the App may be, for example but not limited to, a browser web address. For example, the system may be preconfigured with the URL formats of video websites, such as youku.com, aiqiyi.com, tengxun.com, and so on. When the user opens a web address that conforms to a predefined video-website format, such as https://m.youku.com/alipay_video/id_adbd5cc3e8e64e668546.html_spm=a2hww.12518357.drawer6.dzj2_6, the address contains the predefined string youku.com, which indicates that the user has opened a web page of a video on the video website.
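A minimal sketch of this URL check, using the example site strings above as an assumed preconfigured list:

```kotlin
// Illustrative site list taken from the examples above; a real system would
// ship its own preconfigured video-website formats.
val videoSitePatterns = listOf("youku.com", "aiqiyi.com", "tengxun.com")

fun isVideoWebsite(url: String): Boolean =
    videoSitePatterns.any { url.contains(it, ignoreCase = true) }

fun main() {
    // The youku address from the example contains the string "youku.com",
    // so it is classified as a video web page.
    println(isVideoWebsite("https://m.youku.com/alipay_video/id_adbd5cc3e8e64e668546.html"))  // true
}
```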
c1) The currently used App calls a video interface of the system, or the current display frame rate (also called screen frame rate) conforms to a preset frame rate.
It should be noted that, for a video application that calls the system video interface, the call itself indicates that the user is performing an operation such as playing a video through that application.
For a video application that does not call the system video interface, such as a microblog app, whether the user is about to enter or is already in a video scene can be judged from the current display frame rate. For example, if 30 frames per second is set as the preset video or animation frame rate, then a display frame rate stably output at 30 frames per second indicates that the user is using the electronic device to, for example, play a video.
The preset frame rate may be a specific frame rate value, or may be a frame rate value range, or an enumeration of frame rate values. The preset frame rate may be set according to actual conditions, and is not limited herein.
As a possible implementation, the display frame rate may be queried through an interface provided by the system. Optionally, the interface is located at the framework layer or another layer; this is not limited here.
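A sketch of the frame-rate check, under the assumption that the system exposes such a query interface; queryDisplayFrameRate and the preset rate list below are hypothetical stand-ins:

```kotlin
import kotlin.math.abs

// Hypothetical stand-in for the framework-layer query interface mentioned above;
// stubbed to a constant so the sketch runs as-is.
fun queryDisplayFrameRate(): Float = 30f

// The preset rate may be one value, a range, or an enumeration of values.
val presetVideoFrameRates = listOf(24f, 25f, 30f, 60f)

fun frameRateSuggestsVideo(toleranceFps: Float = 1f): Boolean {
    val fps = queryDisplayFrameRate()
    return presetVideoFrameRates.any { abs(it - fps) <= toleranceFps }
}
```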
d1) The interaction between the user and the App conforms to a preset rule.
The preset rule may be, for example but not limited to, sliding up and down and then staying on the current interface for a period of time. The length of this period may be set by the user or by the system.
e1) The screen display state is detected to be a full-screen display state or a landscape display state.
Detecting that the screen display state is a full-screen display state may include, but is not limited to, the following scenarios:
In one scenario, the electronic device is not currently in a full-screen display state. The electronic device detects an operation of the user switching the screen display state, for example, rotating the device by a certain angle from a portrait grip to a landscape grip, and in response switches the screen to the full-screen display state.
In another scenario, the user may be watching a video or performing another service while the electronic device is already in a full-screen display state, and the electronic device can query that it is currently displaying in full screen.
f1) Network traffic increases sharply.
It can be understood that when the user starts a video application to watch a video, network traffic usually rises sharply. Therefore, whether the user is in a video scene can be judged from whether the network traffic suddenly increases.
As one possible implementation, the system provides an interface for monitoring network traffic. Optionally, the interface is located at the framework layer, or at another layer of the system. Alternatively, the electronic device may monitor network traffic in other manners. The embodiment of the present application does not limit the specific implementation of detecting network traffic.
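As one possible illustration on Android, received-byte counts from the public TrafficStats API can be sampled periodically to detect a sudden rise in traffic; the sampling scheme and threshold below are assumptions, not part of the described method.

```kotlin
import android.net.TrafficStats

// Threshold and sampling period are illustrative assumptions.
class TrafficSpikeDetector(private val spikeBytesPerSecond: Long = 1_000_000) {
    private var lastBytes = TrafficStats.getTotalRxBytes()
    private var lastTimeMs = System.currentTimeMillis()

    // Call periodically; returns true when received traffic is rising sharply.
    fun sample(): Boolean {
        val nowBytes = TrafficStats.getTotalRxBytes()
        val nowMs = System.currentTimeMillis()
        val elapsedMs = (nowMs - lastTimeMs).coerceAtLeast(1)
        val bytesPerSecond = (nowBytes - lastBytes) * 1000 / elapsedMs
        lastBytes = nowBytes
        lastTimeMs = nowMs
        return bytesPerSecond >= spikeBytesPerSecond
    }
}
```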
Conditions a1) to f1) can be combined in various ways, and the electronic device does not need to check all of them.
It should be noted that the embodiment of the application does not limit the order in which the electronic device checks conditions a1) to f1): it may check whether several of the conditions are met at the same time, or check some conditions first and the others later.
Generally, checking one or more of the above conditions improves the accuracy of video scene recognition and excludes some non-video scenes that do not require the real-time caption function.
For example, although character sounds occur occasionally in a game scene, they are mostly common spoken expressions, so users usually have no need to convert the speech into text subtitles. The electronic device can therefore exclude the game scene by checking one or more of the conditions above.
For another example, when the electronic device detects an advertisement in a video, it does not convert the advertisement sound into text and does not form a corresponding subtitle.
Next, take an audio scene as an example. As a possible implementation, a preset condition may be set, and the electronic device determines that the user is about to enter, or is already in, an audio scene when the preset condition is met.
The preset condition includes, but is not limited to, one or more of the following:
a2) the App started by the user is in the audio application white list.
The system may be provided with different types of application white lists according to different policies. For example, white lists may be organized by application type, with audio applications classified into one application white list; for instance, the system adds applications such as Himalaya and radio apps to the audio-class application white list. Other methods of classifying application white lists are possible; the embodiment of the present application is not limited in this respect.
Optionally, in an audio scene, some sounds may be excluded: these sounds are not converted into text, and no corresponding subtitles are formed.
For example, when the user listens to a song, the song itself usually carries lyrics, so there is no need to form displayable real-time captions for the user. Specifically, the electronic device may recognize the character sounds in the song without converting them into text, thereby not forming corresponding real-time subtitles.
For another example, the electronic device may recognize an advertisement in the audio without converting the advertisement sound into a caption.
In combination with the above preset conditions for detecting a video scene or an audio scene, fig. 6 shows another exemplary flow of the subtitle display method provided by the embodiment of the present application. The subtitle display method further includes: S201, the electronic device detects whether a preset condition is met. If the preset condition is met, there is a high probability that the user is about to enter a video scene such as playing a video, or an audio scene such as playing audio, or is already in such a scene, and the subsequent steps continue to be executed; for example, step S102 shown in fig. 6 is triggered. If the preset condition is not met, the user has no need to play video or audio and real-time subtitles need not be displayed, and the electronic device may simply continue to detect sound.
It should be noted that fig. 6 takes as an example S201 (the electronic device detecting whether the preset condition is met) being executed after S101 and before S102. In other embodiments, S201 may be executed at other times; for example, whether the preset condition of the video or audio scene is met may be checked while detecting sound. Alternatively, S201 may be split into multiple sub-steps that are performed independently, with no limitation on the order between those sub-steps and S101-S107. For example, part of the preset conditions (such as whether a video application has been started and whether the display is currently full-screen) may be judged first, then sound detected, and then another part of the preset conditions (such as whether the display frame rate conforms to the preset rate, or whether the system video interface is called) judged. The embodiment of the present application limits neither the execution order between step S201 and the other steps, nor the number of sub-steps into which S201 can be decomposed, nor the execution order among those sub-steps.
Optionally, in the above embodiments, detecting the first language to which the current character sound belongs may mean: triggering detection of the first language to which the current character sound belongs only when the first preset condition is met. This reduces the power consumption caused by continuously detecting the language of human voices.
Optionally, detecting the character sound in the current audio may mean: triggering detection of the character sound in the current audio only when the first preset condition is met. This reduces the power consumption caused by continuously detecting and recognizing human voices.
Optionally, detecting audio may mean: triggering audio detection only when the first preset condition is met. This reduces the power consumption caused by continuous sound detection and recognition.
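The gating described in the last three paragraphs can be sketched as a single guarded pipeline. Every detector below is a hypothetical stub standing in for the corresponding model or system check, so the sketch runs as-is.

```kotlin
// Hypothetical stand-ins for the detectors and models described above.
fun presetSceneConditionMet(): Boolean = true          // S201: cheap scene check
fun detectSound(): Boolean = true                      // sound input/output present?
fun detectHumanVoice(): Boolean = true                 // is the sound a human voice?
fun detectLanguage(): String? = "en"                   // language of the voice, if recognizable
val supportedRecognitionLanguages = setOf("zh", "en")  // assumed supported set
fun promptUserToEnableCaptions(language: String) = println("offer captions for $language")

fun maybeOfferRealtimeCaptions() {
    if (!presetSceneConditionMet()) return   // not a video/audio scene: skip everything
    if (!detectSound()) return               // no sound: skip voice and language detection
    if (!detectHumanVoice()) return          // non-human sound: no subtitles needed
    val language = detectLanguage() ?: return
    if (language in supportedRecognitionLanguages) {
        promptUserToEnableCaptions(language) // S105: display the first interface
    }
}
```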
Illustratively, as shown in (1) of fig. 17, the mobile phone detects an operation of the user opening a video application. Then, as shown in (2) of fig. 17, the mobile phone detects that it is currently in a full-screen display state, which means the user is likely to play a video, so the mobile phone may trigger operations such as sound detection. As shown in (3) of fig. 17, when the mobile phone detects a sound, determines that the sound is a human voice, and detects the language of the voice, it may prompt the user to turn on the real-time subtitle function. As shown in (4) of fig. 17, when the user agrees to turn on the real-time subtitle function, the mobile phone displays the real-time subtitles.
Therefore, in the embodiment of the present application, it may first be determined whether the user is about to enter, or is already in, a video or audio scene, and only then are power-hungry operations such as language detection triggered. That is, steps such as language detection need not run all the time; they run only when the trigger condition is satisfied. The detection, recognition, and subtitle display times of the embodiment of the present application can be compared with the prior art in table 1. It is thus clear that, compared with the prior art, the technical solution of the embodiment of the present application has a short detection and recognition time, avoids or reduces unnecessary power consumption, and improves the battery life of the electronic device.
Exemplarily, as shown in table 1, in the embodiments of the present application, sound detection and recognition may be triggered only when a video scene is detected; compared with the prior-art solution shown in table 1, the technical solution of the embodiments of the present application has a shorter detection time and lower detection power consumption. In addition, in the embodiment of the application, subtitles are displayed only when the character sound is in a language supported by the speech recognition of the electronic device, which reduces the probability of subtitle display errors.
In the embodiment of the application, the setting items related to real-time subtitles can be configured. These setting items include, but are not limited to: sound source, sound source language, and subtitle rule. The subtitle rule specifies the language of the subtitles to be displayed.
Due to space limitations, the real-time caption related setting items are not exhaustive.
Alternatively, the real-time subtitle related setting item may be set by a user or by default.
Illustratively, as shown in (2) of fig. 3, a control 304 is displayed on the mobile phone, and the control 304 is used for setting the real-time caption function. The mobile phone detects an operation such as clicking on the control 304 by the user, and displays the setting interface 305 shown in (1) of fig. 7. The user can set one or more of sound source, sound source language, and subtitle rule through the setting interface 305.
Taking the user setting the sound source as an example, as shown in (1) of fig. 7, when an operation of the user such as clicking the control 308 is detected, the mobile phone displays the setting interface shown in (2) of fig. 7, through which the user can select the sound source. The sound source is the sound to be converted into real-time subtitles.
Alternatively, the sound source may be the media sound played by the mobile phone, the sound input to the mobile phone (such as the sound captured when recording a video), or another sound such as microphone sound. In (2) of fig. 7, the sound source is the media sound, and the selected-state icon indicates that the media sound is selected, that is, the media sound switch is turned on.
Fig. 7 illustrates an example in which setting options such as media sound and microphone are not in the same interface as the real-time subtitle setting interface; the setting interface shown in fig. 7 is exemplary. In other embodiments, the setting options such as media sound and microphone shown in (2) of fig. 7 may also be placed in the real-time subtitle setting interface shown in (1) of fig. 7.
Taking the user setting the sound source language as an example, as shown in (1) of fig. 8, when an operation such as the user clicking the control 306 is detected, the mobile phone displays the setting interface shown in (2) of fig. 8, through which the user can select the sound source language. Optionally, as shown in (2) of fig. 8, the sound source language may be a language set by the user, such as Chinese or English, or the current language detected by the system (i.e., the language of the human voice in the audio currently detected by the system).
Optionally, the mobile phone may detect the language of the voice, convert the voice into real-time subtitles based on that language and the subtitle rule, and display the real-time subtitles to the user. The subtitle rule may be: display subtitles in the first language; display subtitles in a second language (different from the first language); or display subtitles that include both the first and the second language. The second language includes a language set by the user or by the first terminal. For example, the second language may be a language set by the user, such as Chinese or English; a user-preferred language automatically detected by the first terminal; or the system default language (i.e., the language currently used by the electronic device). The time at which the user or the first terminal sets the second language is not limited.
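A minimal sketch of the three subtitle rules just listed, with a stubbed translation hook; all type and function names are illustrative assumptions:

```kotlin
sealed interface SubtitleRule {
    object SourceLanguage : SubtitleRule                     // subtitles in the first language
    data class Translated(val second: String) : SubtitleRule // subtitles in a second language
    data class Bilingual(val second: String) : SubtitleRule  // both languages
}

fun translate(text: String, target: String): String = "[$target] $text" // hypothetical hook

// recognizedText is the speech-recognition output in the first (spoken) language.
fun renderSubtitleLines(recognizedText: String, rule: SubtitleRule): List<String> = when (rule) {
    SubtitleRule.SourceLanguage -> listOf(recognizedText)
    is SubtitleRule.Translated  -> listOf(translate(recognizedText, rule.second))
    is SubtitleRule.Bilingual   -> listOf(recognizedText, translate(recognizedText, rule.second))
}
```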
Taking the setting of the subtitle rule by the user as an example, as shown in (1) of fig. 9, when detecting the operation of the user such as clicking on the control 307, the mobile phone displays a setting interface 301 (i.e., a second interface) shown in (2) of fig. 9, through which the user can set the subtitle rule.
In fig. 9 (2), the "type of other language" menu is a submenu of the "display other language than the current language" switch. As shown in (2) of fig. 9, the "type of other language" menu may be located in the same interface as the "display other language than the current language" switch. In other embodiments, the "type of other language" menu may be provided in a different interface than the "display other language than the current language" switch.
The "type of other language" menu may include one or more options. For example, languages such as chinese and english may be included. Options such as "user preferred language", system detection, etc. may also be included.
Take the subtitle rule shown in (1) of fig. 10, that is, the language of the real-time subtitles is the same as the language of the voice (the subtitle language is the current voice language detected by the system). As shown in (2) and (3) of fig. 10, when the voice in the currently played video is detected to be Chinese, the mobile phone converts the voice into a Chinese subtitle 701 and displays it on the screen, so that the user can view Chinese real-time subtitles. In this way, subtitles in the original language of the sound can be provided to the user as faithfully as possible.
In other embodiments, the user may also set the real-time subtitle language to differ from the source language of the sound; in other words, a voice in one language can be translated into real-time subtitles in a different language. Take the subtitle rule shown in (1) of fig. 11, that is, the real-time subtitles are set to English. As shown in (2) and (3) of fig. 11, when the mobile phone detects that the voice in the currently played video is Chinese, it converts the voice into an English subtitle 801 according to the setting and displays it on the screen, so that the user can view English real-time subtitles. In this way, the voice in the video can be translated into a language the user needs or finds easy to understand, improving the viewing experience.
Optionally, the video mentioned in the embodiment of the present application may have embedded subtitles, or the video may not have subtitles.
In other embodiments, the mobile phone may also determine a user-preferred language based on a user profile. The mobile phone may gather user-related information to build the user profile and determine the preferred language from it. The user-preferred language may be a language the user has used frequently over a recent period of time. For example, for a user who is learning English, English may be a long-used language; the user may, for instance, often watch videos with English audio or English subtitles over a period of time. English is then that user's preferred language.
Alternatively, the user-preferred language may be a language in which the user does not play media but which, according to the profile, the user otherwise uses with high frequency. For example, although the user does not often play English videos, the user frequently searches for English-related material and often uses English-related applications (e.g., Youdao Dictionary, Kingsoft Dictionary, etc.).
Illustratively, the user may set the language of the real-time subtitles to the preferred language. Then, when the mobile phone detects that the voice in the video is English, it converts the voice into subtitles in the preferred language (assume Japanese). In this way, the user can watch real-time subtitles in the preferred language, which meets the user's viewing needs and improves the viewing experience.
In other embodiments, the mobile phone may also determine different preferred languages for different scenes; that is, the preferred language is not fixed and may be updated, and the voice is converted into real-time subtitles in the corresponding preferred language to meet the user's viewing needs. In other words, the language of the real-time subtitles may be adaptively adjusted according to user preferences.
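As a hedged illustration, a preferred language could be derived by weighting language-related usage events (a video watched, a search made, a dictionary app opened); the event model and weights below are assumptions for illustration only:

```kotlin
// Each event pairs a language with an assumed weight.
fun preferredLanguage(events: List<Pair<String, Int>>): String? =
    events.groupBy({ it.first }, { it.second })
        .entries
        .maxByOrNull { (_, weights) -> weights.sum() }
        ?.key

fun main() {
    val usage = listOf("en" to 3, "en" to 5, "ja" to 2)
    println(preferredLanguage(usage)) // prints "en"
}
```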
In other embodiments, before displaying real-time subtitles in a certain language, the mobile phone may prompt the user whether to display them in that language. If the user chooses that language, the mobile phone displays real-time subtitles in it; otherwise, the mobile phone displays subtitles in the default language. That is, before the real-time subtitles are displayed, the user may select the language in which they are to be displayed, or the mobile phone may recommend a subtitle language to the user.
For example, as shown in (1) and (2) of fig. 13, after the mobile phone detects a video including a human voice, it may display a control 1001 asking the user about the language of the real-time subtitles. When an operation of the user, such as clicking the control 1001, is detected, the mobile phone displays the real-time subtitle 901 in the recommended language, i.e., English. When an operation such as the user double-clicking the control 1001 is detected, the mobile phone displays real-time subtitles in the default language, which may be set by the user or by the system. Therefore, under different user operations, the mobile phone can display real-time subtitles in different languages; the subtitle language is not limited to one preset fixed language, and the display manner is more flexible.
For another example, as shown in (1) of fig. 14, after the mobile phone detects a video including a human voice, it may display a control 1003 prompting the user to select the language of the real-time subtitles. When an operation of the user such as clicking the control 1003 is detected, the mobile phone displays the interface 311 (i.e., the second interface) shown in (2) of fig. 14, which includes a control 1002. When the user selects a subtitle language (for example, English) by an operation such as clicking the control 1002, the mobile phone displays the English real-time subtitles shown in (3) of fig. 14.
In the embodiment of the present application, the interfaces that can be used to set the subtitle language are collectively referred to as the second interface; this covers interfaces at various levels used for setting the subtitle language. For example, still taking fig. 14 as an example, the control 1002 may be configured with hidden language options: when the user clicks the control 1002, specific options such as Chinese, English, or more languages pop up. The popped-up language options may be regarded as the next-level interface of the control 1002.
Optionally, as shown in fig. 14, after the user sets the subtitle language (for example, English), the mobile phone may remember this selection; the next time, it may directly use the previously selected language for the subtitles, without prompting the user to set the subtitle language again.
Optionally, the mobile phone may further ask the user whether to remember the subtitle language set this time; if the user chooses to remember it, the mobile phone uses that subtitle language for the next subtitle display.
Figs. 13, 14, 15, 16, and so on are merely exemplary interfaces. Besides displaying the control 1001, the control 1002, and the control 1003, the mobile phone may also prompt the user to choose, or recommend to the user, the language in which the real-time subtitles are displayed through other interfaces. For example, the user may be prompted to select the real-time subtitle language in a drop-down menu or a drop-down list.
In other embodiments, a "bilingual" like real-time caption may also be provided. Meaning that the handset displays two or more real-time subtitles, optionally, in one design, one subtitle displayed by the handset is in the audio source language and one subtitle is in the language selected by the user (or system default or system configuration). Illustratively, referring to (1) of fig. 12, the user sets the language of the real-time subtitles to be an overlay of the source language of the sound and the selected language. Then, when the mobile phone recognizes that the voice in the video is chinese as shown in (2) of fig. 12, the chinese subtitle shown in (3) of fig. 12 and the english subtitle preset by the user may be displayed. Therefore, the user can watch the real-time subtitles of multiple languages, and the user can more intuitively understand the meaning of the played video through the subtitles.
In still other embodiments, the "bilingual" real-time subtitles may further include subtitles in the source language of the sound and subtitles in the user-preferred language.
The subtitle display method provided by the embodiment of the application has been introduced mainly using the example of a user watching a video; the technical solution of the embodiment of the application can also be used in scenarios such as live streaming and video recording. Take a video recording scenario as an example: the user starts the camera application and enables the video recording function. As shown in (1) of fig. 18, the user can start recording and rotate the mobile phone to switch the screen to the landscape display shown in (2) of fig. 18. As shown in (2) of fig. 18, when a human voice is detected, the language of the voice is recognizable (for example, Chinese), and the current frame rate is detected to be stable at the preset video frame rate, the mobile phone prompts the user to turn on the real-time caption function. In response to the user agreeing to turn on the real-time subtitle function, the mobile phone displays the Chinese subtitles shown in (3) of fig. 18.
In other embodiments, the embodiment of the present application may also be applied to video and audio scenarios on a large-screen device, such as a screen-casting scenario across devices: for example, the subtitles that the first terminal is about to display, or is displaying, are cast onto a large-screen second terminal for display. Because the screen of the large-screen device is relatively large, this helps the user see clearer details of the video picture.
Illustratively, as shown in (1) of fig. 19, the mobile phone detects a media file including a character sound and detects that a networked large-screen device (such as, but not limited to, a smart screen) currently exists, and the mobile phone displays an interface 1902 prompting the user whether to display the real-time subtitles on the smart screen. The interface 1902 includes a control 1901. When the mobile phone detects that the user agrees to display the real-time subtitles on the smart screen (i.e., a second operation), for example, by clicking the control 1901, the mobile phone sends an instruction to the smart screen (i.e., the second terminal) instructing it to display the real-time subtitles. In this embodiment, the mobile phone may call related hardware resources of the smart screen, for example, calling the display screen of the smart screen to show the real-time subtitles shown in (2) of fig. 19.
The control 1901 is merely an example. The control 1901 may also be a control such as "a networked smart screen is detected; display real-time subtitles on the smart screen?", or "a networked smart screen is detected; tap Open to display real-time subtitles on the smart screen and set the subtitle language", and so on. The embodiments of the present application do not limit this.
In other embodiments, the subtitle language used after screen casting can be selected in a setting interface. For example, the system setting interface may contain a screen-casting subtitle language option used to set the subtitle language on the large-screen device. Alternatively, after the user is prompted to cast the displayed subtitles, a screen-casting subtitle language may be recommended to the user or selected by the user. The embodiment of the present application does not limit the manner of setting the screen-casting subtitle language.
Optionally, the position, size, and other attributes of the displayed subtitles may be changed. For example, still referring to fig. 19, the user may move the subtitle position in a direction such as the dashed arrow through an air gesture operation.
Optionally, subtitles for different objects may also be distinguished, for example in a multiparty communication scenario. As shown in (1) and (2) of fig. 20, multiple subtitle windows are displayed on the large-screen device; the positions of the subtitle windows are not limited. Different subtitle windows may correspond to different communication objects, including the user of the first terminal.
Optionally, different subtitle windows have different user interface (UI) effects. For example, the large-screen device displays the subtitle window of the party who is speaking prominently, while the window of a silent party is faded, hidden, or not displayed. As another example, different subtitle windows may use different colors, with the speaker's window more vivid to draw the user's attention. As yet another example, the outer border of the speaker's subtitle window may blink dynamically.
Optionally, the positions of different caption windows are different, and the position of each caption window is close to the avatar or the picture of the communication object corresponding to the caption window. Illustratively, referring to fig. 21 (1) and 21 (2), Bob's caption window is close to Bob's avatar or picture. The caption window of Andy is close to the avatar or the picture of Andy.
Optionally, the first terminal (or the small-screen device) may also have UI effects or subtitle-window positions similar to those of the second terminal. That is, in a multiparty communication scenario, the subtitle display area of the first terminal is divided into one or more subtitle windows, different subtitle windows corresponding to different communication objects, and the method further includes at least one of the following (see the sketch after this list):
different subtitle windows have different UI effects; or the positions of different subtitle windows differ, with each window placed close to the avatar or picture of the communication object it corresponds to.
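A sketch of the per-window state and UI effects in such a multiparty scenario; the effect descriptions and field names are assumptions for illustration only:

```kotlin
data class SubtitleWindow(
    val participant: String,   // the communication object this window belongs to
    val nearAvatarOf: String,  // the window is placed next to this participant's avatar
    var speaking: Boolean = false
)

fun uiEffectFor(w: SubtitleWindow): String =
    if (w.speaking) "highlighted colors, blinking outer border"
    else "faded, hidden, or not displayed"

fun main() {
    val windows = listOf(
        SubtitleWindow(participant = "Bob", nearAvatarOf = "Bob", speaking = true),
        SubtitleWindow(participant = "Andy", nearAvatarOf = "Andy")
    )
    windows.forEach { println("${it.participant}: ${uiEffectFor(it)}") }
}
```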
Optionally, the cross-device scenario may also be an in-vehicle scenario, in which the mobile phone displays its screen on the display of the in-vehicle device (head unit). The cross-device scenario may also be another scenario.
To implement the above cross-device scenarios, a registration procedure may be completed in advance between the mobile phone and the other devices. For example, take the in-vehicle device and the mobile phone realizing cross-device display of real-time subtitles: the related hardware of the in-vehicle device must be registered with the mobile phone in advance, and the mobile phone forms a virtual driver for that hardware. Subsequently, the mobile phone can call the related hardware through the virtual driver to complete the corresponding functions. For example, the mobile phone registers the display information of the in-vehicle device and forms a virtual driver corresponding to that display, so that the mobile phone can call the in-vehicle display through the virtual driver to implement display functions such as real-time subtitle display.
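A heavily simplified sketch of this registration flow; all types and names are hypothetical, and a real system would go through the platform's actual device-virtualization mechanism:

```kotlin
interface DisplayDriver {
    fun show(subtitle: String)
}

class VirtualDisplayDriver(private val deviceName: String) : DisplayDriver {
    override fun show(subtitle: String) {
        // A real driver would forward the subtitle to the peer device here.
        println("[$deviceName] $subtitle")
    }
}

val registeredDrivers = mutableMapOf<String, DisplayDriver>()

// Registering the peer display creates the corresponding virtual driver.
fun registerPeerDisplay(deviceName: String) {
    registeredDrivers[deviceName] = VirtualDisplayDriver(deviceName)
}

fun main() {
    registerPeerDisplay("car-head-unit")
    registeredDrivers["car-head-unit"]?.show("Hello")  // rendered on the peer display
}
```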
It will be appreciated that, in order to implement the above functions, the electronic device includes corresponding hardware and/or software modules for performing each function. The present application can be implemented in hardware, or in a combination of hardware and computer software, in conjunction with the exemplary algorithm steps described with the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In this embodiment, the electronic device may be divided into functional modules according to the above method examples; for example, each function may be assigned its own functional module, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and only a logical function division; other division manners are possible in actual implementation.
As shown in fig. 22, an embodiment of the present application discloses a schematic structural diagram of an electronic device. The electronic device 1800 is operable to implement the methods described in the method embodiments above. Illustratively, the electronic device 1800 may specifically include: a processing unit 1801, a display unit 1802;
the processing unit 1801 is configured to support the electronic device 1800 to perform steps S101-S104 and S107 in fig. 4. And/or the processing unit 1801 is configured to support the electronic device 1800 for performing step S201 in fig. 6. And/or the processing unit 1801 is further configured to support the electronic device 1800 for performing other steps performed by the electronic device in this embodiment of the application. The display unit 1802 is used to support the electronic apparatus 1800 in performing the subtitle display function in steps S105, S106 in fig. 4. And/or the display unit 1802 may also be used to support the electronic device 1800 for performing other steps performed by the electronic device in embodiments of the present application.
Optionally, the electronic device 1800 shown in fig. 22 may further include a communication unit 1803, where the communication unit 1803 is configured to support the electronic device 1800 to perform a step of communication between the electronic device and another electronic device in this embodiment, for example, support the first terminal to send an instruction to the second terminal.
Optionally, the electronic device 1800 shown in fig. 22 may further include a storage unit (not shown in fig. 22) that stores programs or instructions. The program or instructions, when executed by the processing unit 1801, enable the electronic device 1800, shown in fig. 22, to perform the method for subtitle display, shown in fig. 4 and 6.
Technical effects of the electronic device 1800 shown in fig. 22 can refer to technical effects of the subtitle display method shown in fig. 4 and 6, which are not described herein again.
The processing unit 1801 involved in the electronic device 1800 shown in fig. 22 may be implemented by a processor or a processor-related circuit component, and may be a processor or a processing module. The communication unit 1803 may be implemented by a transceiver or transceiver-related circuit component, and may be a transceiver or transceiver module. The display unit 1802 may be implemented by display screen related components, which may include a display screen.
The embodiment of the present application further provides a chip system, as shown in fig. 23, which includes at least one processor 1601 and at least one interface circuit 1602. The processor 1601 and the interface circuit 1602 may be interconnected by a line. For example, the interface circuit 1602 may be used to receive signals from other devices. Also for example, the interface circuit 1602 may be used to send signals to other devices, such as the processor 1601. Illustratively, the interface circuit 1602 may read instructions stored in memory and send the instructions to the processor 1601. The instructions, when executed by the processor 1601, may cause the electronic device to perform the various steps performed by the electronic device 100 (e.g., the first terminal) in the embodiments described above. Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Optionally, the system on a chip may have one or more processors. The processor may be implemented by hardware or by software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory.
Optionally, the memory in the system-on-chip may also be one or more. The memory may be integrated with the processor or may be separate from the processor, which is not limited in this application. For example, the memory may be a non-transitory processor, such as a read only memory ROM, which may be integrated with the processor on the same chip or separately disposed on different chips, and the type of the memory and the arrangement of the memory and the processor are not particularly limited in this application.
The system-on-chip may be, for example, a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a system on chip (SoC), a Central Processing Unit (CPU), a Network Processor (NP), a digital signal processing circuit (DSP), a Microcontroller (MCU), a Programmable Logic Device (PLD), or other integrated chips.
It will be appreciated that the steps of the above described method embodiments may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The embodiment of the present application provides a computer-readable storage medium, on which a computer program or an instruction is stored, and when the computer program or the instruction runs on a computer, the computer is caused to execute the subtitle display method according to the above method embodiment.
An embodiment of the present application provides a computer program product, including: computer program or instructions which, when run on a computer, cause the computer to perform the subtitle display method as described in the above method embodiments.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a component or a module, and may include a processor and a memory connected to each other; the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored in the memory, so that the device can execute the subtitle display method in the above-mentioned method embodiments.
In addition, the terminal device, the computer-readable storage medium, the computer program product, or the chip provided in the embodiments of the present application are all configured to execute the corresponding method provided above, so that the beneficial effects achieved by the terminal device, the computer-readable storage medium, the computer program product, or the chip may refer to the beneficial effects in the corresponding method provided above, and are not described herein again.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the embodiments provided in the present application, it should be understood that the disclosed method can be implemented in other ways. The embodiments may be combined with each other or referenced to each other without conflict. The above-described terminal device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical function division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of modules or units through some interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes: flash memory, removable hard drive, read-only memory, random access memory, magnetic disk, optical disc, and the like.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A method for displaying subtitles, applied to a first terminal, comprises the following steps:
detecting whether the first terminal has voice input or voice output, and detecting whether the current voice comprises character voice under the condition that the voice input or the voice output is detected;
under the condition that the current voice is detected to include the character voice, detecting a first language to which the current character voice belongs;
displaying a first interface, wherein the first interface is used for prompting a user to start a subtitle function;
receiving a first operation for starting a subtitle function input by the user;
and displaying a subtitle after receiving the first operation, wherein the subtitle is obtained by converting the voice of the character in the first language.
2. The method of claim 1, wherein prior to said displaying the first interface, the method further comprises: judging whether the first language belongs to one of preset type languages, wherein the preset type language is a language which can support voice recognition by the subtitle function of the first terminal, or the preset type language is a language which is excluded from the languages currently used by the first terminal in the languages which can support voice recognition by the subtitle function of the first terminal, or the preset type language is a language which is excluded from the native language of the first terminal user in the languages which can support voice recognition by the subtitle function of the first terminal.
3. The method of claim 2, wherein displaying the first interface comprises: displaying the first interface under the condition that the first language belongs to one of the preset type languages and the language currently used by the first terminal is different from the first language, wherein the first interface is used for prompting a user whether to start a subtitle function and whether to select a language as a subtitle of the language currently used by the first terminal;
the receiving of the first operation for starting a subtitle function input by the user comprises: receiving a first operation which is input by the user and used for starting a subtitle function and selecting a subtitle with a language currently used by the first terminal;
the displaying the subtitles includes: the subtitles include subtitles of a language currently used by the first terminal.
4. The method of claim 2,
the displaying a first interface includes: and displaying the first interface under the condition that the first language is judged to belong to one of the preset type languages.
5. The method of claim 4, wherein prior to said displaying subtitles, the method further comprises: after receiving the first operation, confirming whether the language currently used by the first terminal is the same as the first language, and if the language currently used by the first terminal is different from the first language, prompting a user whether to select a caption language as the language currently used by the first terminal;
the displaying the subtitles includes: and displaying the caption when receiving the operation which is input by the user and used for selecting the caption language to be the language currently used by the first terminal, wherein the language of the caption is the language currently used by the first terminal.
6. The method of claim 5, wherein the subtitle obtained by converting the character voice in the first language is one of the following:
the subtitle is a subtitle in the first language;
or, the subtitle is a subtitle in a second language, the second language being different from the first language;
or, the subtitle is a multi-language subtitle comprising at least the first language and the second language.
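Claims 6 and 7 allow the rendered subtitle to be monolingual in the first or second language, or bilingual. A minimal sketch follows, where translate is an illustrative stand-in for a machine-translation step, not a component named by this application.

```python
# Hypothetical sketch of the subtitle composition options in claim 6.
# `translate` is an illustrative stub for a machine-translation step.

def translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"   # stub translation

def compose_subtitle(text: str, first: str, second: str, mode: str) -> str:
    if mode == "first":                    # subtitle in the first language
        return text
    if mode == "second":                   # subtitle in the second language
        return translate(text, second)
    # bilingual subtitle containing at least the first and second languages
    return f"{text}\n{translate(text, second)}"

print(compose_subtitle("hello", first="en", second="zh", mode="both"))
```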
7. The method of claim 6, wherein the second language comprises a language set by the user or by the first terminal.
8. The method of claim 7, further comprising:
displaying a second interface, wherein the second interface is used for prompting the user to set a subtitle language.
9. The method of claim 1, wherein, before the detecting of the first language to which the current character voice belongs, the method further comprises:
detecting whether a first preset condition is met, wherein the first preset condition comprises any one or a combination of more than one of the following: a video-type application is started; a web page whose address matches the URL format of a video website is opened; the started application calls a video interface; the current display frame rate matches a preset frame rate; the interaction between the started application and the user meets a preset rule; the screen is in a full-screen or landscape display state; network traffic increases suddenly; or an audio-type application is started.
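Because the first preset condition is met by any one or a combination of these signals, it reduces to a disjunction over independent detectors. A hypothetical sketch, with every detector stubbed and the signal names invented for illustration:

```python
# Hypothetical sketch of claim 9's first preset condition: the condition
# holds if ANY enabled signal detector fires. All detectors are stubs.

SIGNALS = {
    "video_app_started":        lambda: False,
    "video_site_url_opened":    lambda: False,
    "video_interface_called":   lambda: True,
    "frame_rate_matches":       lambda: False,
    "interaction_rule_met":     lambda: False,
    "fullscreen_or_landscape":  lambda: False,
    "traffic_surge":            lambda: False,
    "audio_app_started":        lambda: False,
}

def first_preset_condition(enabled: set[str]) -> bool:
    return any(SIGNALS[name]() for name in enabled)

print(first_preset_condition({"video_app_started", "video_interface_called"}))  # True
```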
10. The method of claim 9, wherein the detecting of the first language to which the current character voice belongs comprises: triggering the detection of the first language to which the current character voice belongs in a case that the first preset condition is met.
11. The method of claim 9, wherein the detecting of whether the current sound comprises the character voice comprises: triggering the detection of whether the current sound comprises the character voice in a case that the first preset condition is met.
12. The method of claim 9, wherein the detecting of whether the first terminal has sound input or sound output comprises: triggering the detection of whether the first terminal has sound input or sound output in a case that the first preset condition is met.
13. The method of claim 1, further comprising: sending an instruction to a second terminal, wherein the instruction is used for instructing the second terminal to display the subtitle.
14. The method of claim 13, further comprising:
displaying a third interface, wherein the third interface is used for prompting the user whether to display the subtitle on the second terminal;
wherein the sending of the instruction to the second terminal comprises: sending the instruction to the second terminal in a case that a second operation input by the user is detected, the second operation indicating that the subtitle is to be displayed on the second terminal.
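Claims 13 and 14 gate the cross-device instruction on the user's second operation. A minimal sketch, where send stands in for whatever transport the terminals actually use; both helpers are hypothetical.

```python
# Hypothetical sketch of claims 13-14: the first terminal sends a
# display-subtitle instruction to the second terminal only after the
# user confirms via the third interface. `send` is an illustrative stub.

def send(target: str, payload: dict) -> None:
    print(f"to {target}: {payload}")   # stub for the real transport

def third_interface_confirmed() -> bool:
    return True   # stub: the user's second operation

if third_interface_confirmed():
    send("second_terminal", {"cmd": "display_subtitle", "text": "hello"})
```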
15. The method of claim 13, wherein, in a multi-party communication scenario, the subtitle display area of the second terminal is divided into one or more subtitle windows, different subtitle windows corresponding to different communication objects, the different communication objects comprising the user, and the method further comprises at least one of the following:
different subtitle windows have different user interface (UI) effects;
different subtitle windows are located at different positions, the position of each subtitle window being close to the avatar or picture of the communication object corresponding to that subtitle window.
16. The method of claim 1, wherein, in a multi-party communication scenario, the subtitle display area of the first terminal is divided into one or more subtitle windows, different subtitle windows corresponding to different communication objects, the different communication objects comprising the user, and the method further comprises at least one of the following:
different subtitle windows have different UI effects;
different subtitle windows are located at different positions, the position of each subtitle window being close to the avatar or picture of the communication object corresponding to that subtitle window.
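Claims 15 and 16 describe the same per-speaker window layout on the second and first terminal respectively: one window per communication object, placed near that object's avatar, each with a distinct UI effect. A hypothetical sketch of the assignment, with purely illustrative coordinates and styles:

```python
# Hypothetical sketch of the per-speaker subtitle windows in claims 15-16.
# Speaker names, coordinates, and colors are illustrative only.

from dataclasses import dataclass

@dataclass
class SubtitleWindow:
    speaker: str
    position: tuple[int, int]   # near the speaker's avatar or picture
    color: str                  # distinct UI effect per window

AVATAR_POSITIONS = {"user": (40, 900), "peer_a": (40, 120), "peer_b": (640, 120)}
COLORS = ["white", "yellow", "cyan"]

def layout_windows(speakers: list[str]) -> list[SubtitleWindow]:
    return [
        SubtitleWindow(s, AVATAR_POSITIONS[s], COLORS[i % len(COLORS)])
        for i, s in enumerate(speakers)
    ]

for w in layout_windows(["user", "peer_a", "peer_b"]):
    print(w)
```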
17. The method of claim 1, further comprising: closing the subtitle function and stopping displaying the subtitle in a case that a second preset condition is met;
wherein the second preset condition comprises any one or a combination of more than one of the following: no sound is detected within a first time period; an operation, input by the user, instructing to close the subtitle function is detected; the video-type application is closed; the web page matching the URL format of the video website is closed; the display frame rate within a second time period does not match the preset frame rate; or the audio-type application is closed.
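The second preset condition mirrors the structure of claim 9's first preset condition: any one of the listed stop signals suffices to close the subtitle function. A minimal sketch under the same stubbed-detector assumption, with invented signal names:

```python
# Hypothetical sketch of claim 17's second preset condition; as with the
# first preset condition, any enabled stop signal suffices. Stubs only.

STOP_SIGNALS = {
    "silent_for_first_period":   lambda: False,
    "user_requested_close":      lambda: True,
    "video_app_closed":          lambda: False,
    "video_site_page_closed":    lambda: False,
    "frame_rate_mismatch":       lambda: False,
    "audio_app_closed":          lambda: False,
}

def should_close_subtitles(enabled: set[str]) -> bool:
    return any(STOP_SIGNALS[name]() for name in enabled)

if should_close_subtitles(set(STOP_SIGNALS)):
    print("closing subtitle function and hiding subtitles")
```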
18. An electronic device, comprising: a processor, a memory, and a display screen, the memory and the display screen being coupled to the processor, the memory being configured to store computer program code, the computer program code comprising computer instructions that, when read from the memory and executed by the processor, cause the electronic device to perform the method for displaying subtitles according to any one of claims 1-17.
19. A computer-readable storage medium having stored thereon a computer program or instructions which, when run on a computer, cause the computer to perform the method for displaying subtitles according to any one of claims 1-17.
20. A computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method for displaying subtitles according to any one of claims 1-17.
CN202110264894.5A 2021-03-11 2021-03-11 Method for displaying subtitles and electronic equipment Pending CN112684967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264894.5A CN112684967A (en) 2021-03-11 2021-03-11 Method for displaying subtitles and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264894.5A CN112684967A (en) 2021-03-11 2021-03-11 Method for displaying subtitles and electronic equipment

Publications (1)

Publication Number Publication Date
CN112684967A true CN112684967A (en) 2021-04-20

Family

ID=75458399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264894.5A Pending CN112684967A (en) 2021-03-11 2021-03-11 Method for displaying subtitles and electronic equipment

Country Status (1)

Country Link
CN (1) CN112684967A (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209227A (en) * 2010-03-30 2011-10-05 宝利通公司 Method and system for adding translation in a videoconference
CN103685985A (en) * 2012-09-17 2014-03-26 联想(北京)有限公司 Communication method, transmitting device, receiving device, voice processing equipment and terminal equipment
US20150363091A1 (en) * 2014-06-12 2015-12-17 Samsung Electronics Co., Ltd. Electronic device and method of controlling same
CN104778226A (en) * 2015-03-26 2015-07-15 小米科技有限责任公司 Webpage content item shielding method and webpage content item shielding device
CN105848005A (en) * 2016-03-28 2016-08-10 乐视控股(北京)有限公司 Video subtitle display method and video subtitle display device
CN106412710A (en) * 2016-09-13 2017-02-15 北京小米移动软件有限公司 Method and device for exchanging information through graphical label in live video streaming
CN107527623A (en) * 2017-08-07 2017-12-29 广州视源电子科技股份有限公司 Screen transmission method, device, electronic equipment and computer-readable recording medium
CN108289244A (en) * 2017-12-28 2018-07-17 努比亚技术有限公司 Video caption processing method, mobile terminal and computer readable storage medium
CN108418791A (en) * 2018-01-27 2018-08-17 惠州Tcl移动通信有限公司 Communication means and mobile terminal with addition caption function
CN110619126A (en) * 2018-06-20 2019-12-27 钉钉控股(开曼)有限公司 Message processing method and device and terminal equipment
CN108683955A (en) * 2018-08-01 2018-10-19 海信电子科技(深圳)有限公司 Caption presentation method and device
CN109218803A (en) * 2018-09-28 2019-01-15 Oppo广东移动通信有限公司 Video source modeling control method, device and electronic equipment
CN110035326A (en) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment
CN111709253A (en) * 2020-05-26 2020-09-25 珠海九松科技有限公司 AI translation method and system for automatically converting dialect into caption
CN111757187A (en) * 2020-07-07 2020-10-09 深圳市九洲电器有限公司 Multi-language subtitle display method, device, terminal equipment and storage medium
CN112383809A (en) * 2020-11-03 2021-02-19 Tcl海外电子(惠州)有限公司 Subtitle display method, device and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360053A (en) * 2021-05-28 2021-09-07 富途网络科技(深圳)有限公司 Display control method, electronic device, and storage medium
CN115442667A (en) * 2021-06-01 2022-12-06 脸萌有限公司 Video processing method and device
CN115442667B (en) * 2021-06-01 2023-10-20 脸萌有限公司 Video processing method and device
CN113349462A (en) * 2021-06-07 2021-09-07 深圳市讴可电子科技有限公司 Interaction device of electronic cigarette, interaction method and storage medium
US20230046440A1 (en) * 2021-08-11 2023-02-16 Lemon Inc. Video playback method and device
CN114120211A (en) * 2022-01-29 2022-03-01 荣耀终端有限公司 Message pushing method and device and storage medium
CN114745585A (en) * 2022-04-06 2022-07-12 Oppo广东移动通信有限公司 Subtitle display method, device, terminal and storage medium
CN114942805A (en) * 2022-04-15 2022-08-26 中电科航空电子有限公司 Multi-language passenger cabin system interface display method and system, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN112684967A (en) Method for displaying subtitles and electronic equipment
US11450322B2 (en) Speech control method and electronic device
US20220223154A1 (en) Voice interaction method and apparatus
CN110138959B (en) Method for displaying prompt of human-computer interaction instruction and electronic equipment
US20220147207A1 (en) Application Quick Start Method and Related Apparatus
CN114466102B (en) Method for displaying application interface, related device and traffic information display system
CN111970401B (en) Call content processing method, electronic equipment and storage medium
CN110855826A (en) Atomic service presentation method and device
CN108962220A (en) Multimedia file plays the text display method and device under scene
US11290661B2 (en) Subtitle presentation based on volume control
JP7234379B2 (en) Methods and associated devices for accessing networks by smart home devices
US11908489B2 (en) Tap to advance by subtitles
US20230410806A1 (en) Electronic device control method and apparatus
EP4354270A1 (en) Service recommendation method and electronic device
WO2022222688A1 (en) Window control method and device
CN115171073A (en) Vehicle searching method and device and electronic equipment
WO2022088963A1 (en) Method and apparatus for unlocking electronic device
CN112000408B (en) Mobile terminal and display method thereof
CN114244955B (en) Service sharing method and system, electronic device and computer readable storage medium
CN114844984A (en) Notification message reminding method and electronic equipment
CN116339568A (en) Screen display method and electronic equipment
CN113079400A (en) Display device, server and voice interaction method
CN112328135A (en) Mobile terminal and application interface display method thereof
CN111787157A (en) Mobile terminal and operation response method thereof
CN112162675A (en) Mobile terminal and widget display method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210420)