WO2021057957A1 - Video call method and apparatus, computer device and storage medium - Google Patents

Video call method and apparatus, computer device and storage medium Download PDF

Info

Publication number
WO2021057957A1
WO2021057957A1 (application PCT/CN2020/118049)
Authority
WO
WIPO (PCT)
Prior art keywords
target
text
language
voice
video frame
Prior art date
Application number
PCT/CN2020/118049
Other languages
French (fr)
Chinese (zh)
Inventor
严伟波
Original Assignee
深圳市万普拉斯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市万普拉斯科技有限公司
Publication of WO2021057957A1 publication Critical patent/WO2021057957A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04Real-time or near real-time messaging, e.g. instant messaging [IM]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10Multimedia information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4856End-user interface for client configuration for language selection, e.g. for the menu or subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • This application relates to a video call method and apparatus, a computer device, and a storage medium.
  • In conventional schemes, members of a call can only break away from the instant messaging client during the video call and use a third-party translation device to translate the voice data of other members; they must wait for the translation result from the third-party device before making a voice reply based on it.
  • This application provides a video call method.
  • The method includes: collecting the first voice and source video frame generated by a target member in a video call; converting the first voice according to the preset target languages associated with the counterpart members participating in the video call to obtain first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
  • This application also provides a video call method, including: obtaining the first voice and source video frame generated by a target member in a video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language; and sending each obtained target video frame to the corresponding other members.
  • This application also provides a video call device.
  • The device includes: a first text generation module configured to collect the first voice and source video frame generated by the target member in the video call and to convert the first voice according to the preset target languages to obtain the first texts; a target video frame synthesis module configured to synthesize the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and a page display module configured to send the obtained target video frames in each target language to the corresponding counterpart members.
  • This application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the following steps are implemented: collecting the first voice and source video frame generated by the target member in a video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain the first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
  • This application also provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by the processor, the following steps are implemented: collecting the first voice and source video frame generated by the target member in the video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain the first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
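  • The claimed flow — collect the voice and frame, convert per target language, synthesize subtitles, and send — can be sketched as follows. This is a minimal illustration; the function names (`transcribe`, `translate`, `synthesize`) are hypothetical stand-ins, not APIs defined by the patent.

```python
def video_call_pipeline(first_voice, source_frame, target_languages,
                        transcribe, translate, synthesize):
    """For each counterpart member, convert the speaker's voice into that
    member's target language and embed the text into the source frame."""
    frames = {}
    for member_id, lang in target_languages.items():
        first_text = translate(transcribe(first_voice), lang)
        frames[member_id] = synthesize(source_frame, first_text)
    return frames

# Toy stand-ins for the speech, translation, and overlay stages.
result = video_call_pipeline(
    first_voice="ni hao",
    source_frame="FRAME",
    target_languages={"B": "en", "C": "ja"},
    transcribe=lambda v: "hello" if v == "ni hao" else v,
    translate=lambda t, lang: f"{t}[{lang}]",
    synthesize=lambda frame, text: f"{frame}+{text}",
)
```

  • Note that the voice is transcribed once and only the translation step runs per language, mirroring the reuse of the first text described later in the disclosure.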
  • Fig. 1 is an application scenario diagram of a video call method in an embodiment;
  • Fig. 2 is a schematic flowchart of a video call method in an embodiment;
  • Fig. 3 is a schematic diagram of a language configuration page in an embodiment;
  • Fig. 4 is a schematic diagram of a target video frame in an embodiment;
  • Fig. 5 is a schematic diagram of a pop-up window displaying a second text in an embodiment;
  • Fig. 6 is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment;
  • Fig. 7 is a schematic diagram of a video frame display area in an embodiment;
  • Fig. 8 is a structural block diagram of a video call device in an embodiment;
  • Fig. 9 is a structural block diagram of a video call device in another embodiment; and
  • Fig. 10 is a diagram of the internal structure of a computer device in an embodiment.
  • The method described above not only relies on a third-party translation device, which raises communication costs, but also requires constant switching between the terminal and the third-party device, which complicates operation.
  • In addition, waiting for the translation results returned by the third-party device causes repeated pauses during the video call, prolongs the entire call, and wastes video call link resources.
  • Fig. 1 is an application environment diagram of a video call method in an embodiment.
  • the video call method is applied to a video call system.
  • the video call system includes a first terminal 102, a server 104, and a second terminal 106.
  • the first terminal 102 communicates with the server 104 through the network
  • the second terminal 106 communicates with the server 104 through the network.
  • the first terminal 102 and the second terminal 106 may be mobile phones, tablet computers, portable wearable devices, or the like.
  • the first terminal 102 is a terminal corresponding to the target member in the video call system
  • the second terminal 106 is a terminal corresponding to the counterpart member in the video call system.
  • the first terminal 102 and the second terminal 106 respectively run instant messaging applications.
  • the first terminal 102 can establish a video call link with the second terminal 106.
  • Video calls can be divided into two-person video calls and multi-person video calls according to the number of participating member IDs.
  • Multi-person video calls can be group calls.
  • the member ID is used to uniquely identify the call member, which can be numbers, letters, or symbols.
  • In a two-person video call, the second terminal 106 may be implemented by a single terminal; in a multi-person video call, the second terminal 106 may be implemented by multiple terminals.
  • The instant messaging application in the first terminal 102 can integrate a subtitle synthesis plug-in, which converts and translates the collected first voice into first texts in multiple language versions, synthesizes the different versions of the first text as subtitle content with the source video frames generated by the target member in the video call to obtain target video frames, and forwards the target video frames through the server 104 to the second terminals 106 corresponding to the other members.
  • the server 104 may be implemented as an independent server or a server cluster composed of multiple servers.
  • The terms "first", "second", and so on used in this application may describe various elements, but the elements are not limited by these terms; the terms only distinguish one element from another.
  • For example, the first terminal may be referred to as the second terminal and, similarly, the second terminal may be referred to as the first terminal. Both are terminals, but they are not the same terminal.
  • a video call method is provided. Taking the method applied to the first terminal in FIG. 1 as an example for description, the method includes the following steps:
  • S202: Collect the first voice and source video frame generated by the target member in the video call.
  • The first voice refers to the voice data of the target member collected during the video call by the audio collection component of the first terminal corresponding to the target member.
  • the audio collection component refers to the relevant hardware used in the terminal to collect audio data, such as a microphone.
  • the source video frame refers to the image information about the target member collected by the first terminal based on the image collection component, such as the camera.
  • the first terminal detects whether there is a start instruction generated for the subtitle synthesis plug-in. If the start instruction is detected, the first terminal starts the subtitle synthesis plug-in and turns on the subtitle synthesis function.
  • the first terminal has an icon for turning on the subtitle synthesis plug-in, and the target member can actively click the plug-in icon before or during the video call to turn on the subtitle synthesis function.
  • the first terminal when the first terminal detects that the target member starts the video call, the first terminal automatically calls the start interface of the subtitle synthesis plug-in to start the subtitle synthesis function.
  • The subtitle synthesis plug-in sends an image reading instruction to the image collection component and an audio reading instruction to the audio collection component, so as to read the source video frames collected by the image collection component and the first voice collected by the audio collection component.
  • Before sending the image reading instruction, the subtitle synthesis plug-in may check whether the target member has granted the image collection component permission to collect the target member's image information. If permission is not granted, the plug-in automatically substitutes a preset picture for the source video frame; for example, a preset pure-black image may subsequently be used as the source video frame.
  • Because the preset picture is set in advance, the subtitle synthesis plug-in can still carry out target video frame synthesis normally when the image collection component fails to capture a source video frame, so the counterpart members can still communicate smoothly with the target member based on the subtitle content in the target video frame.
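  • The fallback described above can be sketched as follows; the function and constant names are illustrative, not from the patent.

```python
# Placeholder for the preset picture (e.g. a pure-black image buffer).
PRESET_FRAME = "pure-black-image"

def read_source_frame(camera_granted, capture_frame):
    """Return a frame for subtitle synthesis: the live capture when
    available, otherwise the preset picture, so synthesis never stalls."""
    if not camera_granted:
        return PRESET_FRAME
    frame = capture_frame()
    # Capture may also fail at runtime; fall back in that case too.
    return frame if frame is not None else PRESET_FRAME
```

  • With this shape, the downstream synthesis code never has to special-case a missing camera: it always receives some frame to composite subtitles onto.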
  • S204: Convert the first voice according to the preset target languages associated with the counterpart members participating in the video call to obtain the first texts.
  • FIG. 3 is a schematic diagram of a language configuration page in an embodiment.
  • the first terminal can obtain the member identification of each opposing member participating in the video call, and generate the language configuration page as shown in FIG. 3 based on the member identification.
  • the target member can select the source language corresponding to the first voice to be recognized (denoted as the first target language) and the target language corresponding to the other member (denoted as the second target language) on the language configuration page.
  • For example, if Chinese is selected as the first target language and English as the second target language, the terminal converts the Chinese first voice into the corresponding English text when translating.
  • the subtitle synthesis plug-in recognizes the first voice according to the first target language, and converts the first voice into the first text corresponding to the first target language according to the recognition result.
  • The subtitle synthesis plug-in checks whether each second target language is the same as the first target language. If not, the plug-in counts the language versions among the second target languages and, for each version, translates the first text corresponding to the first target language into that second target language, obtaining the first text corresponding to each second target language.
  • the first terminal may send the language configuration information to the second terminal, so that the second terminal correspondingly displays the language configuration information.
  • the opposing member finds that the second target language set by the target member is wrong, the opposing member can simply prompt the target member through the instant messaging application.
  • the target member can trigger the target language change operation according to the prompt of the opposing member.
  • The subtitle synthesis plug-in continuously monitors the user's operation behavior; when it detects the target language change operation, it displays the language change page.
  • The target member can re-select the second target language for the counterpart member on the language change page, and the subtitle synthesis plug-in then converts the first voice according to the reselected second target language to obtain the corresponding first text.
  • By displaying the language configuration information configured by the target member on the counterpart terminal as well, the target member can change the configuration promptly when it is found to be incorrect, improving the efficiency of video calls.
  • the subtitle synthesis plug-in recognizes the first voice based on the first target language, and directly converts the recognized first voice into the corresponding first text according to the second target language.
  • the subtitle synthesis plug-in buffers the current first voice after collecting the first voice.
  • The subtitle synthesis plug-in determines the input time of the currently received first voice and checks whether a new first voice arrives within a preset duration of that input time. If so, the new first voice is also cached; if not, the cached first voice segments are spliced into one spliced first voice, which is then recognized based on the first target language.
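  • The splicing behavior can be sketched as a small buffer class. This is a hypothetical sketch: the class name, the `window` parameter, and the use of byte strings for audio are all illustrative assumptions.

```python
class VoiceBuffer:
    """Splice voice segments that arrive within `window` seconds of each
    other; when a gap exceeds the window, release the spliced utterance
    for recognition and start buffering the new one."""

    def __init__(self, window=1.0):
        self.window = window
        self.segments = []
        self.last_input = None

    def push(self, segment, now):
        # A gap longer than the window closes the current utterance.
        if self.last_input is not None and now - self.last_input > self.window:
            spliced = self.flush()
            self.segments = [segment]
            self.last_input = now
            return spliced  # ready for speech recognition
        self.segments.append(segment)
        self.last_input = now
        return None

    def flush(self):
        spliced = b"".join(self.segments)
        self.segments = []
        return spliced
```

  • Batching segments this way avoids recognizing fragments mid-sentence, at the cost of the preset waiting window of latency.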
  • the first terminal may also send the first voice and language configuration information to the server, so that the server correspondingly recognizes and translates the first voice according to the language configuration information.
  • S206: Synthesize the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language.
  • The subtitle synthesis plug-in obtains the image width of the source video frame and, based on that width and the number of characters in the first text corresponding to each second target language, determines the size of the background image for each target language.
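  • One plausible way to derive the background size from the frame width and character count is sketched below. The patent does not give a formula; the per-character dimensions and padding here are illustrative assumptions.

```python
def background_size(frame_width, num_chars, char_width=16, char_height=24,
                    padding=8):
    """Estimate the subtitle background size: wrap the text to the frame
    width and stack as many lines as needed (sizes are illustrative)."""
    chars_per_line = max(1, (frame_width - 2 * padding) // char_width)
    lines = -(-num_chars // chars_per_line)  # ceiling division
    return frame_width, lines * char_height + 2 * padding
```

  • A longer translation thus simply yields a taller background strip, so subtitles in verbose languages are not truncated.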
  • the subtitle synthesis plug-in obtains a preset background image generation format, such as RGBA format, and generates a corresponding background image according to the preset format and size information.
  • the subtitle synthesis plug-in reads the text content in the first text corresponding to each target language, and adds the text content of the first text as the subtitle content to the corresponding background image to obtain the subtitle image corresponding to each target language.
  • the subtitle synthesis plug-in can uniformly adjust the subtitle image according to the preset background image color and character color.
  • the character refers to the text content of the first text displayed in the subtitle image.
  • the background color is uniformly adjusted to black, and the character color is uniformly adjusted to white.
  • The subtitle synthesis plug-in obtains the element array of the subtitle image and sets the values of the elements representing the background color to zero, removing the background color and yielding a subtitle image with a transparent background and white subtitles.
  • The element array of the subtitle image records the three primary color values and the transparency of each pixel; based on this array, the colors and transparency in the image can be adjusted dynamically.
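  • Zeroing the background elements can be illustrated with RGBA tuples. This is a sketch under the assumption that the background was first normalized to black, as the preceding passage describes.

```python
def make_background_transparent(rgba_pixels, bg_color=(0, 0, 0)):
    """Zero the alpha of background-colored pixels so only the white
    subtitle characters remain visible after overlay."""
    out = []
    for r, g, b, a in rgba_pixels:
        if (r, g, b) == bg_color:
            out.append((r, g, b, 0))  # background: fully transparent
        else:
            out.append((r, g, b, a))  # character pixels keep their alpha
    return out
```

  • Normalizing the background to a single known color first is what makes this per-pixel test reliable.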
  • FIG. 4 is a schematic diagram of a target video frame in an embodiment.
  • the subtitle synthesis plug-in converts the source video frame according to the background image format, and generates a video frame image with the same format as the background image.
  • the subtitle synthesis plug-in obtains preset synthesis location information, and performs pixel superposition of the video frame image and the subtitle image corresponding to each target language according to the synthesis location information to obtain at least one target video frame as shown in FIG. 4.
  • the developer of the subtitle synthesis plug-in can set a synthesis starting point in advance, so that the subtitle plug-in can linearly superimpose the element values corresponding to the pixels in the corresponding position of the video frame image and the subtitle image from the synthesis starting point.
  • The subtitle synthesis plug-in converts the format of the pixel-superimposed synthesized image to obtain, for each target language, a target video frame in the same format as the source video frame, and sends each target video frame to the corresponding counterpart member according to the correspondence between member IDs and second target languages.
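  • The pixel superposition from a synthesis starting point can be sketched as a standard alpha-over blend. The patent says only "linearly superimpose the element values"; the exact blend formula below is an assumption, and images are modeled as 2-D lists of (r, g, b, a) tuples for simplicity.

```python
def overlay_subtitle(frame, subtitle, origin):
    """Alpha-blend the subtitle image onto the frame, starting at the
    (row, col) synthesis starting point `origin`. Modifies `frame`."""
    oy, ox = origin
    for y, row in enumerate(subtitle):
        for x, (sr, sg, sb, sa) in enumerate(row):
            fr, fg, fb, fa = frame[oy + y][ox + x]
            alpha = sa / 255  # transparent background pixels leave the frame intact
            frame[oy + y][ox + x] = (
                int(sr * alpha + fr * (1 - alpha)),
                int(sg * alpha + fg * (1 - alpha)),
                int(sb * alpha + fb * (1 - alpha)),
                255,
            )
    return frame
```

  • Because the background alpha was zeroed in the previous step, only the character pixels actually alter the video frame.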
  • For example, the subtitle synthesis plug-in on terminal A determines, from A's language configuration operation, that the second target language corresponding to B is English and the second target language corresponding to C is Japanese.
  • At this time, the subtitle synthesis plug-in sends the target video frames with embedded English subtitles to B and the target video frames with embedded Japanese subtitles to C.
  • In the above method, the first voice generated by the target member in the video call is translated into first texts in multiple language versions according to the target language familiar to each member participating in the call.
  • Each version can be synthesized with the source video frame to form a target video frame carrying voice-translation subtitles, which is displayed on the page corresponding to that member's video call.
  • Because each member receives target video frames subtitled in the language they require, every participant can understand what the target member is saying in a familiar language without leaving the instant messaging client, which improves the efficiency of video calls and in turn saves video call link resources.
  • In addition, the recognized first text can be reused across language versions, reducing the data processing needed to synthesize the source video frame with the different versions of the first text and thereby saving terminal data processing resources.
  • The above video call method further includes: when a target language configuration operation is triggered, displaying the language configuration page; obtaining language configuration information configured through the page, which includes the candidate languages corresponding to the target member and to each counterpart member participating in the video call; and storing the target member's member ID in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's ID also exists, the server uses the candidate language associated with each member ID in the configuration information as that member's target language.
  • both the target member and the other member can trigger the target language configuration operation.
  • The terminal displays the language configuration page according to the member's operation, generates language configuration information based on the page, and sends it to the server, which stores the configuration information in association with the member ID corresponding to the sending terminal.
  • For example, when A and B are in a video call, A can set the candidate language associated with A as English and the candidate language associated with B as Chinese; B can likewise set the candidate language associated with B as Chinese and the candidate language associated with A as English. The server then stores the configuration information sent by A and by B according to their respective member IDs.
  • the server uses the candidate language corresponding to the member identifier associated with each language configuration information as the target language of the corresponding member, thereby screening multiple pieces of language configuration information to generate a unified language configuration information.
  • The server extracts the candidate language "English" associated with A's ID from the language configuration information sent by A and determines English as the target language corresponding to A; from the language configuration information sent by B, it extracts the candidate language "Chinese" associated with B's ID and determines Chinese as the target language corresponding to B.
  • When there are multiple pieces of configuration information, filtering them according to member IDs yields one unified piece of language configuration information, based on which subsequent terminals or servers can generate the corresponding texts.
  • Using the candidate language associated with each member ID in the configuration information as that member's target language improves the accuracy of the language configuration and reduces cases where, because of a configuration error, the subtitles in the target video frames received by a counterpart member are not in a language that member knows.
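  • The unification rule — each member's own self-associated candidate language wins — can be sketched as a dictionary merge. The data shapes here are assumptions for illustration.

```python
def unify_language_config(configs):
    """Merge per-member configurations: the candidate language a member
    associated with their OWN ID becomes that member's target language."""
    unified = {}
    for member_id, config in configs.items():
        if member_id in config:
            unified[member_id] = config[member_id]
    return unified

# A says: A speaks English, B speaks Chinese; B says the same from B's side.
configs = {
    "A": {"A": "English", "B": "Chinese"},
    "B": {"B": "Chinese", "A": "English"},
}
```

  • Letting each member's self-declaration take precedence avoids conflicts when members guess each other's languages incorrectly.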
  • the above-mentioned video call method further includes: sending the first voice to the server; receiving the first text returned by the server after converting the first voice according to the target language of each counterparty member.
  • The subtitle synthesis plug-in sends the first voice to the server, which recognizes and translates it according to the target member's member ID and the unified language configuration information, generates the first text corresponding to each second target language, and returns the first texts to the first terminal.
  • the first voice recognition and translation process in the above-mentioned video call method can be completed either in the first terminal or in the server.
  • For example, the first terminal can recognize and translate the first voice according to the language configuration information stored locally, or it can pull the unified language configuration information from the server and recognize and translate the first voice according to that unified information.
  • Alternatively, the server can pull the corresponding language configuration information from the first terminal and recognize and translate the first voice according to it, or the server may recognize and translate the first voice according to the unified language configuration information stored on the server.
  • the server converts the first voice to obtain the corresponding first text, which can reduce terminal resources consumed by the terminal for converting the first voice.
  • The above video call method further includes: generating a corresponding subtitle image based on each first text and caching the subtitle image. Synthesizing the source video frame with the first text corresponding to each target language then includes: every first preset duration, querying whether an updated subtitle image exists in the cache; if so, synthesizing the updated subtitle image with each source video frame generated within the second preset duration before the current time, and deleting the synthesized subtitle image from the cache. The second preset duration is less than the first preset duration.
• the first preset duration is a duration set by the developer of the subtitle synthesis plug-in according to the video frame rate of the played video. For example, when an instant messaging application plays video, the video is generally played at a rate of 30 frames per second; in this case, the developer of the subtitle synthesis plug-in can set the first preset duration to 30 milliseconds.
• the second preset duration is the interval at which the subtitle synthesis plug-in reads source video frames from the image capture component. If the second preset duration is too long, the target video frames received by the opposite member will be delayed too long; if it is too short, the opposite member will receive too few subtitle-embedded target video frames to recognize the subtitle content. It therefore needs to be set reasonably, for example to 3 seconds.
  • the image acquisition component in the terminal collects the image information of the target member in real time, and correspondingly caches the image information and the acquisition time of the target member in the image buffer area.
• the subtitle synthesis plug-in checks whether the preset subtitle buffer area contains a buffered subtitle image; if so, the subtitle synthesis plug-in clears the subtitle buffer area and caches the currently generated subtitle image in the subtitle buffer area.
• the subtitle synthesis plug-in checks whether there is an updated subtitle image in the subtitle buffer area every first preset duration. When there is an updated subtitle image, the subtitle synthesis plug-in reads from the image buffer area at least one source video frame collected by the image capture component within the second preset duration before the current time, and then deletes the read source video frames from the image buffer area. If no updated subtitle image has been stored in the subtitle buffer area within the second preset duration before the current time, the subtitle synthesis plug-in directly sends the source video frames within the second preset duration before the current time to the opposite member, and deletes the sent source video frames from the image buffer area.
• the subtitle synthesis plug-in separately synthesizes the subtitle image corresponding to each second target language with each source video frame read from the image buffer area to obtain the corresponding target video frames, and deletes the synthesized subtitle image from the subtitle buffer area.
• in this way, the latest subtitle image can be obtained in time, so that the synthesized target video frame can be sent to the opposite member in time; by synthesizing the latest subtitle image with multiple source video frames, the opposite members can recognize the subtitle content from multiple target video frames.
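The buffering and polling behavior described above can be sketched as follows. This is a minimal illustration only: the function names, the single-slot subtitle cache, and the concrete duration values are assumptions made for the sketch, not part of the disclosed implementation.

```python
FIRST_PRESET = 0.030   # polling interval in seconds (e.g. 30 ms for a 30 fps stream)
SECOND_PRESET = 3.0    # look-back window for source frames (e.g. 3 seconds)

subtitle_cache = {}    # target language -> most recently generated subtitle image
frame_buffer = []      # (capture_time, source_frame) pairs from the image capture component

def overlay(frame, subtitle_image):
    # Placeholder for the actual pixel-level composition of frame and subtitle.
    return (frame, subtitle_image)

def poll_once(now, send):
    """One iteration of the subtitle-synthesis loop, run every FIRST_PRESET seconds."""
    recent = [(t, f) for (t, f) in frame_buffer if now - t <= SECOND_PRESET]
    if subtitle_cache:
        # An updated subtitle image exists: embed it into every recent source frame.
        for lang, image in subtitle_cache.items():
            for _, frame in recent:
                send(lang, overlay(frame, image))
        subtitle_cache.clear()               # delete the synthesized subtitle images
    else:
        # No updated subtitle: forward the raw source frames directly.
        for _, frame in recent:
            send(None, frame)
    for item in recent:                      # drop the frames that were read
        frame_buffer.remove(item)
```

In this sketch, the first preset duration is how often `poll_once` would be invoked, and the second preset duration bounds which buffered source frames are still recent enough to synthesize.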
• generating the corresponding subtitle image based on each first text includes: determining the subtitle width according to the image width of the source video frame; converting the subtitle width into a character number threshold corresponding to each target language; splitting the corresponding first text into multiple sub-texts according to the character number threshold; determining the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; and adding the first text as subtitle content to the background image generated according to the subtitle width and subtitle height to obtain the subtitle image.
  • the threshold of the number of characters is the maximum number of characters that can be displayed in a single-line subtitle.
  • the subtitle synthesis plug-in determines the image width of the source video frame, and determines the subtitle width according to a preset image width ratio value. For example, if the preset subtitle width ratio is two-thirds, the subtitle synthesis plug-in determines two-thirds of the image width of the source video frame as the subtitle width.
• the terminal stores a correspondence between each target language and the width information of a single character and the spacing between characters in that language.
• the subtitle synthesis plug-in separately obtains the second target language corresponding to each first text, determines the corresponding single-character width information and inter-character spacing from the correspondence according to the language information of the second target language, and calculates the character number threshold corresponding to the second target language based on the obtained subtitle width, single-character width information and inter-character spacing; that is, the subtitle synthesis plug-in can obtain the number of characters a single-line subtitle can hold according to the subtitle width, the single-character width information and the inter-character spacing.
• the subtitle synthesis plug-in counts the characters in the first text to obtain the total number of characters, and divides the total number of characters by the character number threshold (rounding up) to obtain the number of sub-texts.
• the subtitle synthesis plug-in generates a corresponding number of sub-texts based on the number of sub-texts.
  • the subtitle synthesis plug-in reads a threshold number of characters from the first character in the first text, and stores the read characters in the sub-text.
• the subtitle synthesis plug-in deletes the read characters from the first text, continues to read characters from the first text according to the character number threshold, and stores the read characters in a sub-text that does not yet store characters, until all characters in the first text have been deleted.
• the subtitle synthesis plug-in counts the number of sub-texts corresponding to the first text, and determines the number of subtitle lines in the subtitle image according to the number of sub-texts. For example, when there are three sub-texts, the subtitle synthesis plug-in can consider that the subtitle image to be generated has three lines of subtitles; it can then calculate the subtitle height of the corresponding first text according to the preset height of a single subtitle line and the total number of subtitle lines.
  • the subtitle synthesis plug-in generates a background image of a corresponding size according to the subtitle width and the subtitle height, and adds the characters in each sub-text to the background image as subtitle content.
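The width, character-threshold, and splitting steps above can be sketched as a single layout routine. The per-language character metrics, line height, and width ratio below are illustrative assumptions, not values taken from the disclosure.

```python
# Hypothetical per-language metrics: (single-character width, inter-character spacing), in pixels.
CHAR_METRICS = {"en": (12, 2), "zh": (20, 2)}

LINE_HEIGHT = 24          # assumed preset height of a single subtitle line, in pixels
WIDTH_RATIO = 2 / 3       # assumed preset subtitle-width ratio

def subtitle_layout(text, lang, frame_width):
    """Compute subtitle geometry and split `text` into single-line sub-texts."""
    subtitle_width = int(frame_width * WIDTH_RATIO)
    char_w, spacing = CHAR_METRICS[lang]
    # Character number threshold: how many characters fit on one subtitle line.
    char_threshold = subtitle_width // (char_w + spacing)
    # Split the first text into sub-texts of at most `char_threshold` characters.
    sub_texts = [text[i:i + char_threshold]
                 for i in range(0, len(text), char_threshold)]
    # Background-image height grows with the number of sub-texts (subtitle lines).
    subtitle_height = LINE_HEIGHT * len(sub_texts)
    return subtitle_width, subtitle_height, sub_texts
```

For example, with a 720-pixel-wide source frame and the assumed English metrics, an 80-character first text would be split into three sub-texts of at most 34 characters each.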
• by determining the subtitle width according to the image width of the source video frame, the probability of the subtitle exceeding the video image because the generated subtitle image is wider than the source video frame can be reduced; determining the height of the background image according to the number of sub-texts reduces the generation of unnecessary background image area.
  • the above-mentioned video call method further includes: collecting the second voice produced by the opposite member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
  • the second voice corresponding to the opposite member can be sent to the first terminal through the instant messaging application on the second terminal.
  • the instant messaging application in the first terminal receives the second voice, and sends the second voice to the audio playback component.
  • the subtitle synthesis plug-in in the first terminal monitors whether the audio playback component receives the second voice.
• the subtitle synthesis plug-in obtains the second voice, and recognizes and translates the second voice according to the first target language corresponding to the target member in the language configuration information to obtain the second text.
  • the subtitle synthesis plug-in correspondingly displays the generated second text on the screen of the first terminal.
  • Fig. 5 is a schematic diagram of a pop-up window displaying the second text in an embodiment.
• the first terminal may display the second text in the form of a pop-up window, or may display the second text in the form of a prompt message as shown in FIG. 6, which is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment.
  • the target member can independently select a suitable display form based on actual needs, which greatly improves the user experience.
• the first terminal determines whether the target member actively closes the second text within a preset duration from the time when the second text is displayed. If the second text is not actively closed, the first terminal may generate a closing instruction for the second text and automatically close the displayed second text based on the closing instruction, so that when the target member has finished reading the second text, the second text is closed automatically, thereby saving the display resources consumed by the terminal to display the second text.
  • the target member can manually close the displayed second text, such as clicking the close control to close the second text, or closing the second text according to a sliding operation on the screen.
• when the target member minimizes the instant messaging application, the first terminal can still display the second text in the form of a pop-up window or a prompt message.
• the second text is displayed in the form of a pop-up window or a prompt message, so that the display of the second text can be separated from the video call page; thus, when the instant messaging application is switched to run in the background, the target member can still communicate smoothly with the opposite members according to the content of the second text.
• the second voice obtained from the audio playback component may be a mixture of the voices of multiple opposite members.
• the subtitle synthesis plug-in extracts timbre information from the second voice, divides the second voice into multiple second sub-voices according to the timbre information, and converts the multiple second sub-voices based on the target language corresponding to the target member to obtain multiple second texts.
  • the first terminal respectively displays a plurality of second texts correspondingly.
• the second voice is divided according to timbre so that the subtitle synthesis plug-in can distinguish the second sub-voices of different opposite members; thus, in a multi-person video call scene, displaying multiple second texts can help the target member distinguish the different information expressed by different opposite members, further improving the communication efficiency of multi-person video calls.
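The timbre-based splitting can be sketched as grouping audio segments by a speaker label. Here `estimate_timbre` is a hypothetical stand-in for a real speaker-embedding or diarization model, which the sketch does not implement.

```python
def estimate_timbre(segment):
    # Hypothetical: a real system would derive a speaker label from the
    # timbre (spectral) characteristics of the audio segment.
    return segment["speaker"]

def split_by_timbre(segments):
    """Group audio segments of the mixed second voice into per-speaker sub-voices."""
    sub_voices = {}
    for seg in segments:
        sub_voices.setdefault(estimate_timbre(seg), []).append(seg)
    return sub_voices
```

Each per-speaker sub-voice would then be converted separately into a second text, allowing one text per opposite member to be displayed.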
• in this way, the target member can understand the content expressed by the opposite member even when the subtitle synthesis plug-in is not installed in the second terminal, so that the video call can proceed smoothly.
• the page of the video call includes video frame display areas corresponding to the target member and each opposite member; the above-mentioned video call method further includes: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target language of the target member, recorded as the first target video frame; obtaining a second target video frame from the opposite member, where the second target video frame is obtained by converting the second voice generated by the opposite member during the video call into the second text according to the target language corresponding to the target member and synthesizing the converted second text with the source video frame generated by the opposite member during the video call; and displaying the second target video frame in the video frame display area corresponding to the opposite member.
• the subtitle synthesis plug-in can convert the first voice according to the first target language corresponding to the target member to obtain the corresponding first text, and synthesize the first text with the source video frame to obtain the first target video frame corresponding to the first target language of the target member.
• the second terminal may convert the second voice generated by the opposite member during the video call into the second text according to the target language corresponding to the target member, synthesize the second text with the source video frame generated by the opposite member during the video call to obtain the second target video frame, and then send the synthesized second target video frame to the first terminal.
• the first terminal obtains the page size of the video call page, and divides video frame display areas corresponding to the target member and each opposite member according to the page size.
• for example, the first terminal counts the total number of members participating in the video call, divides the video call page into multiple video frame display areas according to the total number of members, and agrees that the first divided display area is the video frame display area corresponding to the target member.
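Such a division can be sketched as a simple near-square grid over the page, with the first cell reserved for the target member by convention. The grid strategy itself is an assumption: the disclosure only requires some division of the page according to the member count.

```python
import math

def divide_display_areas(page_w, page_h, member_count):
    """Divide the video call page into a near-square grid of display areas.

    Returns a list of (x, y, w, h) rectangles; by convention the first
    rectangle is the target member's own video frame display area.
    """
    cols = math.ceil(math.sqrt(member_count))
    rows = math.ceil(member_count / cols)
    cell_w, cell_h = page_w // cols, page_h // rows
    return [((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
            for i in range(member_count)]
```

For a three-member call on a 1080x1920 page, this yields a 2x2 grid with three occupied cells, the first of which belongs to the target member.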
  • FIG. 7 is a schematic diagram of a video frame display area according to an embodiment.
• the first terminal separately obtains the area sizes of the video frame display areas corresponding to the target member and the opposite member, and correspondingly scales the first target video frame and the second target video frame according to the area sizes, so that the video frame display areas shown in FIG. 7 can completely display the first target video frame and the second target video frame.
  • the target member can change the size of the video frame display area according to his own needs. For example, when the target member has a video call with B and C, the target member can enlarge the video frame display area corresponding to B. The video frame display area corresponding to the target member and the video frame display area corresponding to C will be correspondingly reduced, so that the entire video call is more in line with the actual needs of the target member.
• when the target member finds that the subtitles in the displayed first target video frame are incorrect, the target member can calibrate the wrong characters in the subtitles. In this case, the subtitle synthesis plug-in generates an error correction page according to the target member's calibration operation; based on the correction page, the target member can input the display character corresponding to the wrong character.
• the subtitle synthesis plug-in stores the wrong characters and the corresponding characters to be displayed in a character library. When the subtitle synthesis plug-in recognizes a wrong character again, it can choose whether to correct it according to the characters to be displayed in the character library.
• in this way, the target user can check in real time whether the subtitle content displayed in the first target video frame is correct, so that a wrong character can be calibrated in time when it is found, thereby improving the accuracy of the subtitle synthesis plug-in's speech translation.
• the above-mentioned video call method further includes: collecting the second voice generated by the opposite member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining the display style of the obtained second text according to the size of the video frame display area corresponding to each opposite member; and displaying the obtained second text in a pop-up window of the video call page according to the display style.
  • the display style of the second text includes character transparency, character size, and character color in the second text.
  • the subtitle synthesis plug-in obtains the second voice generated during the video call from the audio playback component, and converts the second voice according to the target language corresponding to the target member to obtain the second text.
• the subtitle synthesis plug-in obtains the size of the video frame display area corresponding to each opposite member. When the size of the video frame display area corresponding to each opposite member is less than an area threshold, it can be considered that the target member cannot clearly identify the subtitle content displayed in the video frame display area; at this time, the subtitle synthesis plug-in correspondingly reduces the character transparency, increases the character size, and changes the character color to a more eye-catching color based on a preset configuration file.
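The style decision can be sketched as a threshold check on the display-area size. The area threshold and the concrete style values below are placeholder assumptions, not values from the disclosure.

```python
AREA_THRESHOLD = 200 * 150   # assumed minimum legible display-area size, in pixels

def second_text_style(area_w, area_h):
    """Pick a display style for the second text based on the display-area size."""
    if area_w * area_h < AREA_THRESHOLD:
        # Area too small to read embedded subtitles: make the pop-up text prominent
        # (lower transparency, larger characters, eye-catching color).
        return {"transparency": 0.1, "font_size": 20, "color": "#FF3B30"}
    # Area large enough: keep the second text unobtrusive to avoid interference.
    return {"transparency": 0.6, "font_size": 14, "color": "#FFFFFF"}
```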
  • the subtitle synthesis plug-in can generate a style adjustment control in the terminal, and based on the style adjustment control, the target member can correspondingly adjust the style of the second text.
  • the target member can independently adjust the display style of the second text, thereby improving the user experience.
• the style of the second text is adjusted in real time according to the size of the video frame display area corresponding to the opposite member. This not only reduces cases in which the target member cannot clearly identify the subtitle content because the video frame display area is too small, but also, when the video frame display area is large enough, reduces the interference caused to the target member by repeatedly presenting the opposite member's voice information, by lowering the prominence of the second text.
• the terminal includes an audio collection component and an audio playback component; the above-mentioned video call method further includes: collecting the first voice based on the audio collection component, and collecting the second voice based on the audio playback component.
• the audio collection component in the first terminal can record the first voice of the target member in real time and transmit the recorded first voice to the subtitle synthesis plug-in as a voice stream to generate the corresponding first text.
  • the audio collection component in the second terminal can also collect the second voice of the opposite member in real time, and send the second voice to the first terminal through the instant messaging application.
  • the instant messaging application in the first terminal receives the second voice, and sends the second voice to the audio playback component.
• the subtitle synthesis plug-in in the first terminal monitors whether the audio playback component receives the second voice. When the audio playback component receives the second voice, the subtitle synthesis plug-in obtains the second voice, and recognizes and translates the second voice according to the first target language corresponding to the target member in the language configuration information to obtain the second text.
• in this way, the subtitle synthesis plug-in can clearly distinguish between the voice generated by the target member and the voice generated by the opposite member, so that the first text and the second text can subsequently be generated corresponding to the voices of the target member and the opposite member respectively.
• although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages; these sub-steps or stages are not necessarily executed at the same time, may be executed at different times, and are not necessarily executed sequentially; they may be performed in turn or alternately with at least part of the other steps or the sub-steps or stages of other steps.
  • a video call device 800 including: a first text generation module 802, a target video frame synthesis module 804, and a page display module 806, wherein:
• the first text generation module 802 is used to collect the first voice and source video frames generated by the target member in the video call, and convert the first voice according to the preset target languages pointed to by the opposite members participating in the video call to obtain the first text.
  • the target video frame synthesis module 804 is configured to synthesize the source video frame with the first text corresponding to each target language to obtain the target video frame corresponding to each target language.
  • the page display module 806 is configured to send the obtained target video frames of each target language to the corresponding counterparty member.
  • the above-mentioned video call device 800 further includes a language configuration module 808, which is used to display a language configuration page when the configuration operation of the target language is triggered; to obtain the language configured based on the language configuration page Configuration information; the language configuration information includes the candidate languages corresponding to the target member and the other member participating in the video call; the member ID and language configuration information of the target member are associated and stored to the server, so that the server has the language type associated with the member ID of the other member When configuring information, the candidate language corresponding to the member identifier associated with each language configuration information is used as the target language of the corresponding member.
  • the language configuration module 808 is further configured to send the first voice to the server; and receive the first text returned by the server after converting the first voice according to the target language of each member of the other party.
  • the target video frame synthesis module 804 is further configured to generate a corresponding subtitle image based on each type of first text, and cache the subtitle image; query whether there is an updated subtitle image in the cache every first preset duration; If yes, synthesize the updated subtitle image with each source video frame generated by the target member within the second preset duration before the current time, and delete the synthesized subtitle image from the cache; the second preset duration is less than the first preset duration.
  • the target video frame synthesis module 804 is further configured to determine the subtitle width according to the image width of the source video frame; convert the subtitle width into a threshold value for the number of characters corresponding to each target language; Split a text into multiple sub-texts; determine the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; add the first text as subtitle content to the background image generated according to the subtitle width and subtitle height to obtain subtitles image.
  • the video call device 800 further includes a second text generation module 810, which is used to collect the second voice generated by the opposite member during the video call; obtain the second voice obtained by converting the second voice according to the target language corresponding to the target member. Second text; display the second text.
  • the video call device 800 further includes a video frame display area determining module 812, configured to display the synthesized target video frame corresponding to the target language of the target member in the video frame display area corresponding to the target member, which is recorded as the first The target video frame; the second target video frame is obtained from the other member; the second target video frame is based on the target member’s corresponding target language, the second voice generated by the other member during the video call is converted into the second text, and the result is obtained based on the conversion The second text and the source video frame generated by the opposite member during the video call are synthesized; and the second target video frame is displayed in the video frame display area corresponding to the opposite member.
  • the video frame display area determination module 812 is also used to collect the second voice generated by the opposite member during the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; The size of the video frame display area corresponding to each member of the other party determines the display style of the obtained second text; the obtained second text is displayed in the pop-up window of the video call page according to the display style.
  • the video call device 800 further includes a voice acquisition module 814, configured to collect the first voice based on the audio collection component, and collect the second voice based on the audio playback component.
  • the various modules in the above-mentioned video call device may be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a first terminal, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, a display screen, an audio collection device, an audio playback device, an image collection device, and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a video call method.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, or it can be a control, trackball or touchpad set on the housing of the computer equipment , It can also be an external keyboard, touchpad, or mouse.
  • FIG. 10 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
• the specific computer device may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
  • a computer device including a memory, a processor, and a computer program stored in the memory and running on the processor.
• when the processor executes the computer program, it implements: collecting the first voice and source video frames generated by the target member during a video call; converting the first voice according to the preset target languages pointed to by the opposite members participating in the video call to obtain the first text; synthesizing the source video frame with the first text corresponding to each target language respectively to obtain the target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding opposite member.
• when the processor executes the computer program, it also implements: when the configuration operation of the target language is triggered, displaying the language configuration page; obtaining the language configuration information configured based on the language configuration page, where the language configuration information includes the candidate languages corresponding to the target member and the opposite members participating in the video call; and associating and storing the member ID and language configuration information of the target member to the server, so that when there is language configuration information associated with the member IDs of the opposite members, the server uses the candidate language corresponding to the member ID associated with each piece of language configuration information as the target language of the corresponding member.
• when the processor executes the computer program, it further implements: sending the first voice to the server; and receiving the first text returned by the server, which is obtained by converting the first voice according to the target language of each opposite member.
• when the processor executes the computer program, it further implements: generating a corresponding subtitle image based on each first text, and buffering the subtitle image; synthesizing the source video frame with the first text corresponding to each target language includes: querying whether there is an updated subtitle image in the cache every first preset duration; if so, synthesizing the updated subtitle image with each source video frame generated by the target member within the second preset duration before the current time, and deleting the synthesized subtitle image from the cache; the second preset duration is less than the first preset duration.
  • When the processor executes the computer program, the following is also implemented: determining a subtitle width according to the image width of the source video frame; converting the subtitle width into a character-count threshold corresponding to each target language; splitting the first text into multiple sub-texts according to the threshold; determining the subtitle height of the corresponding first text according to the number of sub-texts of that first text; and adding the first text as subtitle content to a background image generated according to the subtitle width and subtitle height to obtain a subtitle image.
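A minimal sketch of this layout derivation (frame width → character-count threshold → sub-texts → subtitle height), assuming a fixed per-character width and line height; `build_subtitle_layout` and both constants are hypothetical:

```python
def build_subtitle_layout(image_width, text, char_width=16, line_height=24):
    """Derive subtitle geometry from the source frame width:
    width -> character-count threshold -> sub-texts -> subtitle height."""
    subtitle_width = image_width                      # subtitle spans the frame width
    max_chars = max(1, subtitle_width // char_width)  # character-count threshold
    # Split the first text into sub-texts of at most max_chars characters each.
    sub_texts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    subtitle_height = len(sub_texts) * line_height    # height from sub-text count
    return subtitle_width, subtitle_height, sub_texts
```

In practice the character-count threshold would differ per target language (wide CJK glyphs versus narrow Latin ones), which is why the claim derives it per language.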
  • When the processor executes the computer program, the following is also implemented: collecting the second voice produced by a counterpart member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
  • The video call page includes video frame display areas respectively corresponding to the target member and each counterpart member. When the processor executes the computer program, the following is also implemented: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, recorded as the first target video frame; obtaining a second target video frame from the counterpart member, where the second target video frame is obtained by converting the second voice generated by the counterpart member during the video call into a second text according to the target language corresponding to the target member, and synthesizing the converted second text with the source video frames generated by the counterpart member during the video call; and displaying the second target video frame in the video frame display area corresponding to the counterpart member.
  • When the processor executes the computer program, the following is also implemented: collecting the second voice produced by a counterpart member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining a display style for the obtained second text according to the size of the video frame display area corresponding to the counterpart member; and displaying the obtained second text in a pop-up window on the video call page according to the display style.
  • The terminal includes an audio collection component and an audio playback component; when the processor executes the computer program, the following steps are also implemented: the first voice is collected based on the audio collection component, and the second voice is played based on the audio playback component.
  • A computer-readable storage medium is provided, on which a computer program is stored.
  • When the computer program is executed by a processor, the following is implemented: collecting the first voice and the source video frames generated by the target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

The present application relates to a video call method and apparatus, a computer device and a storage medium. The method comprises: collecting first speech and source video frames generated by a target member during a video call; converting the first speech according to the preset target languages respectively indicated by the counterpart members participating in the video call so as to obtain first text; compositing the source video frames with the first text corresponding to each target language so as to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.

Description

Video call method, device, computer equipment and storage medium
Cross-reference to related applications
This application claims priority to the Chinese application filed on September 27, 2019, titled "Video call method, device, computer equipment and storage medium", with application number 2019109251949, the disclosure of which is incorporated into this application by reference in its entirety.
Technical field
This application relates to a video call method, device, computer equipment and storage medium.
Background
With the development of globalization, there are more and more exchanges between countries. At present, users can communicate in real time by way of video calls based on an instant messaging client on a terminal. However, because languages differ between countries, a user who does not understand the other party's language may be unable to communicate smoothly in a video call due to the language barrier.
When making a video call across different languages, call members can only leave the instant messaging client during the call and use a third-party translation device to translate the voice data from the other members; only after hearing the translation result fed back by the third-party translation device can they make a voice reply based on it.
Summary of the invention
This application provides a video call method. The method includes: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a video call method, including: obtaining the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the other members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the other members.
This application also provides a video call device. The device includes: a first text generation module, configured to collect the first voice and the source video frames generated by a target member during a video call, and to convert the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; a target video frame synthesis module, configured to synthesize the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and a page display module, configured to send the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor implements the following steps when executing the computer program: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
The details of one or more embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features, objects and advantages of the present invention will become apparent from the description, the drawings and the claims.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is an application scenario diagram of the video call method in an embodiment;
Fig. 2 is a schematic flowchart of the video call method in an embodiment;
Fig. 3 is a schematic diagram of a language configuration page in an embodiment;
Fig. 4 is a schematic diagram of a target video frame in an embodiment;
Fig. 5 is a schematic diagram of a pop-up window displaying the second text in an embodiment;
Fig. 6 is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment;
Fig. 7 is a schematic diagram of a video frame display area in an embodiment;
Fig. 8 is a structural block diagram of a video call device in an embodiment;
Fig. 9 is a structural block diagram of a video call device in another embodiment;
Fig. 10 is a diagram of the internal structure of a computer device in an embodiment.
Detailed description
The approach described in the background not only relies on a third-party translation device, which makes communication costly, but also requires constant switching between the terminal and the third-party translation device, which is cumbersome to operate. In addition, waiting for the translation results returned by the third-party translation device causes multiple pauses during the video call, prolongs the duration of the entire call, and wastes video call link resources.
In order to make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it.
Fig. 1 is an application environment diagram of the video call method in an embodiment. Referring to Fig. 1, the video call method is applied to a video call system. The video call system includes a first terminal 102, a server 104 and a second terminal 106. The first terminal 102 communicates with the server 104 through a network, and the second terminal 106 communicates with the server 104 through a network. The first terminal 102 and the second terminal 106 may be mobile phones, tablet computers, portable wearable devices, or the like. The first terminal 102 is the terminal corresponding to the target member in the video call system, and the second terminal 106 is the terminal corresponding to a counterpart member. The first terminal 102 and the second terminal 106 each run an instant messaging application, based on which the first terminal 102 can establish a video call link with the second terminal 106. Video calls can be divided into two-person video calls and multi-person video calls according to the number of participating member identifiers: a call involving only two member identifiers is a two-person video call, and a call involving more than two member identifiers is a multi-person video call. A multi-person video call may be a group call.
A member identifier is used to uniquely identify a call member and may consist of numbers, letters or symbols. In a two-person video call, the second terminal 106 may be implemented by a single terminal; in a multi-person video call, the second terminal 106 may be implemented by multiple terminals. The instant messaging application in the first terminal 102 may integrate a subtitle synthesis plug-in, which converts the collected first voice into text and translates it into first texts in multiple language versions, synthesizes the different versions of the first text, as subtitle content, with the source video frames generated by the target member during the video call to obtain target video frames, and forwards the target video frames through the server 104 to the second terminals 106 corresponding to the counterpart members. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
It can be understood that the terms "first", "second", etc. used in this application may be used herein to describe various elements, but these elements are not limited by these terms; the terms are only used to distinguish one element from another. For example, without departing from the scope of the present application, the first terminal may be referred to as the second terminal, and similarly the second terminal may be referred to as the first terminal. The first terminal and the second terminal are both terminals, but they are not the same terminal.
In one embodiment, as shown in Fig. 2, a video call method is provided. Taking the method applied to the first terminal in Fig. 1 as an example, it includes the following steps:
S202: Collect the first voice and the source video frames generated by the target member during the video call.
The first voice refers to the voice data of the target member collected, based on an audio collection component, by the first terminal corresponding to the target member during the video call. The audio collection component refers to the hardware in the terminal used to collect audio data, such as a microphone. The source video frames refer to the image information about the target member collected by the first terminal based on an image collection component, such as a camera.
Specifically, when the target member makes a video call with other members, the first terminal detects whether a start instruction for the subtitle synthesis plug-in has been issued; if the start instruction is detected, the first terminal starts the subtitle synthesis plug-in and turns on the subtitle synthesis function.
In one embodiment, the first terminal has an icon for turning on the subtitle synthesis plug-in, and the target member can actively tap the plug-in icon before or during the video call to turn on the subtitle synthesis function.
In one embodiment, when the first terminal detects that the target member has started a video call, the first terminal automatically calls the start interface of the subtitle synthesis plug-in to start the subtitle synthesis function.
Further, the subtitle synthesis plug-in sends an image reading instruction to the image collection component and an audio reading instruction to the audio collection component, so as to read the source video frames collected by the image collection component and the first voice collected by the audio collection component.
In one embodiment, before sending the image reading instruction to the image collection component, the subtitle synthesis plug-in may determine whether the target member has granted the image collection component permission to collect the target member's image information. If the permission has not been granted, the subtitle synthesis plug-in automatically replaces the source video frame with a preset picture. For example, when the target member has not granted the corresponding collection permission, the subtitle synthesis plug-in may subsequently use a preset pure black image as the source video frame.
In the above embodiment, by setting a preset picture in advance, even when the image collection component fails to collect source video frames, the subtitle synthesis plug-in can still perform the target video frame synthesis process normally based on the preset picture, so that the counterpart members can still communicate smoothly with the target member based on the subtitle content in the target video frames.
S204: Convert the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts.
Specifically, Fig. 3 is a schematic diagram of a language configuration page in an embodiment. When the subtitle synthesis function is started, the first terminal can obtain the member identifier of each counterpart member participating in the video call, and generate the language configuration page shown in Fig. 3 based on the member identifiers. On this page, the target member can select the source language corresponding to the first voice to be recognized (recorded as the first target language) and the target language corresponding to each counterpart member (recorded as the second target language). For example, if Chinese is selected as the first target language and English as the second target language, the terminal converts the first voice in Chinese into the corresponding English text when translating.
Further, the subtitle synthesis plug-in recognizes the first voice according to the first target language, and converts the first voice into the first text corresponding to the first target language according to the recognition result. The subtitle synthesis plug-in then checks whether each second target language is the same as the first target language; if not, the subtitle synthesis plug-in counts the distinct language versions among the second target languages and, for each distinct second target language, translates the first text corresponding to the first target language to obtain the first text corresponding to that second target language.
In one embodiment, after setting the corresponding target language for each counterpart member, the first terminal may send the language configuration information to the second terminal, so that the second terminal displays the language configuration information correspondingly. When a counterpart member finds that the second target language set by the target member is wrong, the counterpart member can simply prompt the target member through the instant messaging application; the target member can then trigger a target language change operation according to the prompt. The subtitle synthesis plug-in continuously monitors the user's operations; when the target language change operation is triggered, it displays a language change page, on which the target member can reselect the second target language corresponding to each counterpart member. The subtitle synthesis plug-in then converts the first voice according to the reselected second target languages to obtain the corresponding first texts.
In the above embodiment, by displaying the language configuration information configured by the target member on the counterpart's terminal, the target member can correct the language configuration information in time when an error is found, thereby improving the efficiency of the video call.
In one embodiment, the subtitle synthesis plug-in recognizes the first voice based on the first target language, and directly converts the recognized first voice into the corresponding first text according to the second target language.
In one embodiment, after collecting a segment of the first voice, the subtitle synthesis plug-in caches the current segment. The plug-in records the input time at which the current segment was received, and determines whether a new segment is received within a preset duration counted from that input time. If so, it caches the new segment as well; if not, it splices the at least one cached segment to obtain the spliced first voice, and recognizes the spliced first voice based on the first target language.
By judging whether new speech input is received within the preset duration, the plug-in can determine whether the target member has finished the current round of voice input, so that translation is performed only after the round is complete, making the sentences in the first text as complete as possible.
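One way to read this buffering rule is as a debounce on incoming speech segments: keep caching while new input keeps arriving, and splice and hand off for recognition once the preset duration passes with no new input. The sketch below is an assumed implementation; `VoiceBuffer` and the one-second default are hypothetical:

```python
import time

class VoiceBuffer:
    """Buffers incoming voice segments; if no new segment arrives within
    the preset duration, the buffered segments are spliced for recognition."""

    def __init__(self, preset=1.0):
        self.preset = preset      # idle duration that ends a round of input
        self.segments = []        # cached voice segments for this round
        self.last_input = None    # timestamp of the most recent segment

    def add(self, segment, now=None):
        self.segments.append(segment)
        self.last_input = now if now is not None else time.time()

    def flush_if_idle(self, now=None):
        """Return the spliced voice once the preset duration has elapsed
        since the last input; otherwise keep buffering and return None."""
        now = now if now is not None else time.time()
        if self.segments and now - self.last_input >= self.preset:
            spliced = b"".join(self.segments)
            self.segments = []
            return spliced
        return None
```

A caller would poll `flush_if_idle` periodically and pass any returned bytes to the recognizer for the first target language.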
In one embodiment, the first terminal may also send the first voice and the language configuration information to the server, so that the server recognizes and translates the first voice according to the language configuration information.
S206: Synthesize the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language.
S208: Send the obtained target video frame of each target language to the corresponding counterpart member.
Specifically, after the first terminal obtains the source video frames and the first text corresponding to each second target language, the subtitle synthesis plug-in obtains the image width of the source video frame, and determines the size of the background image corresponding to each target language based on that image width and the number of characters in the first text corresponding to that second target language. The subtitle synthesis plug-in obtains a preset background image generation format, such as the RGBA format, and generates the corresponding background image according to the preset format and the size information. The subtitle synthesis plug-in then reads the text content of the first text corresponding to each target language, and adds it as subtitle content to the corresponding background image to obtain the subtitle image corresponding to each target language.
Further, the subtitle synthesis plug-in can uniformly adjust the subtitle image according to a preset background color and character color, where "characters" refers to the text content of the first text displayed in the subtitle image. For example, the background color may be uniformly adjusted to black and the character color to white. The subtitle synthesis plug-in then obtains the element array of the subtitle image and sets the values of the elements representing the background color to zero, so as to remove the background color and obtain a subtitle image with a transparent background and white subtitles. The element array of a subtitle image is a string that records the three primary color values and the transparency of each pixel in the image; based on the element array, the primary colors and transparency of the image can be adjusted dynamically.
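The background-removal step can be illustrated on a toy RGBA pixel list: every pixel matching the preset background color has its values zeroed, which makes it fully transparent, while the white subtitle characters are kept. This is a simplified sketch (a real implementation would operate on the flat element array of the image), and `make_transparent` is a hypothetical name:

```python
def make_transparent(pixels, background=(0, 0, 0)):
    """Zero out the background-colored elements of an RGBA pixel list so
    only the subtitle characters remain visible."""
    out = []
    for (r, g, b, a) in pixels:
        if (r, g, b) == background:
            out.append((0, 0, 0, 0))   # fully transparent background pixel
        else:
            out.append((r, g, b, a))   # keep subtitle character pixels
    return out
```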
Further, Fig. 4 is a schematic diagram of a target video frame in an embodiment. The subtitle synthesis plug-in converts the source video frame according to the background image format to generate a video frame image in the same format as the background image. The subtitle synthesis plug-in obtains preset synthesis position information, and superimposes, pixel by pixel, the video frame image with the subtitle image corresponding to each target language according to the synthesis position information, obtaining at least one target video frame as shown in Fig. 4. For example, the developer of the subtitle synthesis plug-in can preset a synthesis starting point, so that the plug-in can, starting from that point, linearly superimpose the element values corresponding to the pixels at the matching positions of the video frame image and the subtitle image.
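The pixel superposition from a preset synthesis starting point might look like the following toy sketch, where both images are nested lists of RGBA tuples. The simple overwrite of opaque subtitle pixels stands in for the linear superposition of element values described above, and `overlay_subtitle` is a hypothetical name:

```python
def overlay_subtitle(frame, subtitle, origin):
    """Superimpose an RGBA subtitle image onto a video frame image at a
    preset synthesis position. Transparent subtitle pixels leave the
    underlying frame pixels untouched."""
    ox, oy = origin
    for y, row in enumerate(subtitle):
        for x, (r, g, b, a) in enumerate(row):
            if a > 0:                       # only opaque subtitle pixels
                frame[oy + y][ox + x] = (r, g, b, a)
    return frame
```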
Further, the subtitle synthesis plug-in converts the format of the pixel-superimposed composite image to obtain, for each target language, a target video frame in the same format as the source video frame, and sends each target video frame to the corresponding counterpart member according to the correspondence between member identifiers and second target languages. For example, when A makes a video call with B and C, the subtitle synthesis plug-in on A's terminal determines, according to A's language configuration operation, that the second target language corresponding to B is English and the second target language corresponding to C is Japanese; the subtitle synthesis plug-in then sends the target video frames with embedded English subtitles to B and the target video frames with embedded Japanese subtitles to C.
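The A/B/C routing in this example reduces to a lookup from the member-to-language mapping into the per-language frames; a minimal assumed sketch (`route_target_frames` is a hypothetical name):

```python
def route_target_frames(frames_by_lang, lang_by_member):
    """Send each synthesized target video frame to the members whose
    configured second target language matches it (e.g. English -> B,
    Japanese -> C in the example above)."""
    return {member: frames_by_lang[lang]
            for member, lang in lang_by_member.items()}
```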
In the above video call method, the first voice generated by the target member during the video call is translated into first texts in multiple language versions according to the target language familiar to each member participating in the call. After the different versions of the first text are synthesized, as voice translation subtitles, with the source video frames generated by the target member, target video frames carrying the voice translation subtitles are formed. By displaying the target video frame on the target member's video call page and sending each counterpart member the target video frames carrying subtitles in that member's required language, every member participating in the call can understand what the target member is saying in a familiar language without leaving the instant messaging client, which improves the efficiency of the video call and in turn saves video call link resources.
In addition, since the first voice is translated into one version of the first text per target language rather than per call member, members who use the same target language effectively reuse the same first text, reducing the amount of data processing needed to synthesize the source video frames with the different versions of the first text and thereby saving the terminal's data processing resources.
In one embodiment, the above video call method further includes: displaying a language configuration page when a target-language configuration operation is triggered; acquiring the language configuration information set on that page, the language configuration information including the candidate languages for the target member and for each counterpart member participating in the video call; and storing the target member's member ID in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's ID also exists, the server takes the candidate language corresponding to the member ID associated with each piece of language configuration information as that member's target language.
Specifically, when the subtitle synthesis plug-in is installed on both the first terminal and the second terminal, both the target member and the counterpart member can trigger the target-language configuration operation. The terminal then displays the language configuration page in response to the member's operation, generates language configuration information from the page, and sends it to the server, which stores the configuration information in association with the member ID of the sending terminal. For example, when A and B are in a video call, A can set the candidate language associated with A to English and the candidate language associated with B to Chinese; B can likewise set B's candidate language to Chinese and A's to English. The server then stores the configuration information sent by A and by B under their respective member IDs.
Further, the server takes the candidate language corresponding to the member ID associated with each piece of language configuration information as that member's target language, thereby filtering the multiple pieces of configuration information into a single unified language configuration. In the example above, the server extracts the candidate language "English" associated with A's ID from the configuration information sent by A and determines "English" as A's target language, and extracts the candidate language "Chinese" associated with B's ID from the configuration information sent by B and determines "Chinese" as B's target language.
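The server-side filtering just described can be sketched as follows; the data shapes, and the rule that each member's own submission is authoritative for that member, are assumptions made for illustration:

```python
def unify_language_configs(configs_by_sender):
    """Build one unified {member_id: target_language} mapping.

    configs_by_sender: {sender_id: {member_id: candidate_language}} — each
    sender's submitted language configuration information.
    """
    unified = {}
    for sender_id, config in configs_by_sender.items():
        # The candidate language a member associated with themselves is taken
        # as that member's target language.
        if sender_id in config:
            unified[sender_id] = config[sender_id]
    return unified

# A configured {A: English, B: Chinese}; B configured {B: Chinese, A: English}.
configs = {"A": {"A": "en", "B": "zh"}, "B": {"B": "zh", "A": "en"}}
print(unify_language_configs(configs))
```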
In the above embodiment, when multiple pieces of configuration information exist, filtering them by member ID yields a single unified language configuration, so that subsequent terminals or servers can generate the corresponding text from it. Taking the candidate language corresponding to the member ID associated with each piece of configuration information as that member's target language improves the accuracy of the language configuration information and reduces cases where, due to erroneous configuration, a counterpart member receives target video frames whose subtitle language is not one they are familiar with.
In one embodiment, the above video call method further includes: sending the first voice to the server; and receiving the first text returned by the server, obtained by converting the first voice according to each counterpart member's target language.
Specifically, after acquiring the first voice, the subtitle synthesis plug-in sends it to the server, so that the server recognizes and translates the first voice according to the target member's member ID and the unified language configuration information, generates the first text corresponding to each second target language, and returns the first text to the first terminal.
It is easy to understand that the recognition and translation of the first voice in the above video call method can be completed either on the first terminal or on the server. When performed on the first terminal, the terminal can recognize and translate the first voice using the language configuration information stored locally, or pull the unified language configuration information from the server and recognize and translate the first voice according to it. When performed on the server, the server can pull the corresponding language configuration information from the first terminal and use it, or recognize and translate the first voice using the unified language configuration information stored on the server.
In the above embodiment, having the server convert the first voice into the corresponding first text reduces the terminal resources that the terminal would otherwise consume performing the conversion.
In one embodiment, the above video call method further includes: generating a corresponding subtitle image from each version of the first text and caching the subtitle image. Compositing the source video frames with the first text of each target language then includes: querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; if so, compositing the updated subtitle image with each source video frame produced by the target member within a second preset duration before the current time, and deleting the composited subtitle image from the cache. The second preset duration is shorter than the first preset duration.
The first preset duration is set by the developer of the subtitle synthesis plug-in according to the frame rate of the played video. For example, an instant messaging application typically plays video at a rate of 30 frames per second, in which case the developer may set the first preset duration to 30 milliseconds. The second preset duration is the interval at which the subtitle synthesis plug-in reads source video frames from the image capture component. If it is too long, the target video frames received by the counterpart member are delayed too much; if it is too short, the counterpart member receives too few subtitle-embedded target video frames to read the subtitle content. It therefore needs to be set reasonably, for example to 3 seconds.
Specifically, when the video call starts, the image capture component in the terminal collects the target member's image information in real time and caches the image information together with its capture time in an image buffer.
Further, after the subtitle synthesis plug-in generates a subtitle image, it checks whether the preset subtitle buffer already holds a cached subtitle image; if so, it clears the subtitle buffer and then caches the newly generated subtitle image there.
Further, the subtitle synthesis plug-in checks the subtitle buffer for an updated subtitle image at intervals of the first preset duration. When an updated subtitle image is present, the plug-in reads from the image buffer at least one source video frame captured by the image capture component within the second preset duration before the current time, and then deletes the read frames from the image buffer. If no updated subtitle image appears in the subtitle buffer within the second preset duration from the current time, the plug-in sends the source video frames from that interval directly to the counterpart member and deletes the sent frames from the image buffer.
Further, the subtitle synthesis plug-in composites the subtitle image corresponding to each second target language with each source video frame read from the image buffer to obtain the corresponding target video frames, and deletes the composited subtitle image from the subtitle buffer.
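A minimal single-tick sketch of the buffering scheme described above. The function name, the `(timestamp, frame)` list shape, and the string concatenation standing in for pixel compositing are all illustrative assumptions, not the patent's implementation:

```python
def poll_once(subtitle_buffer, frame_buffer, now, second_preset):
    """One polling tick of the subtitle synthesis plug-in.

    subtitle_buffer: list holding at most one cached subtitle image.
    frame_buffer: list of (capture_time, frame) pairs from the image capture
    component. Frames within `second_preset` seconds before `now` are
    composited with the updated subtitle image if one exists, or sent raw
    otherwise; consumed frames are deleted from the buffer.
    """
    recent = [f for t, f in frame_buffer if now - t <= second_preset]
    if subtitle_buffer:
        subtitle = subtitle_buffer.pop()                 # consume cached image
        out = [f"{frame}+{subtitle}" for frame in recent]
    else:
        out = list(recent)                               # no update: send raw
    # delete the frames that have been read/sent from the image buffer
    frame_buffer[:] = [(t, f) for t, f in frame_buffer if now - t > second_preset]
    return out
```

A caller would invoke this every first preset duration (e.g. every 30 ms), while the capture side appends `(time, frame)` pairs and the text side replaces the subtitle buffer's contents whenever a new first text arrives.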
In the above embodiment, querying the subtitle buffer for updated subtitle images at regular intervals allows the latest subtitle image to be obtained promptly, so that the composited target video frames can be sent to the counterpart member in time. Compositing the latest subtitle image with multiple source video frames allows the counterpart member to read the subtitle content across multiple target video frames.
In one embodiment, generating a corresponding subtitle image from each version of the first text includes: determining the subtitle width according to the image width of the source video frame; converting the subtitle width into a character-count threshold for each target language; splitting the corresponding first text into multiple sub-texts according to each threshold; determining the subtitle height of the corresponding first text according to the number of its sub-texts; and adding the first text as subtitle content to a background image generated from the subtitle width and subtitle height, obtaining the subtitle image.
The character-count threshold is the maximum number of characters that a single subtitle line can display.
Specifically, the subtitle synthesis plug-in determines the image width of the source video frame and derives the subtitle width from a preset width ratio. For example, if the preset subtitle width ratio is two thirds, the plug-in sets the subtitle width to two thirds of the image width of the source video frame.
Further, the terminal stores, for each target language, the correspondence between the width of a single character and the spacing between characters. The subtitle synthesis plug-in obtains the second target language corresponding to each first text and, according to that language's information, determines the corresponding single-character width and inter-character spacing. From the obtained subtitle width, single-character width, and inter-character spacing it computes the character-count threshold for the second target language, i.e. the number of characters a single subtitle line can display.
Further, the subtitle synthesis plug-in counts the characters in the first text to obtain the total character count, divides the total by the character-count threshold to obtain the number of sub-texts, and creates that many sub-texts. Starting from the first character of the first text, the plug-in reads up to the threshold number of characters and stores them in a sub-text. It then deletes the read characters from the first text and continues reading threshold-sized runs of characters into sub-texts that have not yet been filled, until all characters of the first text have been consumed.
Further, the subtitle synthesis plug-in counts the number of sub-texts of the first text and determines the number of subtitle lines in the subtitle image from that count. For example, with three sub-texts the plug-in treats the subtitle image to be generated as having three subtitle lines, and computes the subtitle height of the first text from the preset single-line subtitle height and the total number of lines.
Further, the subtitle synthesis plug-in generates a background image of the corresponding size from the subtitle width and subtitle height, and adds the characters of each sub-text to the background image as subtitle content.
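The layout computation described in this embodiment can be sketched as follows. The width ratio, character metrics, and line height are illustrative assumptions, and strings stand in for rendered images:

```python
def layout_subtitle(image_width, text, char_width, char_gap,
                    width_ratio=2 / 3, line_height=20):
    """Return (subtitle_width, subtitle_height, subtitle_lines).

    Follows the steps above: width from the source frame's image width and a
    preset ratio; a per-language character-count threshold from character
    width plus spacing; the text split into threshold-sized sub-texts; and
    height from the sub-text count times a single-line height.
    """
    subtitle_width = int(image_width * width_ratio)
    # maximum characters a single subtitle line can display
    max_chars = max(1, subtitle_width // (char_width + char_gap))
    # split the first text into sub-texts, one per subtitle line
    lines = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    subtitle_height = len(lines) * line_height
    return subtitle_width, subtitle_height, lines

w, h, lines = layout_subtitle(image_width=720, text="A" * 40,
                              char_width=14, char_gap=2)
print(w, h, [len(line) for line in lines])
```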
In the above embodiment, determining the subtitle width from the image width of the source video frame reduces the probability that the generated subtitle image is wider than the source video frame and the subtitles spill outside the video picture; determining the background image height from the number of sub-texts avoids generating unnecessary background image area.
In one embodiment, the above video call method further includes: capturing the second voice produced by the counterpart member during the video call; obtaining the second text produced by converting the second voice according to the target member's target language; and displaying the second text.
Specifically, during a video call, the second voice of the counterpart member can be sent to the first terminal through the instant messaging application on the second terminal. The instant messaging application on the first terminal receives the second voice and passes it to the audio playback component. The subtitle synthesis plug-in on the first terminal monitors whether the audio playback component has received second voice; when it has, the plug-in acquires the second voice and recognizes and translates it according to the first target language corresponding to the target member in the language configuration information, obtaining the second text.
Further, the subtitle synthesis plug-in displays the generated second text on the screen of the first terminal.
FIG. 5 is a schematic diagram of displaying the second text in a pop-up window in an embodiment. The first terminal may display the second text in the form of a pop-up window, or in the form of a prompt message as shown in FIG. 6, which is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment.
In the above embodiment, because the second text can be displayed in multiple forms, the target member can freely choose a suitable display form based on actual needs, which greatly improves the user experience.
In one embodiment, from the moment the second text is displayed, the first terminal determines whether the target member actively closes it within a preset duration. If the target member does not, the first terminal can generate a close instruction for the second text to close the displayed text automatically, so that once the target member has finished reading the second text it closes on its own, saving the display resources the terminal consumes displaying it.
In one embodiment, the target member can manually close the displayed second text, for example by tapping a close control or by a sliding operation on the screen.
In one embodiment, even when the target member minimizes the instant messaging application, the first terminal can still display the second text as a pop-up window or prompt message.
In the above embodiment, displaying the second text as a pop-up window or prompt message decouples its display from the video call page, so that even when the instant messaging application moves to the background, the target member can still communicate smoothly with the counterpart member based on the content of the second text.
In one embodiment, in a multi-party video call, the second voice collected via the audio playback component may mix the voices of several counterpart members. In that case the subtitle synthesis plug-in extracts timbre information from the second voice, divides the second voice into multiple second sub-voices according to the timbre information, and converts the second sub-voices according to the target member's target language to obtain multiple second texts, which the first terminal then displays correspondingly. Dividing the second voice by timbre lets the plug-in distinguish the second sub-voices of different counterpart members, so that in a multi-party call scene, displaying multiple second texts helps the target member distinguish what different counterpart members are saying, improving the communication efficiency of multi-party video calls.
In the above embodiment, displaying the second text on the terminal lets the target member understand what the counterpart member is saying even when the subtitle synthesis plug-in is not installed on the second terminal, so that the video call can proceed smoothly.
In one embodiment, the video call page includes a video frame display area for the target member and for each counterpart member. The above video call method further includes: displaying, in the target member's display area, the composited target video frame in the target member's own target language, denoted the first target video frame; obtaining a second target video frame from a counterpart member, the second target video frame being obtained by converting the second voice produced by the counterpart member during the call into second text according to the target member's target language and compositing the converted second text with the source video frames produced by the counterpart member during the call; and displaying the second target video frame in the counterpart member's display area.
Specifically, the subtitle synthesis plug-in can convert the first voice according to the first target language corresponding to the target member to obtain the corresponding first text, and composite the first text with the source video frames to obtain the first target video frame in the target member's target language.
Further, when the subtitle synthesis plug-in is installed on the second terminal, the second terminal can convert the second voice produced by the counterpart member during the video call into second text according to the target member's target language, composite the converted second text with the source video frames produced by the counterpart member during the call to obtain the second target video frame, and send the composited second target video frame to the first terminal.
Further, after the first terminal obtains the first target video frame and the second target video frame, it obtains the page size of the video call page and, according to the page size, divides the page into a video frame display area for the target member and display areas for the counterpart members. For example, the first terminal counts the total number of members participating in the video call, divides the page evenly into that many video frame display areas, and by convention assigns the first area to the target member.
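The even division of the call page can be sketched as below, assuming (for illustration only) a horizontal-strip layout with the first strip assigned by convention to the target member:

```python
def divide_display_areas(page_width, page_height, member_ids):
    """Split the page into equal horizontal strips, one per call member.

    Returns {member_id: (x, y, width, height)}; member_ids[0] is the target
    (local) member, matching the convention described above.
    """
    strip_height = page_height // len(member_ids)
    return {
        member: (0, i * strip_height, page_width, strip_height)
        for i, member in enumerate(member_ids)
    }

# Target member in a call with B and C on a 1080x1920 page.
areas = divide_display_areas(1080, 1920, ["target", "B", "C"])
print(areas)
```

Each target video frame would then be resized to its member's area so the frame is displayed in full.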
Further, FIG. 7 is a schematic diagram of video frame display areas in an embodiment. The first terminal obtains the sizes of the display areas corresponding to the target member and the counterpart members, and resizes the first and second target video frames according to those sizes, so that the display areas shown in FIG. 7 can present the first and second target video frames in full.
In one embodiment, the target member can resize the video frame display areas as needed. For example, when the target member is in a video call with B and C, the target member can enlarge the display area corresponding to B, whereupon the display areas corresponding to the target member and to C shrink correspondingly, making the video call better match the target member's actual needs.
In one embodiment, when the target member finds an error in the subtitles of the displayed first target video frame, the target member can mark the erroneous characters; the subtitle synthesis plug-in then generates a correction page in response to the target member's marking operation. On the correction page, the target member can enter the characters that should be displayed in place of the erroneous ones.
Further, the subtitle synthesis plug-in stores the erroneous characters and the characters to be displayed in a character library. When the plug-in recognizes the same erroneous characters again, it can decide, based on the intended characters in the library, whether to correct them.
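A minimal sketch of the correction library just described; the dictionary shape and simple substring replacement are assumptions made for illustration:

```python
def apply_corrections(text, corrections):
    """Replace previously marked erroneous characters with their intended
    replacements, as stored in the character library.

    corrections: {erroneous_string: intended_string}
    """
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text

library = {"recieve": "receive"}
print(apply_corrections("please recieve the call", library))
```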
In the above embodiment, displaying the first target video frame in the corresponding display area lets the target user check in real time whether the subtitle content shown in the first target video frame is correct, so erroneous characters can be marked promptly when found, which in turn improves the accuracy of the subtitle synthesis plug-in's speech translation.
In one embodiment, the above video call method further includes: capturing the second voice produced by the counterpart member during the video call; obtaining the second text produced by converting the second voice according to the target member's target language; determining a display style for the obtained second text according to the size of each counterpart member's video frame display area; and displaying the obtained second text in a pop-up window on the video call page according to that display style.
The display style of the second text includes the transparency, size, and color of its characters.
Specifically, the subtitle synthesis plug-in obtains the second voice produced during the video call from the audio playback component and converts it according to the target member's target language to obtain the second text. The plug-in obtains the size of each counterpart member's video frame display area; when every counterpart display area is smaller than an area threshold, it can be assumed that the target member cannot clearly read the subtitles shown in those areas, so the plug-in, based on a preset configuration file, correspondingly lowers the character transparency, increases the character size, and changes the character color to a more conspicuous one.
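The style rule in this embodiment can be sketched as follows; the area threshold and the concrete style values are illustrative assumptions, not values from the patent:

```python
def pick_text_style(area_sizes, area_threshold=100_000):
    """Choose a pop-up text style from the counterpart display-area sizes.

    area_sizes: list of (width, height) for each counterpart member's video
    frame display area. When every area is below the threshold, return a more
    prominent style (opaque, larger, conspicuous color); otherwise a subdued
    one, reducing interference when the subtitles are already readable.
    """
    all_small = all(w * h < area_threshold for w, h in area_sizes)
    if all_small:
        return {"opacity": 1.0, "font_size": 24, "color": "yellow"}
    return {"opacity": 0.6, "font_size": 16, "color": "white"}

print(pick_text_style([(200, 300)]))   # small areas -> prominent style
print(pick_text_style([(500, 400)]))   # large area  -> subdued style
```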
In one embodiment, the subtitle synthesis plug-in can generate a style adjustment control on the terminal, through which the target member can adjust the style of the second text.
In the above embodiment, providing a style adjustment control on the terminal lets the target member adjust the display style of the second text independently, improving the user experience.
In the above embodiment, adjusting the style of the second text in real time according to the size of the counterpart members' display areas not only reduces cases where the target member cannot clearly read the subtitle content because the display areas are too small, but also, when the display areas are large enough, lowers the prominence of the second text and reduces the interference to the target member caused by redundantly displaying the counterpart members' voice information.
In one embodiment, the terminal includes an audio collection component and an audio playback component; in the above video call method, the first voice is generated based on the audio collection component, and the second voice is obtained based on the audio playback component.
Specifically, during a video call, the audio collection component in the first terminal, such as a microphone, records the first voice of the target member in real time and transmits it as a voice stream to the subtitle synthesis plug-in, which generates the corresponding first text.
The audio collection component in the second terminal likewise collects the second voice of the counterpart member in real time and sends it to the first terminal through the instant messaging application. The instant messaging application in the first terminal receives the second voice and forwards it to the audio playback component. The subtitle synthesis plug-in in the first terminal monitors whether the audio playback component has received the second voice; when it has, the plug-in obtains the second voice and, according to the first target language corresponding to the target member in the language configuration information, recognizes and translates it to obtain the second text.
In the foregoing embodiment, by separately reading the voice collected by the audio collection component and the voice received by the audio playback component, the subtitle synthesis plug-in can clearly distinguish the voice produced by the target member from the voice produced by the counterpart member, so that the first text and the second text can subsequently be generated from the respective voices.
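The routing of the two audio sources can be illustrated with a minimal sketch. The function and source labels below are assumptions for illustration; the patent does not prescribe this interface.

```python
def route_voice(source, voice):
    """Tag a voice stream by the component it was read from.

    source: 'capture' for the audio collection component (target member),
            'playback' for the audio playback component (counterpart member).
    Returns which text the voice should be converted into, paired with the voice.
    """
    if source == "capture":
        return ("first_text", voice)   # target member's speech -> first text
    if source == "playback":
        return ("second_text", voice)  # counterpart member's speech -> second text
    raise ValueError("unknown audio source: %r" % source)
```

Because the source component is known at read time, no speaker identification is needed to decide whether a given utterance becomes the first text or the second text.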
It should be understood that although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time, but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least some of the other steps, or of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 8, a video call apparatus 800 is provided, including a first text generation module 802, a target video frame synthesis module 804, and a page display module 806, wherein:
The first text generation module 802 is configured to collect the first voice and the source video frames generated by the target member in the video call, and to convert the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text.
The target video frame synthesis module 804 is configured to synthesize the source video frames with the first text corresponding to each target language, to obtain a target video frame corresponding to each target language.
The page display module 806 is configured to send the obtained target video frames of each target language to the corresponding counterpart members.
In one embodiment, as shown in FIG. 9, the above video call apparatus 800 further includes a language configuration module 808, configured to display a language configuration page when a configuration operation for the target language is triggered; obtain the language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and store the member identifier of the target member in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's member identifier exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
In one embodiment, the language configuration module 808 is further configured to send the first voice to the server, and to receive the first text returned by the server, obtained by converting the first voice according to the target language of each counterpart member.
In one embodiment, the target video frame synthesis module 804 is further configured to generate a corresponding subtitle image based on each kind of first text and cache the subtitle images; query, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and if so, synthesize the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and delete the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
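The timing logic of this embodiment can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: `POLL_INTERVAL` and `FRAME_WINDOW` stand in for the first and second preset durations, and the cache is modeled as a plain dictionary keyed by target language.

```python
POLL_INTERVAL = 1.0   # first preset duration (seconds); hypothetical value
FRAME_WINDOW = 0.5    # second preset duration; must be < POLL_INTERVAL

def composite_pending(cache, frames, now):
    """Run one polling pass over the subtitle cache.

    cache:  dict mapping target language -> updated subtitle image
    frames: list of (timestamp, source_frame) produced by the target member
    now:    current time in seconds

    Composites each cached subtitle image onto every source frame produced
    within FRAME_WINDOW before `now`, then removes the image from the cache.
    Returns the list of (source_frame, subtitle_image) pairs composited.
    """
    composited = []
    for language, subtitle in list(cache.items()):
        recent = [f for t, f in frames if now - FRAME_WINDOW <= t <= now]
        for frame in recent:
            composited.append((frame, subtitle))
        del cache[language]  # a subtitle image leaves the cache once synthesized
    return composited
```

Because the frame window is shorter than the polling interval, each subtitle image is composited onto only the most recent frames, keeping the displayed text aligned with current speech.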
In one embodiment, the target video frame synthesis module 804 is further configured to determine a subtitle width according to the image width of the source video frame; convert the subtitle width into a character count threshold corresponding to each target language; split the corresponding first text into multiple sub-texts according to the different character count thresholds; determine the subtitle height of the corresponding first text according to the number of its sub-texts; and add the first text as subtitle content to a background image generated according to the subtitle width and subtitle height, to obtain the subtitle image.
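The layout steps above (width from the frame, per-language character threshold, splitting into sub-texts, height from the line count) can be sketched as follows. The constants `CHAR_WIDTH` and `LINE_HEIGHT`, and the 0.9 width ratio, are hypothetical rendering parameters, not values from the patent.

```python
CHAR_WIDTH = {"en": 10, "zh": 20}   # assumed average pixel width per character
LINE_HEIGHT = 24                    # assumed pixel height per subtitle line

def layout_subtitle(frame_width, language, text):
    """Return (subtitle_width, subtitle_height, sub_texts) for one first text."""
    subtitle_width = int(frame_width * 0.9)              # width from frame width
    chars_per_line = subtitle_width // CHAR_WIDTH[language]  # character threshold
    # Split the first text into sub-texts according to the threshold.
    sub_texts = [text[i:i + chars_per_line]
                 for i in range(0, len(text), chars_per_line)]
    subtitle_height = LINE_HEIGHT * len(sub_texts)       # height from line count
    return subtitle_width, subtitle_height, sub_texts
```

Note that because the character threshold depends on the language (wider glyphs give a lower threshold), the same sentence may occupy a different number of subtitle lines in each target language.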
In one embodiment, the video call apparatus 800 further includes a second text generation module 810, configured to collect the second voice generated by the counterpart member in the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; and display the second text.
In one embodiment, the video call apparatus 800 further includes a video frame display area determination module 812, configured to display, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as the first target video frame; obtain a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, the second voice generated by the counterpart member in the video call into the second text, and synthesizing the converted second text with the source video frames generated by the counterpart member in the video call; and display the second target video frame in the video frame display area corresponding to the counterpart member.
In one embodiment, the video frame display area determination module 812 is further configured to collect the second voice generated by the counterpart member in the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; determine the display style of the obtained second text according to the size of the video frame display area corresponding to each counterpart member; and display the obtained second text in a pop-up window on the video call page according to the display style.
In one embodiment, the video call apparatus 800 further includes a voice acquisition module 814, configured to collect the first voice based on the audio collection component and to collect the second voice based on the audio playback component.
For the specific limitations of the video call apparatus, reference may be made to the limitations of the video call method above, which will not be repeated here. The modules in the above video call apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be the first terminal, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, a display screen, an audio collection device, an audio playback device, an image collection device, and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a video call method. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements: collecting the first voice and the source video frames generated by the target member in a video call; converting the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text; synthesizing the source video frames with the first text corresponding to each target language, to obtain target video frames corresponding to each target language; and sending the obtained target video frames of each target language to the corresponding counterpart members.
In one embodiment, when executing the computer program, the processor further implements: displaying a language configuration page when a configuration operation for the target language is triggered; obtaining the language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and storing the member identifier of the target member in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's member identifier exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
In one embodiment, when executing the computer program, the processor further implements: sending the first voice to the server, and receiving the first text returned by the server, obtained by converting the first voice according to the target language of each counterpart member.
In one embodiment, when executing the computer program, the processor further implements: generating a corresponding subtitle image based on each kind of first text, and caching the subtitle images. Synthesizing the source video frames with the first text corresponding to each target language includes: querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
In one embodiment, when executing the computer program, the processor further implements: determining a subtitle width according to the image width of the source video frame; converting the subtitle width into a character count threshold corresponding to each target language; splitting the corresponding first text into multiple sub-texts according to the different character count thresholds; determining the subtitle height of the corresponding first text according to the number of its sub-texts; and adding the first text as subtitle content to a background image generated according to the subtitle width and subtitle height, to obtain the subtitle image.
In one embodiment, when executing the computer program, the processor further implements: collecting the second voice generated by the counterpart member in the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
In one embodiment, the video call page includes video frame display areas respectively corresponding to the target member and each counterpart member. When executing the computer program, the processor further implements: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as the first target video frame; obtaining a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, the second voice generated by the counterpart member in the video call into the second text, and synthesizing the converted second text with the source video frames generated by the counterpart member in the video call; and displaying the second target video frame in the video frame display area corresponding to the counterpart member.
In one embodiment, when executing the computer program, the processor further implements: collecting the second voice generated by the counterpart member in the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining the display style of the obtained second text according to the size of the video frame display area corresponding to each counterpart member; and displaying the obtained second text in a pop-up window on the video call page according to the display style.
In one embodiment, the terminal includes an audio collection component and an audio playback component. When executing the computer program, the processor further implements the following: the first voice is generated based on the audio collection component, and the second voice is generated based on the audio playback component.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements: collecting the first voice and the source video frames generated by the target member in a video call; converting the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text; synthesizing the source video frames with the first text corresponding to each target language, to obtain target video frames corresponding to each target language; and sending the obtained target video frames of each target language to the corresponding counterpart members.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (19)

  1. A video call method, comprising:
    collecting a first voice and source video frames generated by a target member in a video call;
    converting the first voice according to preset target languages respectively pointed to by counterpart members participating in the video call, to obtain a first text;
    synthesizing the source video frames with the first text corresponding to each target language, to obtain a target video frame corresponding to each target language; and
    sending the obtained target video frame of each target language to the corresponding counterpart member.
  2. The method according to claim 1, further comprising:
    displaying a language configuration page when a configuration operation for the target languages is triggered;
    obtaining language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and
    storing a member identifier of the target member in association with the language configuration information on a server, so that when language configuration information associated with a member identifier of a counterpart member exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
  3. The method according to claim 2, further comprising:
    sending the first voice to a server; and
    receiving, from the server, the first text obtained by converting the first voice according to the target language of each counterpart member.
  4. The method according to claim 1, further comprising:
    generating a corresponding subtitle image based on each kind of first text, and caching the subtitle images;
    wherein synthesizing the source video frames with the first text corresponding to each target language comprises:
    querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and
    if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
  5. The method according to claim 4, wherein generating the corresponding subtitle image based on each kind of first text comprises:
    determining a subtitle width according to an image width of the source video frame;
    converting the subtitle width into a character count threshold corresponding to each target language;
    splitting the corresponding first text into multiple sub-texts according to the different character count thresholds;
    determining a subtitle height of the corresponding first text according to the number of its sub-texts; and
    adding the first text as subtitle content to a background image generated according to the subtitle width and the subtitle height, to obtain the subtitle image.
  6. The method according to claim 1, further comprising:
    collecting a second voice generated by the counterpart member in the video call;
    obtaining a second text obtained by converting the second voice according to the target language corresponding to the target member; and
    displaying the second text.
  7. The method according to claim 1, wherein a page of the video call includes video frame display areas respectively corresponding to the target member and each counterpart member; the method further comprising:
    displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as a first target video frame;
    obtaining a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, a second voice generated by the counterpart member in the video call into a second text, and synthesizing the converted second text with source video frames generated by the counterpart member in the video call; and
    displaying the second target video frame in the video frame display area corresponding to the counterpart member.
  8. The method according to claim 7, further comprising:
    collecting the second voice generated by the counterpart member in the video call;
    obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member;
    determining a display style of the obtained second text according to a size of the video frame display area corresponding to each counterpart member; and
    displaying the obtained second text in a pop-up window on the page of the video call according to the display style.
  9. The method according to any one of claims 6 to 8, wherein the terminal includes an audio collection component and an audio playback component; the first voice is generated based on the audio collection component, and the second voice is generated based on the audio playback component.
  10. The method according to claim 1, after collecting the first voice and the source video frames generated by the target member in the video call, further comprising:
    buffering the collected first voice and determining an input time of the first voice;
    determining whether a new first voice is received within a preset duration from the input time;
    in response to receiving a new first voice, continuing to buffer the new first voice; and
    in response to no new first voice being received, splicing the buffered first voices to obtain a spliced first voice.
  11. The method according to claim 1, wherein said converting the first voice according to the preset target languages respectively specified by the counterpart members participating in the video call to obtain the first text comprises:
    recognizing the first voice according to a first target language, and converting the first voice into the first text corresponding to the first target language according to the recognition result;
    checking whether a second target language is the same as the first target language; and
    in response to the second target language being different from the first target language, counting the language version kinds of the second target language, and translating the first text corresponding to the first target language according to each different language version kind of the second target language, to obtain the first text corresponding to the second target language.
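The recognize-once, translate-per-distinct-language structure of claim 11 could be sketched like this. The function names and the injected `recognize`/`translate` callables are placeholders for real speech-recognition and translation services, not anything named in the patent.

```python
def convert_voice(first_voice, first_target_language, second_target_languages,
                  recognize, translate):
    """Sketch of claim 11: recognize the voice in the first target language,
    then translate only for the distinct second target languages that differ
    from it, so identical languages share one translation."""
    # Recognize the first voice according to the first target language.
    texts = {first_target_language: recognize(first_voice, first_target_language)}
    # Count the distinct second target languages and translate once per kind.
    for lang in set(second_target_languages):
        if lang != first_target_language:  # same language needs no translation
            texts[lang] = translate(texts[first_target_language], lang)
    return texts
```

Deduplicating with `set()` mirrors the claim's "counting the language version kinds": with many listeners sharing a language, each language is translated exactly once.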
  12. The method according to claim 2, further comprising:
    sending the language configuration information to the counterpart member, and monitoring the operation behavior of the counterpart member;
    when a target language change operation is triggered, displaying a language change page;
    re-determining the target language corresponding to the counterpart member on the language change page; and
    converting the first voice according to the re-determined target language to obtain the corresponding first text.
  13. A video call method, applied among multiple members, comprising:
    acquiring a first voice and a source video frame generated by a target member in a video call;
    converting the first voice according to preset target languages respectively specified by the other members participating in the video call to obtain a first text;
    synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and
    sending the obtained target video frame of each target language to the other members.
  14. The method according to claim 13, further comprising:
    acquiring a member identifier and language configuration information of the target member, storing the member identifier in association with the language configuration information, and using the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
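The member-to-language association of claim 14 amounts to a keyed store. A minimal sketch, in which the class name, method names, and the default-language fallback are illustrative assumptions:

```python
class LanguageRegistry:
    """Sketch of claim 14: store each member identifier in association with
    its language configuration, and expose the associated candidate language
    as that member's target language."""

    def __init__(self, default_language="en"):
        self._config = {}  # member identifier -> candidate language
        self._default = default_language

    def register(self, member_id, candidate_language):
        # Store the member identifier in association with the language config.
        self._config[member_id] = candidate_language

    def target_language(self, member_id):
        # The candidate language associated with the member identifier
        # becomes that member's target language.
        return self._config.get(member_id, self._default)
```

The fallback default is an addition for robustness; the claim itself only describes the association step.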
  15. The method according to claim 13, further comprising:
    generating a corresponding subtitle image based on each first text, and buffering the subtitle image;
    wherein said synthesizing the source video frame with the first text corresponding to each target language comprises:
    querying, every first preset duration, whether an updated subtitle image exists in the buffer; and
    if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the buffer; the second preset duration being shorter than the first preset duration.
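One iteration of the claim-15 query-and-composite loop (run every first preset duration) could be sketched as below. The function name and the injected `overlay` callable, which stands in for the actual image compositing routine, are illustrative assumptions.

```python
def composite_step(subtitle_cache, recent_frames, overlay):
    """Sketch of one claim-15 iteration: if an updated subtitle image exists
    in the cache, composite it onto every source frame produced within the
    second preset duration, then delete the consumed image from the cache."""
    if not subtitle_cache:
        return recent_frames  # no updated subtitle: frames pass through as-is
    subtitle = subtitle_cache.pop(0)  # take the updated image and evict it
    return [overlay(frame, subtitle) for frame in recent_frames]
```

Keeping the second duration shorter than the first, as the claim requires, ensures each subtitle image is composited onto frames from only one polling window.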
  16. The method according to claim 15, wherein said generating a corresponding subtitle image based on each first text comprises:
    determining a subtitle width according to the image width of the source video frame;
    converting the subtitle width into a character count threshold corresponding to each target language;
    splitting the corresponding first text into multiple sub-texts according to the different character count thresholds;
    determining the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; and
    adding the first text as subtitle content to a background image generated according to the subtitle width and the subtitle height, obtaining the subtitle image, and sending it to the corresponding member.
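The layout arithmetic in claim 16 could be sketched as follows. All constants (margin, average character width, line height) and the function name are illustrative assumptions; a real implementation would derive the character threshold from font metrics per target language.

```python
import textwrap

def subtitle_layout(frame_width, first_text, avg_char_width=16,
                    line_height=24, margin=40):
    """Sketch of claim 16: derive the subtitle width from the frame width,
    convert it to a character count threshold, split the first text into
    sub-texts, and size the subtitle height from the sub-text count."""
    subtitle_width = frame_width - 2 * margin                    # from image width
    chars_per_line = max(1, subtitle_width // avg_char_width)    # char threshold
    sub_texts = textwrap.wrap(first_text, chars_per_line)        # split sub-texts
    subtitle_height = len(sub_texts) * line_height               # from count
    return subtitle_width, subtitle_height, sub_texts
```

The returned width and height would size the background image onto which the sub-texts are drawn line by line.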
  17. A video call apparatus, the apparatus comprising:
    a first text generation module, configured to collect a first voice and a source video frame generated by a target member in a video call, and to convert the first voice according to preset target languages respectively specified by the counterpart members participating in the video call to obtain a first text;
    a target video frame synthesis module, configured to synthesize the source video frame with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and
    a page display module, configured to send the obtained target video frame of each target language to the corresponding counterpart member.
  18. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 12.
  19. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
PCT/CN2020/118049 2019-09-27 2020-09-27 Video call method and apparatus, computer device and storage medium WO2021057957A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910925194.9A CN112584078B (en) 2019-09-27 2019-09-27 Video call method, video call device, computer equipment and storage medium
CN201910925194.9 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021057957A1 2021-04-01

Family

ID=75110185

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118049 WO2021057957A1 (en) 2019-09-27 2020-09-27 Video call method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112584078B (en)
WO (1) WO2021057957A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627301A (en) * 2021-08-02 2021-11-09 科大讯飞股份有限公司 Real-time video information extraction method, device and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113225614A (en) * 2021-04-20 2021-08-06 深圳市九洲电器有限公司 Video playing method, device, server and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002163400A (en) * 2000-11-28 2002-06-07 Mitsuaki Arita Language conversion mediating method, language conversion mediation processor and computer readable recording medium
CN101542462A (en) * 2007-05-16 2009-09-23 莫卡有限公司 Establishing and translating within multilingual group messaging sessions using multiple messaging protocols
US20140157113A1 (en) * 2012-11-30 2014-06-05 Ricoh Co., Ltd. System and Method for Translating Content between Devices
CN104219459A (en) * 2014-09-30 2014-12-17 上海摩软通讯技术有限公司 Video language translation method and system and intelligent display device
CN104780335A (en) * 2015-03-26 2015-07-15 中兴通讯股份有限公司 Method and device for WebRTC P2P (web real-time communication peer-to-peer) audio and video call
CN106464768A (en) * 2014-05-27 2017-02-22 微软技术许可有限责任公司 In-call translation
CN106462573A (en) * 2014-05-27 2017-02-22 微软技术许可有限责任公司 In-call translation
CN109688363A (en) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 The method and system of private chat in the multilingual real-time video group in multiple terminals
CN109688367A (en) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 The method and system of the multilingual real-time video group chat in multiple terminals

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004215126A (en) * 2003-01-08 2004-07-29 Cyber Business Corp Multilanguage adaptive moving picture delivery system
CN101262611B (en) * 2005-05-01 2010-10-13 腾讯科技(深圳)有限公司 A stream media player
US8260604B2 (en) * 2008-10-29 2012-09-04 Google Inc. System and method for translating timed text in web video
US8913188B2 (en) * 2008-11-12 2014-12-16 Cisco Technology, Inc. Closed caption translation apparatus and method of translating closed captioning
CN105959772B (en) * 2015-12-22 2019-04-23 合一网络技术(北京)有限公司 Streaming Media and the instant simultaneous display of subtitle, matched processing method, apparatus and system
CN107690089A (en) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 Data processing method, live broadcasting method and device
CN106782545B (en) * 2016-12-16 2019-07-16 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is converted to writing record
CN109246472A (en) * 2018-08-01 2019-01-18 平安科技(深圳)有限公司 Video broadcasting method, device, terminal device and storage medium
CN109274831B (en) * 2018-11-01 2021-08-13 科大讯飞股份有限公司 Voice call method, device, equipment and readable storage medium



Also Published As

Publication number Publication date
CN112584078B (en) 2022-03-18
CN112584078A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN108615527B (en) Data processing method, device and storage medium based on simultaneous interpretation
US10176366B1 (en) Video relay service, communication system, and related methods for performing artificial intelligence sign language translation services in a video relay service environment
US20220239882A1 (en) Interactive information processing method, device and medium
CN110035326A (en) Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
CN110691204B (en) Audio and video processing method and device, electronic equipment and storage medium
WO2017072534A2 (en) Communication system and method
WO2021057957A1 (en) Video call method and apparatus, computer device and storage medium
RU2500081C2 (en) Information processing device, information processing method and recording medium on which computer programme is stored
CN109782997B (en) Data processing method, device and storage medium
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
EP4246425A1 (en) Animal face style image generation method and apparatus, model training method and apparatus, and device
CN111107283B (en) Information display method, electronic equipment and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN110768896B (en) Session information processing method and device, readable storage medium and computer equipment
US10504519B1 (en) Transcription of communications
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
TWM574267U (en) Live broadcast system of synchronous and automatic translation of real-time voice and subtitle
TWI769520B (en) Multi-language speech recognition and translation method and system
CN113709521B (en) System for automatically matching background according to video content
CN114373464A (en) Text display method and device, electronic equipment and storage medium
US20200184973A1 (en) Transcription of communications
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
WO2021092733A1 (en) Subtitle display method and apparatus, electronic device and storage medium
TW202009750A (en) Live broadcast system with instant voice and automatic synchronous translation subtitle and the method of the same enables the other party to directly play the original video information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20868986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/08/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20868986

Country of ref document: EP

Kind code of ref document: A1