WO2021057957A1 - Video call method and apparatus, computer device and storage medium - Google Patents

Video call method and apparatus, computer device and storage medium Download PDF

Info

Publication number
WO2021057957A1
WO2021057957A1 (application PCT/CN2020/118049)
Authority
WO
WIPO (PCT)
Prior art keywords
target
text
language
voice
video frame
Prior art date
Application number
PCT/CN2020/118049
Other languages
French (fr)
Chinese (zh)
Inventor
严伟波
Original Assignee
深圳市万普拉斯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市万普拉斯科技有限公司
Publication of WO2021057957A1 publication Critical patent/WO2021057957A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04Real-time or near real-time messaging, e.g. instant messaging [IM]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10Multimedia information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4856End-user interface for client configuration for language selection, e.g. for the menu or subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • This application relates to a video call method and apparatus, a computer device, and a storage medium.
  • In conventional schemes, members of a call can only break away from the instant messaging client during the video call and use a third-party translation device to translate the voice data of other members; they must wait for the translation result from the third-party device before making a voice reply based on it.
  • This application provides a video call method.
  • The method includes: collecting the first voice and source video frame generated by a target member in a video call; converting the first voice according to the preset target languages associated with the counterpart members participating in the video call to obtain first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
  • This application also provides a video call method, including: obtaining the first voice and source video frame generated by a target member in a video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language; and sending each obtained target video frame to the corresponding other members.
  • This application also provides a video call device.
  • The device includes: a first text generation module configured to collect the first voice and source video frame generated by the target member in the video call and to convert the first voice according to the preset target languages to obtain the first texts; a target video frame synthesis module configured to synthesize the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and a page display module configured to send the obtained target video frames in each target language to the corresponding counterpart members.
  • This application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the following steps are implemented: collecting the first voice and source video frame generated by the target member in a video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain the first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
  • This application also provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by the processor, the following steps are implemented: collecting the first voice and source video frame generated by the target member in the video call; converting the first voice according to the preset target languages associated with the other members participating in the video call to obtain the first texts; synthesizing the source video frame with the first text corresponding to each target language to obtain the target video frame for each target language; and sending each obtained target video frame to the corresponding counterpart member.
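  • The claimed flow — collect the voice and frame, convert per target language, synthesize subtitles, and send — can be sketched as follows. This is a minimal illustration; the function names (`transcribe`, `translate`, `synthesize`) are hypothetical stand-ins, not APIs defined by the patent.

```python
def video_call_pipeline(first_voice, source_frame, target_languages,
                        transcribe, translate, synthesize):
    """For each counterpart member, convert the speaker's voice into that
    member's target language and embed the text into the source frame."""
    frames = {}
    for member_id, lang in target_languages.items():
        first_text = translate(transcribe(first_voice), lang)
        frames[member_id] = synthesize(source_frame, first_text)
    return frames

# Toy stand-ins for the speech, translation, and overlay stages.
result = video_call_pipeline(
    first_voice="ni hao",
    source_frame="FRAME",
    target_languages={"B": "en", "C": "ja"},
    transcribe=lambda v: "hello" if v == "ni hao" else v,
    translate=lambda t, lang: f"{t}[{lang}]",
    synthesize=lambda frame, text: f"{frame}+{text}",
)
```

  • Note that the voice is transcribed once and only the translation step runs per language, mirroring the reuse of the first text described later in the disclosure.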
  • Fig. 1 is an application scenario diagram of a video call method in an embodiment;
  • Fig. 2 is a schematic flowchart of a video call method in an embodiment;
  • Fig. 3 is a schematic diagram of a language configuration page in an embodiment;
  • Fig. 4 is a schematic diagram of a target video frame in an embodiment;
  • Fig. 5 is a schematic diagram of a pop-up window displaying a second text in an embodiment;
  • Fig. 6 is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment;
  • Fig. 7 is a schematic diagram of a video frame display area in an embodiment;
  • Fig. 8 is a structural block diagram of a video call device in an embodiment;
  • Fig. 9 is a structural block diagram of a video call device in another embodiment; and
  • Fig. 10 is a diagram of the internal structure of a computer device in an embodiment.
  • The method described above not only relies on a third-party translation device, which raises communication costs, but also requires constant switching between the terminal and the third-party device, which complicates operation.
  • In addition, waiting for the translation results returned by the third-party device causes repeated pauses during the video call, prolongs the entire call, and wastes video call link resources.
  • Fig. 1 is an application environment diagram of a video call method in an embodiment.
  • the video call method is applied to a video call system.
  • the video call system includes a first terminal 102, a server 104, and a second terminal 106.
  • the first terminal 102 communicates with the server 104 through the network
  • the second terminal 106 communicates with the server 104 through the network.
  • the first terminal 102 and the second terminal 106 may be mobile phones, tablet computers, portable wearable devices, or the like.
  • the first terminal 102 is a terminal corresponding to the target member in the video call system
  • the second terminal 106 is a terminal corresponding to the counterpart member in the video call system.
  • the first terminal 102 and the second terminal 106 respectively run instant messaging applications.
  • the first terminal 102 can establish a video call link with the second terminal 106.
  • Video calls can be divided into two-person video calls and multi-person video calls according to the number of participating member IDs.
  • Multi-person video calls can be group calls.
  • the member ID is used to uniquely identify the call member, which can be numbers, letters, or symbols.
  • In a two-person video call, the second terminal 106 may be implemented by a single terminal; in a multi-person video call, the second terminal 106 may be implemented by multiple terminals.
  • The instant messaging application in the first terminal 102 can integrate a subtitle synthesis plug-in, which converts and translates the collected first voice into first texts in multiple language versions, synthesizes the different versions of the first text as subtitle content with the source video frames generated by the target member in the video call to obtain target video frames, and forwards the target video frames through the server 104 to the second terminals 106 corresponding to the other members.
  • the server 104 may be implemented as an independent server or a server cluster composed of multiple servers.
  • The terms "first", "second", and so on used in this application may describe various elements, but the elements are not limited by these terms; the terms only distinguish one element from another.
  • For example, the first terminal may be referred to as the second terminal and, similarly, the second terminal may be referred to as the first terminal. Both are terminals, but they are not the same terminal.
  • a video call method is provided. Taking the method applied to the first terminal in FIG. 1 as an example for description, the method includes the following steps:
  • S202: Collect the first voice and source video frame generated by the target member in the video call.
  • The first voice refers to the voice data of the target member collected during the video call by the audio collection component of the first terminal corresponding to the target member.
  • the audio collection component refers to the relevant hardware used in the terminal to collect audio data, such as a microphone.
  • the source video frame refers to the image information about the target member collected by the first terminal based on the image collection component, such as the camera.
  • the first terminal detects whether there is a start instruction generated for the subtitle synthesis plug-in. If the start instruction is detected, the first terminal starts the subtitle synthesis plug-in and turns on the subtitle synthesis function.
  • the first terminal has an icon for turning on the subtitle synthesis plug-in, and the target member can actively click the plug-in icon before or during the video call to turn on the subtitle synthesis function.
  • the first terminal when the first terminal detects that the target member starts the video call, the first terminal automatically calls the start interface of the subtitle synthesis plug-in to start the subtitle synthesis function.
  • The subtitle synthesis plug-in sends an image reading instruction to the image collection component and an audio reading instruction to the audio collection component, so as to read the source video frames collected by the image collection component and the first voice collected by the audio collection component.
  • Before sending the image reading instruction, the subtitle synthesis plug-in may check whether the target member has granted the image collection component permission to collect the target member's image information. If permission is not granted, the plug-in automatically substitutes a preset picture for the source video frame; for example, a preset pure-black image may subsequently be used as the source video frame.
  • Because the preset picture is set in advance, the subtitle synthesis plug-in can still carry out target video frame synthesis normally when the image collection component fails to capture a source video frame, so the counterpart members can still communicate smoothly with the target member based on the subtitle content in the target video frame.
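  • The fallback described above can be sketched as follows; the function and constant names are illustrative, not from the patent.

```python
# Placeholder for the preset picture (e.g. a pure-black image buffer).
PRESET_FRAME = "pure-black-image"

def read_source_frame(camera_granted, capture_frame):
    """Return a frame for subtitle synthesis: the live capture when
    available, otherwise the preset picture, so synthesis never stalls."""
    if not camera_granted:
        return PRESET_FRAME
    frame = capture_frame()
    # Capture may also fail at runtime; fall back in that case too.
    return frame if frame is not None else PRESET_FRAME
```

  • With this shape, the downstream synthesis code never has to special-case a missing camera: it always receives some frame to composite subtitles onto.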
  • S204: Convert the first voice according to the preset target languages associated with the counterpart members participating in the video call to obtain the first texts.
  • FIG. 3 is a schematic diagram of a language configuration page in an embodiment.
  • the first terminal can obtain the member identification of each opposing member participating in the video call, and generate the language configuration page as shown in FIG. 3 based on the member identification.
  • the target member can select the source language corresponding to the first voice to be recognized (denoted as the first target language) and the target language corresponding to the other member (denoted as the second target language) on the language configuration page.
  • For example, if Chinese is selected as the first target language and English as the second target language, the terminal converts the Chinese first voice into the corresponding English text when translating.
  • the subtitle synthesis plug-in recognizes the first voice according to the first target language, and converts the first voice into the first text corresponding to the first target language according to the recognition result.
  • The subtitle synthesis plug-in checks whether each second target language is the same as the first target language. If not, the plug-in counts the language versions among the second target languages and, for each version, translates the first text corresponding to the first target language into that second target language, obtaining the first text corresponding to each second target language.
  • the first terminal may send the language configuration information to the second terminal, so that the second terminal correspondingly displays the language configuration information.
  • the opposing member finds that the second target language set by the target member is wrong, the opposing member can simply prompt the target member through the instant messaging application.
  • the target member can trigger the target language change operation according to the prompt of the opposing member.
  • The subtitle synthesis plug-in continuously monitors the user's operation behavior; when it detects the target language change operation, it displays the language change page.
  • The target member can re-select the second target language for the counterpart member on the language change page, and the subtitle synthesis plug-in then converts the first voice according to the reselected second target language to obtain the corresponding first text.
  • By displaying the language configuration information configured by the target member on the counterpart terminal as well, the target member can change the configuration promptly when it is found to be incorrect, improving the efficiency of video calls.
  • the subtitle synthesis plug-in recognizes the first voice based on the first target language, and directly converts the recognized first voice into the corresponding first text according to the second target language.
  • the subtitle synthesis plug-in buffers the current first voice after collecting the first voice.
  • The subtitle synthesis plug-in determines the input time of the currently received first voice and checks whether a new first voice arrives within a preset duration of that input time. If so, the new first voice is also cached; if not, the cached first voice segments are spliced into one spliced first voice, which is then recognized based on the first target language.
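  • The splicing behavior can be sketched as a small buffer class. This is a hypothetical sketch: the class name, the `window` parameter, and the use of byte strings for audio are all illustrative assumptions.

```python
class VoiceBuffer:
    """Splice voice segments that arrive within `window` seconds of each
    other; when a gap exceeds the window, release the spliced utterance
    for recognition and start buffering the new one."""

    def __init__(self, window=1.0):
        self.window = window
        self.segments = []
        self.last_input = None

    def push(self, segment, now):
        # A gap longer than the window closes the current utterance.
        if self.last_input is not None and now - self.last_input > self.window:
            spliced = self.flush()
            self.segments = [segment]
            self.last_input = now
            return spliced  # ready for speech recognition
        self.segments.append(segment)
        self.last_input = now
        return None

    def flush(self):
        spliced = b"".join(self.segments)
        self.segments = []
        return spliced
```

  • Batching segments this way avoids recognizing fragments mid-sentence, at the cost of the preset waiting window of latency.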
  • the first terminal may also send the first voice and language configuration information to the server, so that the server correspondingly recognizes and translates the first voice according to the language configuration information.
  • S206: Synthesize the source video frame with the first text corresponding to each target language to obtain a target video frame for each target language.
  • The subtitle synthesis plug-in obtains the image width of the source video frame and, based on that width and the number of characters in the first text corresponding to each second target language, determines the size of the background image for each target language.
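  • One plausible way to derive the background size from the frame width and character count is sketched below. The patent does not give a formula; the per-character dimensions and padding here are illustrative assumptions.

```python
def background_size(frame_width, num_chars, char_width=16, char_height=24,
                    padding=8):
    """Estimate the subtitle background size: wrap the text to the frame
    width and stack as many lines as needed (sizes are illustrative)."""
    chars_per_line = max(1, (frame_width - 2 * padding) // char_width)
    lines = -(-num_chars // chars_per_line)  # ceiling division
    return frame_width, lines * char_height + 2 * padding
```

  • A longer translation thus simply yields a taller background strip, so subtitles in verbose languages are not truncated.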
  • the subtitle synthesis plug-in obtains a preset background image generation format, such as RGBA format, and generates a corresponding background image according to the preset format and size information.
  • the subtitle synthesis plug-in reads the text content in the first text corresponding to each target language, and adds the text content of the first text as the subtitle content to the corresponding background image to obtain the subtitle image corresponding to each target language.
  • the subtitle synthesis plug-in can uniformly adjust the subtitle image according to the preset background image color and character color.
  • the character refers to the text content of the first text displayed in the subtitle image.
  • the background color is uniformly adjusted to black, and the character color is uniformly adjusted to white.
  • The subtitle synthesis plug-in obtains the element array of the subtitle image and sets the values of the elements representing the background color to zero, removing the background color and yielding a subtitle image with a transparent background and white subtitles.
  • The element array of the subtitle image records the three primary color values and the transparency of each pixel; based on this array, the colors and transparency in the image can be adjusted dynamically.
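  • Zeroing the background elements can be illustrated with RGBA tuples. This is a sketch under the assumption that the background was first normalized to black, as the preceding passage describes.

```python
def make_background_transparent(rgba_pixels, bg_color=(0, 0, 0)):
    """Zero the alpha of background-colored pixels so only the white
    subtitle characters remain visible after overlay."""
    out = []
    for r, g, b, a in rgba_pixels:
        if (r, g, b) == bg_color:
            out.append((r, g, b, 0))  # background: fully transparent
        else:
            out.append((r, g, b, a))  # character pixels keep their alpha
    return out
```

  • Normalizing the background to a single known color first is what makes this per-pixel test reliable.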
  • FIG. 4 is a schematic diagram of a target video frame in an embodiment.
  • the subtitle synthesis plug-in converts the source video frame according to the background image format, and generates a video frame image with the same format as the background image.
  • the subtitle synthesis plug-in obtains preset synthesis location information, and performs pixel superposition of the video frame image and the subtitle image corresponding to each target language according to the synthesis location information to obtain at least one target video frame as shown in FIG. 4.
  • the developer of the subtitle synthesis plug-in can set a synthesis starting point in advance, so that the subtitle plug-in can linearly superimpose the element values corresponding to the pixels in the corresponding position of the video frame image and the subtitle image from the synthesis starting point.
  • The subtitle synthesis plug-in converts the format of the pixel-superimposed synthesized image to obtain, for each target language, a target video frame in the same format as the source video frame, and sends each target video frame to the corresponding counterpart member according to the correspondence between member IDs and second target languages.
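  • The pixel superposition from a synthesis starting point can be sketched as a standard alpha-over blend. The patent says only "linearly superimpose the element values"; the exact blend formula below is an assumption, and images are modeled as 2-D lists of (r, g, b, a) tuples for simplicity.

```python
def overlay_subtitle(frame, subtitle, origin):
    """Alpha-blend the subtitle image onto the frame, starting at the
    (row, col) synthesis starting point `origin`. Modifies `frame`."""
    oy, ox = origin
    for y, row in enumerate(subtitle):
        for x, (sr, sg, sb, sa) in enumerate(row):
            fr, fg, fb, fa = frame[oy + y][ox + x]
            alpha = sa / 255  # transparent background pixels leave the frame intact
            frame[oy + y][ox + x] = (
                int(sr * alpha + fr * (1 - alpha)),
                int(sg * alpha + fg * (1 - alpha)),
                int(sb * alpha + fb * (1 - alpha)),
                255,
            )
    return frame
```

  • Because the background alpha was zeroed in the previous step, only the character pixels actually alter the video frame.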
  • For example, the subtitle synthesis plug-in on terminal A determines, from A's language configuration operation, that the second target language corresponding to B is English and the second target language corresponding to C is Japanese.
  • At this time, the subtitle synthesis plug-in sends the target video frames with embedded English subtitles to B and the target video frames with embedded Japanese subtitles to C.
  • In the above method, the first voice generated by the target member in the video call is translated into first texts in multiple language versions according to the target language familiar to each member participating in the call.
  • Each version can be synthesized with the source video frame to form a target video frame carrying voice-translation subtitles, which is displayed on the page corresponding to that member's video call.
  • Because each member receives target video frames subtitled in the language they require, every participant can understand what the target member is saying in a familiar language without leaving the instant messaging client, which improves the efficiency of video calls and in turn saves video call link resources.
  • In addition, the recognized first text can be reused across language versions, reducing the data processing needed to synthesize the source video frame with the different versions of the first text and thereby saving terminal data processing resources.
  • The above video call method further includes: when a target language configuration operation is triggered, displaying the language configuration page; obtaining language configuration information configured through the page, which includes the candidate languages corresponding to the target member and to each counterpart member participating in the video call; and storing the target member's member ID in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's ID also exists, the server uses the candidate language associated with each member ID in the configuration information as that member's target language.
  • both the target member and the other member can trigger the target language configuration operation.
  • The terminal displays the language configuration page according to the member's operation, generates language configuration information based on the page, and sends it to the server, which stores the configuration information in association with the member ID corresponding to the sending terminal.
  • For example, when A and B are in a video call, A can set the candidate language associated with A as English and the candidate language associated with B as Chinese; B can likewise set the candidate language associated with B as Chinese and the candidate language associated with A as English. The server then stores the configuration information sent by A and by B according to their respective member IDs.
  • the server uses the candidate language corresponding to the member identifier associated with each language configuration information as the target language of the corresponding member, thereby screening multiple pieces of language configuration information to generate a unified language configuration information.
  • The server extracts the candidate language "English" associated with A's ID from the language configuration information sent by A and determines English as the target language corresponding to A; from the language configuration information sent by B, it extracts the candidate language "Chinese" associated with B's ID and determines Chinese as the target language corresponding to B.
  • When there are multiple pieces of configuration information, filtering them according to member IDs yields one unified piece of language configuration information, based on which subsequent terminals or servers can generate the corresponding texts.
  • Using the candidate language associated with each member ID in the configuration information as that member's target language improves the accuracy of the language configuration and reduces cases where, because of a configuration error, the subtitles in the target video frames received by a counterpart member are not in a language that member knows.
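  • The unification rule — each member's own self-associated candidate language wins — can be sketched as a dictionary merge. The data shapes here are assumptions for illustration.

```python
def unify_language_config(configs):
    """Merge per-member configurations: the candidate language a member
    associated with their OWN ID becomes that member's target language."""
    unified = {}
    for member_id, config in configs.items():
        if member_id in config:
            unified[member_id] = config[member_id]
    return unified

# A says: A speaks English, B speaks Chinese; B says the same from B's side.
configs = {
    "A": {"A": "English", "B": "Chinese"},
    "B": {"B": "Chinese", "A": "English"},
}
```

  • Letting each member's self-declaration take precedence avoids conflicts when members guess each other's languages incorrectly.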
  • the above-mentioned video call method further includes: sending the first voice to the server; receiving the first text returned by the server after converting the first voice according to the target language of each counterparty member.
  • The subtitle synthesis plug-in sends the first voice to the server, which recognizes and translates it according to the target member's member ID and the unified language configuration information, generates the first text corresponding to each second target language, and returns the first texts to the first terminal.
  • the first voice recognition and translation process in the above-mentioned video call method can be completed either in the first terminal or in the server.
  • For example, the first terminal can recognize and translate the first voice according to the language configuration information stored locally, or it can pull the unified language configuration information from the server and recognize and translate the first voice according to that unified information.
  • Alternatively, the server can pull the corresponding language configuration information from the first terminal and recognize and translate the first voice according to it, or the server may recognize and translate the first voice according to the unified language configuration information stored on the server.
  • the server converts the first voice to obtain the corresponding first text, which can reduce terminal resources consumed by the terminal for converting the first voice.
  • The above video call method further includes: generating a corresponding subtitle image based on each first text and caching the subtitle image. Synthesizing the source video frame with the first text corresponding to each target language then includes: every first preset duration, querying whether an updated subtitle image exists in the cache; if so, synthesizing the updated subtitle image with each source video frame generated within the second preset duration before the current time, and deleting the synthesized subtitle image from the cache. The second preset duration is less than the first preset duration.
• the first preset duration is a duration set by the developer of the subtitle synthesis plug-in according to the video frame rate of the played video. For example, when an instant messaging application plays video, the video is generally played at a rate of 30 frames per second; in this case, the developer of the subtitle synthesis plug-in can set the first preset duration to 30 milliseconds.
• the second preset duration is the interval at which the subtitle synthesis plug-in reads source video frames from the image capture component. If the second preset duration is too long, the target video frames received by the opposite member will be delayed too long; if it is too short, the opposite member will receive too few subtitle-embedded target video frames to recognize the subtitle content. It therefore needs to be set reasonably, for example to 3 seconds.
  • the image acquisition component in the terminal collects the image information of the target member in real time, and correspondingly caches the image information and the acquisition time of the target member in the image buffer area.
• the subtitle synthesis plug-in checks whether the preset subtitle buffer area contains a buffered subtitle image; if so, the subtitle synthesis plug-in clears the subtitle buffer area and caches the currently generated subtitle image in the subtitle buffer area.
• the subtitle synthesis plug-in checks whether there is an updated subtitle image in the subtitle buffer area every first preset duration. When there is an updated subtitle image, the subtitle synthesis plug-in reads from the image buffer area at least one source video frame collected by the image capture component within the second preset duration before the current time, and then deletes the read source video frames from the image buffer area. If no updated subtitle image has been stored in the subtitle buffer area within the second preset duration before the current time, the subtitle synthesis plug-in directly sends the source video frames within the second preset duration before the current time to the opposite member, and deletes the sent source video frames from the image buffer area.
• the subtitle synthesis plug-in separately synthesizes the subtitle image corresponding to each second target language with each source video frame read from the image buffer area to obtain the corresponding target video frames, and deletes the synthesized subtitle image from the subtitle buffer area.
• in this way, the latest subtitle image can be obtained in time, so that the synthesized target video frame can be sent to the opposite member in time; by synthesizing the latest subtitle image with multiple source video frames, the opposite members can recognize the subtitle content from multiple target video frames.
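The buffering and polling behavior described above can be sketched as follows. This is a minimal illustration only: the function names, the single-slot subtitle cache, and the concrete duration values are assumptions made for the sketch, not part of the disclosed implementation.

```python
FIRST_PRESET = 0.030   # polling interval in seconds (e.g. 30 ms for a 30 fps stream)
SECOND_PRESET = 3.0    # look-back window for source frames (e.g. 3 seconds)

subtitle_cache = {}    # target language -> most recently generated subtitle image
frame_buffer = []      # (capture_time, source_frame) pairs from the image capture component

def overlay(frame, subtitle_image):
    # Placeholder for the actual pixel-level composition of frame and subtitle.
    return (frame, subtitle_image)

def poll_once(now, send):
    """One iteration of the subtitle-synthesis loop, run every FIRST_PRESET seconds."""
    recent = [(t, f) for (t, f) in frame_buffer if now - t <= SECOND_PRESET]
    if subtitle_cache:
        # An updated subtitle image exists: embed it into every recent source frame.
        for lang, image in subtitle_cache.items():
            for _, frame in recent:
                send(lang, overlay(frame, image))
        subtitle_cache.clear()               # delete the synthesized subtitle images
    else:
        # No updated subtitle: forward the raw source frames directly.
        for _, frame in recent:
            send(None, frame)
    for item in recent:                      # drop the frames that were read
        frame_buffer.remove(item)
```

In this sketch, the first preset duration is how often `poll_once` would be invoked, and the second preset duration bounds which buffered source frames are still recent enough to synthesize.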
• generating the corresponding subtitle image based on each first text includes: determining the subtitle width according to the image width of the source video frame; converting the subtitle width into a character number threshold corresponding to each target language; splitting the corresponding first text into multiple sub-texts according to the character number threshold; determining the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; and adding the first text as subtitle content to the background image generated according to the subtitle width and subtitle height to obtain the subtitle image.
  • the threshold of the number of characters is the maximum number of characters that can be displayed in a single-line subtitle.
  • the subtitle synthesis plug-in determines the image width of the source video frame, and determines the subtitle width according to a preset image width ratio value. For example, if the preset subtitle width ratio is two-thirds, the subtitle synthesis plug-in determines two-thirds of the image width of the source video frame as the subtitle width.
• the terminal stores a correspondence between each target language and the width information of a single character and the spacing between characters in that language.
• the subtitle synthesis plug-in separately obtains the second target language corresponding to each first text, determines the corresponding single-character width information and inter-character spacing from the correspondence according to the language information of the second target language, and calculates the character number threshold corresponding to the second target language based on the obtained subtitle width, single-character width information and inter-character spacing; that is, the subtitle synthesis plug-in can obtain the number of characters a single-line subtitle can hold according to the subtitle width, the single-character width information and the inter-character spacing.
• the subtitle synthesis plug-in counts the characters in the first text to obtain the total number of characters, and divides the total number of characters by the character number threshold (rounding up) to obtain the number of sub-texts.
• the subtitle synthesis plug-in generates a corresponding number of sub-texts based on the number of sub-texts.
  • the subtitle synthesis plug-in reads a threshold number of characters from the first character in the first text, and stores the read characters in the sub-text.
• the subtitle synthesis plug-in deletes the read characters from the first text, continues to read characters from the first text according to the character number threshold, and stores the read characters in a sub-text that does not yet store characters, until all characters in the first text have been deleted.
• the subtitle synthesis plug-in counts the number of sub-texts corresponding to the first text, and determines the number of subtitle lines in the subtitle image according to the number of sub-texts. For example, when there are three sub-texts, the subtitle synthesis plug-in can consider that the subtitle image to be generated has three lines of subtitles; it can then calculate the subtitle height of the corresponding first text according to the preset height of a single subtitle line and the total number of subtitle lines.
  • the subtitle synthesis plug-in generates a background image of a corresponding size according to the subtitle width and the subtitle height, and adds the characters in each sub-text to the background image as subtitle content.
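The width, character-threshold, and splitting steps above can be sketched as a single layout routine. The per-language character metrics, line height, and width ratio below are illustrative assumptions, not values taken from the disclosure.

```python
# Hypothetical per-language metrics: (single-character width, inter-character spacing), in pixels.
CHAR_METRICS = {"en": (12, 2), "zh": (20, 2)}

LINE_HEIGHT = 24          # assumed preset height of a single subtitle line, in pixels
WIDTH_RATIO = 2 / 3       # assumed preset subtitle-width ratio

def subtitle_layout(text, lang, frame_width):
    """Compute subtitle geometry and split `text` into single-line sub-texts."""
    subtitle_width = int(frame_width * WIDTH_RATIO)
    char_w, spacing = CHAR_METRICS[lang]
    # Character number threshold: how many characters fit on one subtitle line.
    char_threshold = subtitle_width // (char_w + spacing)
    # Split the first text into sub-texts of at most `char_threshold` characters.
    sub_texts = [text[i:i + char_threshold]
                 for i in range(0, len(text), char_threshold)]
    # Background-image height grows with the number of sub-texts (subtitle lines).
    subtitle_height = LINE_HEIGHT * len(sub_texts)
    return subtitle_width, subtitle_height, sub_texts
```

For example, with a 720-pixel-wide source frame and the assumed English metrics, an 80-character first text would be split into three sub-texts of at most 34 characters each.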
• by determining the subtitle width according to the image width of the source video frame, the probability of the subtitle exceeding the video image because the generated subtitle image is wider than the source video frame can be reduced; determining the height of the background image according to the number of sub-texts reduces the generation of unnecessary background image area.
  • the above-mentioned video call method further includes: collecting the second voice produced by the opposite member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
  • the second voice corresponding to the opposite member can be sent to the first terminal through the instant messaging application on the second terminal.
  • the instant messaging application in the first terminal receives the second voice, and sends the second voice to the audio playback component.
  • the subtitle synthesis plug-in in the first terminal monitors whether the audio playback component receives the second voice.
• the subtitle synthesis plug-in obtains the second voice, and recognizes and translates the second voice according to the first target language corresponding to the target member in the language configuration information to obtain the second text.
  • the subtitle synthesis plug-in correspondingly displays the generated second text on the screen of the first terminal.
  • Fig. 5 is a schematic diagram of a pop-up window displaying the second text in an embodiment.
• the first terminal may display the second text in the form of a pop-up window, or may display the second text in the form of a prompt message as shown in FIG. 6, which is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment.
  • the target member can independently select a suitable display form based on actual needs, which greatly improves the user experience.
• the first terminal determines whether the target member actively closes the second text within a preset duration from the time when the second text is displayed. If the second text is not actively closed, the first terminal may generate a closing instruction for the second text and automatically close the displayed second text based on the closing instruction, so that when the target member has finished reading the second text, the second text is closed automatically, thereby saving the display resources consumed by the terminal to display the second text.
  • the target member can manually close the displayed second text, such as clicking the close control to close the second text, or closing the second text according to a sliding operation on the screen.
• when the target member minimizes the instant messaging application, the first terminal can still display the second text in the form of a pop-up window or a prompt message.
• the second text is displayed in the form of a pop-up window or a prompt message, so that the display of the second text can be separated from the video call page; thus, when the instant messaging application is switched to run in the background, the target member can still communicate smoothly with the opposite members according to the content of the second text.
• the second voice obtained from the audio playback component may be a mixture of the voices of multiple opposite members.
• the subtitle synthesis plug-in extracts timbre information from the second voice, divides the second voice into multiple second sub-voices according to the timbre information, and converts the multiple second sub-voices based on the target language corresponding to the target member to obtain multiple second texts.
  • the first terminal respectively displays a plurality of second texts correspondingly.
• the second voice is divided according to timbre so that the subtitle synthesis plug-in can distinguish the second sub-voices of different opposite members; thus, in a multi-person video call scene, displaying multiple second texts can help the target member distinguish the different information expressed by different opposite members, further improving the communication efficiency of multi-person video calls.
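The timbre-based splitting can be sketched as grouping audio segments by a speaker label. Here `estimate_timbre` is a hypothetical stand-in for a real speaker-embedding or diarization model, which the sketch does not implement.

```python
def estimate_timbre(segment):
    # Hypothetical: a real system would derive a speaker label from the
    # timbre (spectral) characteristics of the audio segment.
    return segment["speaker"]

def split_by_timbre(segments):
    """Group audio segments of the mixed second voice into per-speaker sub-voices."""
    sub_voices = {}
    for seg in segments:
        sub_voices.setdefault(estimate_timbre(seg), []).append(seg)
    return sub_voices
```

Each per-speaker sub-voice would then be converted separately into a second text, allowing one text per opposite member to be displayed.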
• in this way, the target member can understand the content expressed by the opposite member even when the subtitle synthesis plug-in is not installed in the second terminal, so that the video call can proceed smoothly.
• the page of the video call includes video frame display areas corresponding to the target member and each opposite member; the above-mentioned video call method further includes: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target language of the target member, recorded as the first target video frame; obtaining a second target video frame from the opposite member, where the second target video frame is obtained by converting the second voice generated by the opposite member during the video call into the second text according to the target language corresponding to the target member and synthesizing the converted second text with the source video frame generated by the opposite member during the video call; and displaying the second target video frame in the video frame display area corresponding to the opposite member.
• the subtitle synthesis plug-in can convert the first voice according to the first target language corresponding to the target member to obtain the corresponding first text, and synthesize the first text with the source video frame to obtain the first target video frame corresponding to the first target language of the target member.
• the second terminal may convert the second voice generated by the opposite member during the video call into the second text according to the target language corresponding to the target member, synthesize the second text with the source video frame generated by the opposite member during the video call to obtain the second target video frame, and then send the synthesized second target video frame to the first terminal.
• the first terminal obtains the page size of the video call page, and divides video frame display areas corresponding to the target member and each opposite member according to the page size.
• for example, the first terminal counts the total number of members participating in the video call, divides the video call page into multiple video frame display areas according to the total number of members, and agrees that the first divided display area is the video frame display area corresponding to the target member.
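Such a division can be sketched as a simple near-square grid over the page, with the first cell reserved for the target member by convention. The grid strategy itself is an assumption: the disclosure only requires some division of the page according to the member count.

```python
import math

def divide_display_areas(page_w, page_h, member_count):
    """Divide the video call page into a near-square grid of display areas.

    Returns a list of (x, y, w, h) rectangles; by convention the first
    rectangle is the target member's own video frame display area.
    """
    cols = math.ceil(math.sqrt(member_count))
    rows = math.ceil(member_count / cols)
    cell_w, cell_h = page_w // cols, page_h // rows
    return [((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
            for i in range(member_count)]
```

For a three-member call on a 1080x1920 page, this yields a 2x2 grid with three occupied cells, the first of which belongs to the target member.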
  • FIG. 7 is a schematic diagram of a video frame display area according to an embodiment.
• the first terminal separately obtains the area sizes of the video frame display areas corresponding to the target member and the opposite member, and correspondingly scales the first target video frame and the second target video frame according to the area sizes, so that the video frame display areas shown in FIG. 7 can completely display the first target video frame and the second target video frame.
  • the target member can change the size of the video frame display area according to his own needs. For example, when the target member has a video call with B and C, the target member can enlarge the video frame display area corresponding to B. The video frame display area corresponding to the target member and the video frame display area corresponding to C will be correspondingly reduced, so that the entire video call is more in line with the actual needs of the target member.
• when the target member finds that the subtitles in the displayed first target video frame are incorrect, the target member can calibrate the wrong characters in the subtitles. In this case, the subtitle synthesis plug-in generates an error correction page according to the target member's calibration operation; based on the correction page, the target member can input the display character corresponding to the wrong character.
• the subtitle synthesis plug-in stores the wrong characters and the corresponding characters to be displayed in a character library. When the subtitle synthesis plug-in recognizes a wrong character again, it can choose whether to correct it according to the characters to be displayed in the character library.
• in this way, the target user can check in real time whether the subtitle content displayed in the first target video frame is correct, so that a wrong character can be calibrated in time when it is found, thereby improving the accuracy of the subtitle synthesis plug-in's speech translation.
• the above-mentioned video call method further includes: collecting the second voice generated by the opposite member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining the display style of the obtained second text according to the size of the video frame display area corresponding to each opposite member; and displaying the obtained second text in a pop-up window of the video call page according to the display style.
  • the display style of the second text includes character transparency, character size, and character color in the second text.
  • the subtitle synthesis plug-in obtains the second voice generated during the video call from the audio playback component, and converts the second voice according to the target language corresponding to the target member to obtain the second text.
• the subtitle synthesis plug-in obtains the size of the video frame display area corresponding to each opposite member. When the size of the video frame display area corresponding to each opposite member is less than an area threshold, it can be considered that the target member cannot clearly identify the subtitle content displayed in the video frame display area; at this time, the subtitle synthesis plug-in correspondingly reduces the character transparency, increases the character size, and changes the character color to a more eye-catching color based on a preset configuration file.
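The style decision can be sketched as a threshold check on the display-area size. The area threshold and the concrete style values below are placeholder assumptions, not values from the disclosure.

```python
AREA_THRESHOLD = 200 * 150   # assumed minimum legible display-area size, in pixels

def second_text_style(area_w, area_h):
    """Pick a display style for the second text based on the display-area size."""
    if area_w * area_h < AREA_THRESHOLD:
        # Area too small to read embedded subtitles: make the pop-up text prominent
        # (lower transparency, larger characters, eye-catching color).
        return {"transparency": 0.1, "font_size": 20, "color": "#FF3B30"}
    # Area large enough: keep the second text unobtrusive to avoid interference.
    return {"transparency": 0.6, "font_size": 14, "color": "#FFFFFF"}
```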
  • the subtitle synthesis plug-in can generate a style adjustment control in the terminal, and based on the style adjustment control, the target member can correspondingly adjust the style of the second text.
  • the target member can independently adjust the display style of the second text, thereby improving the user experience.
• the style of the second text is adjusted in real time according to the size of the video frame display area corresponding to the opposite member. This not only reduces cases in which the target member cannot clearly identify the subtitle content because the video frame display area is too small, but also, when the video frame display area is large enough, reduces the interference caused to the target member by repeatedly presenting the opposite member's voice information, by lowering the prominence of the second text.
• the terminal includes an audio collection component and an audio playback component; the above-mentioned video call method further includes: collecting the first voice based on the audio collection component, and collecting the second voice based on the audio playback component.
• the audio collection component in the first terminal can record the first voice of the target member in real time and transmit the recorded first voice to the subtitle synthesis plug-in as a voice stream to generate the corresponding first text.
  • the audio collection component in the second terminal can also collect the second voice of the opposite member in real time, and send the second voice to the first terminal through the instant messaging application.
  • the instant messaging application in the first terminal receives the second voice, and sends the second voice to the audio playback component.
• the subtitle synthesis plug-in in the first terminal monitors whether the audio playback component receives the second voice. When the audio playback component receives the second voice, the subtitle synthesis plug-in obtains the second voice, and recognizes and translates the second voice according to the first target language corresponding to the target member in the language configuration information to obtain the second text.
• in this way, the subtitle synthesis plug-in can clearly distinguish between the voice generated by the target member and the voice generated by the opposite member, so that the first text and the second text can subsequently be generated corresponding to the voices of the target member and the opposite member respectively.
• although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages; these sub-steps or stages are not necessarily executed at the same time, may be executed at different times, and are not necessarily executed sequentially; they may be performed in turn or alternately with at least part of the other steps or the sub-steps or stages of other steps.
  • a video call device 800 including: a first text generation module 802, a target video frame synthesis module 804, and a page display module 806, wherein:
• the first text generation module 802 is used to collect the first voice and source video frames generated by the target member in the video call, and convert the first voice according to the preset target languages pointed to by the opposite members participating in the video call to obtain the first text.
  • the target video frame synthesis module 804 is configured to synthesize the source video frame with the first text corresponding to each target language to obtain the target video frame corresponding to each target language.
  • the page display module 806 is configured to send the obtained target video frames of each target language to the corresponding counterparty member.
  • the above-mentioned video call device 800 further includes a language configuration module 808, which is used to display a language configuration page when the configuration operation of the target language is triggered; to obtain the language configured based on the language configuration page Configuration information; the language configuration information includes the candidate languages corresponding to the target member and the other member participating in the video call; the member ID and language configuration information of the target member are associated and stored to the server, so that the server has the language type associated with the member ID of the other member When configuring information, the candidate language corresponding to the member identifier associated with each language configuration information is used as the target language of the corresponding member.
  • the language configuration module 808 is further configured to send the first voice to the server; and receive the first text returned by the server after converting the first voice according to the target language of each member of the other party.
  • the target video frame synthesis module 804 is further configured to generate a corresponding subtitle image based on each type of first text, and cache the subtitle image; query whether there is an updated subtitle image in the cache every first preset duration; If yes, synthesize the updated subtitle image with each source video frame generated by the target member within the second preset duration before the current time, and delete the synthesized subtitle image from the cache; the second preset duration is less than the first preset duration.
  • the target video frame synthesis module 804 is further configured to determine the subtitle width according to the image width of the source video frame; convert the subtitle width into a threshold value for the number of characters corresponding to each target language; Split a text into multiple sub-texts; determine the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; add the first text as subtitle content to the background image generated according to the subtitle width and subtitle height to obtain subtitles image.
  • the video call device 800 further includes a second text generation module 810, which is used to collect the second voice generated by the opposite member during the video call; obtain the second voice obtained by converting the second voice according to the target language corresponding to the target member. Second text; display the second text.
  • the video call device 800 further includes a video frame display area determining module 812, configured to display the synthesized target video frame corresponding to the target language of the target member in the video frame display area corresponding to the target member, which is recorded as the first The target video frame; the second target video frame is obtained from the other member; the second target video frame is based on the target member’s corresponding target language, the second voice generated by the other member during the video call is converted into the second text, and the result is obtained based on the conversion The second text and the source video frame generated by the opposite member during the video call are synthesized; and the second target video frame is displayed in the video frame display area corresponding to the opposite member.
  • the video frame display area determination module 812 is also used to collect the second voice generated by the opposite member during the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; The size of the video frame display area corresponding to each member of the other party determines the display style of the obtained second text; the obtained second text is displayed in the pop-up window of the video call page according to the display style.
  • the video call device 800 further includes a voice acquisition module 814, configured to collect the first voice based on the audio collection component, and collect the second voice based on the audio playback component.
  • the various modules in the above-mentioned video call device may be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a first terminal, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, a display screen, an audio collection device, an audio playback device, an image collection device, and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a video call method.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, or it can be a control, trackball or touchpad set on the housing of the computer equipment , It can also be an external keyboard, touchpad, or mouse.
  • FIG. 10 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
• the specific computer device may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
  • a computer device including a memory, a processor, and a computer program stored in the memory and running on the processor.
• when the processor executes the computer program, it implements: collecting the first voice and source video frames generated by the target member during a video call; converting the first voice according to the preset target languages pointed to by the opposite members participating in the video call to obtain the first text; synthesizing the source video frame with the first text corresponding to each target language respectively to obtain the target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding opposite member.
• when the processor executes the computer program, it also implements: when the configuration operation of the target language is triggered, displaying the language configuration page; obtaining the language configuration information configured based on the language configuration page, where the language configuration information includes the candidate languages corresponding to the target member and the opposite members participating in the video call; and associating and storing the member ID and language configuration information of the target member to the server, so that when there is language configuration information associated with the member IDs of the opposite members, the server uses the candidate language corresponding to the member ID associated with each piece of language configuration information as the target language of the corresponding member.
• when the processor executes the computer program, it further implements: sending the first voice to the server; and receiving the first text returned by the server, which is obtained by converting the first voice according to the target language of each opposite member.
• when the processor executes the computer program, it further implements: generating a corresponding subtitle image based on each first text, and buffering the subtitle image; synthesizing the source video frame with the first text corresponding to each target language includes: querying whether there is an updated subtitle image in the cache every first preset duration; if so, synthesizing the updated subtitle image with each source video frame generated by the target member within the second preset duration before the current time, and deleting the synthesized subtitle image from the cache; the second preset duration is less than the first preset duration.
  • When the processor executes the computer program, the following is also implemented: determining a subtitle width according to the image width of the source video frame; converting the subtitle width into a character-count threshold corresponding to each target language; splitting the first text into multiple sub-texts according to the threshold; determining the subtitle height of the corresponding first text according to the number of sub-texts of that first text; and adding the first text as subtitle content to a background image generated according to the subtitle width and subtitle height to obtain a subtitle image.
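A minimal sketch of this layout derivation (frame width → character-count threshold → sub-texts → subtitle height), assuming a fixed per-character width and line height; `build_subtitle_layout` and both constants are hypothetical:

```python
def build_subtitle_layout(image_width, text, char_width=16, line_height=24):
    """Derive subtitle geometry from the source frame width:
    width -> character-count threshold -> sub-texts -> subtitle height."""
    subtitle_width = image_width                      # subtitle spans the frame width
    max_chars = max(1, subtitle_width // char_width)  # character-count threshold
    # Split the first text into sub-texts of at most max_chars characters each.
    sub_texts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    subtitle_height = len(sub_texts) * line_height    # height from sub-text count
    return subtitle_width, subtitle_height, sub_texts
```

In practice the character-count threshold would differ per target language (wide CJK glyphs versus narrow Latin ones), which is why the claim derives it per language.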
  • When the processor executes the computer program, the following is also implemented: collecting the second voice produced by a counterpart member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
  • The video call page includes video frame display areas respectively corresponding to the target member and each counterpart member. When the processor executes the computer program, the following is also implemented: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, recorded as the first target video frame; obtaining a second target video frame from the counterpart member, where the second target video frame is obtained by converting the second voice generated by the counterpart member during the video call into a second text according to the target language corresponding to the target member, and synthesizing the converted second text with the source video frames generated by the counterpart member during the video call; and displaying the second target video frame in the video frame display area corresponding to the counterpart member.
  • When the processor executes the computer program, the following is also implemented: collecting the second voice produced by a counterpart member during the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining a display style for the obtained second text according to the size of the video frame display area corresponding to the counterpart member; and displaying the obtained second text in a pop-up window on the video call page according to the display style.
  • The terminal includes an audio collection component and an audio playback component; when the processor executes the computer program, the following steps are also implemented: the first voice is collected based on the audio collection component, and the second voice is played based on the audio playback component.
  • A computer-readable storage medium is provided, on which a computer program is stored.
  • When the computer program is executed by a processor, the following is implemented: collecting the first voice and the source video frames generated by the target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

The present application relates to a video call method and apparatus, a computer device and a storage medium. The method comprises: collecting first speech and source video frames generated by a target member during a video call; converting the first speech according to the preset target languages respectively indicated by the counterpart members participating in the video call so as to obtain first text; compositing the source video frames with the first text corresponding to each target language so as to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.

Description

Video call method, device, computer equipment and storage medium
Cross-reference to related applications
This application claims priority to the Chinese application filed on September 27, 2019, titled "Video call method, device, computer equipment and storage medium", with application number 2019109251949, the disclosure of which is incorporated into this application by reference in its entirety.
Technical field
This application relates to a video call method, device, computer equipment and storage medium.
Background
With the development of globalization, there are more and more exchanges between countries. At present, users can communicate in real time by way of video calls based on an instant messaging client on a terminal. However, because languages differ between countries, a user who does not understand the other party's language may be unable to communicate smoothly in a video call due to the language barrier.
When making a video call across different languages, call members can only leave the instant messaging client during the call and use a third-party translation device to translate the voice data from the other members; only after hearing the translation result fed back by the third-party translation device can they make a voice reply based on it.
Summary of the invention
This application provides a video call method. The method includes: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a video call method, including: obtaining the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the other members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the other members.
This application also provides a video call device. The device includes: a first text generation module, configured to collect the first voice and the source video frames generated by a target member during a video call, and to convert the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; a target video frame synthesis module, configured to synthesize the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and a page display module, configured to send the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor implements the following steps when executing the computer program: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
This application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: collecting the first voice and the source video frames generated by a target member during a video call; converting the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts; synthesizing the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and sending the obtained target video frame of each target language to the corresponding counterpart member.
The details of one or more embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features, objects and advantages of the present invention will become apparent from the description, the drawings and the claims.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is an application scenario diagram of the video call method in an embodiment;
Fig. 2 is a schematic flowchart of the video call method in an embodiment;
Fig. 3 is a schematic diagram of a language configuration page in an embodiment;
Fig. 4 is a schematic diagram of a target video frame in an embodiment;
Fig. 5 is a schematic diagram of a pop-up window displaying the second text in an embodiment;
Fig. 6 is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment;
Fig. 7 is a schematic diagram of a video frame display area in an embodiment;
Fig. 8 is a structural block diagram of a video call device in an embodiment;
Fig. 9 is a structural block diagram of a video call device in another embodiment;
Fig. 10 is a diagram of the internal structure of a computer device in an embodiment.
Detailed description
The approach described in the background not only relies on a third-party translation device, which makes communication costly, but also requires constant switching between the terminal and the third-party translation device, which is cumbersome to operate. In addition, waiting for the translation results returned by the third-party translation device causes multiple pauses during the video call, prolongs the duration of the entire call, and wastes video call link resources.
In order to make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it.
Fig. 1 is an application environment diagram of the video call method in an embodiment. Referring to Fig. 1, the video call method is applied to a video call system. The video call system includes a first terminal 102, a server 104 and a second terminal 106. The first terminal 102 communicates with the server 104 through a network, and the second terminal 106 communicates with the server 104 through a network. The first terminal 102 and the second terminal 106 may be mobile phones, tablet computers, portable wearable devices, or the like. The first terminal 102 is the terminal corresponding to the target member in the video call system, and the second terminal 106 is the terminal corresponding to a counterpart member. The first terminal 102 and the second terminal 106 each run an instant messaging application, based on which the first terminal 102 can establish a video call link with the second terminal 106. Video calls can be divided into two-person video calls and multi-person video calls according to the number of participating member identifiers: a call involving only two member identifiers is a two-person video call, and a call involving more than two member identifiers is a multi-person video call. A multi-person video call may be a group call.
A member identifier is used to uniquely identify a call member and may consist of numbers, letters or symbols. In a two-person video call, the second terminal 106 may be implemented by a single terminal; in a multi-person video call, the second terminal 106 may be implemented by multiple terminals. The instant messaging application in the first terminal 102 may integrate a subtitle synthesis plug-in, which converts the collected first voice into text and translates it into first texts in multiple language versions, synthesizes the different versions of the first text, as subtitle content, with the source video frames generated by the target member during the video call to obtain target video frames, and forwards the target video frames through the server 104 to the second terminals 106 corresponding to the counterpart members. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
It can be understood that the terms "first", "second", etc. used in this application may be used herein to describe various elements, but these elements are not limited by these terms; the terms are only used to distinguish one element from another. For example, without departing from the scope of the present application, the first terminal may be referred to as the second terminal, and similarly the second terminal may be referred to as the first terminal. The first terminal and the second terminal are both terminals, but they are not the same terminal.
In one embodiment, as shown in Fig. 2, a video call method is provided. Taking the method applied to the first terminal in Fig. 1 as an example, it includes the following steps:
S202: Collect the first voice and the source video frames generated by the target member during the video call.
The first voice refers to the voice data of the target member collected, based on an audio collection component, by the first terminal corresponding to the target member during the video call. The audio collection component refers to the hardware in the terminal used to collect audio data, such as a microphone. The source video frames refer to the image information about the target member collected by the first terminal based on an image collection component, such as a camera.
Specifically, when the target member makes a video call with other members, the first terminal detects whether a start instruction for the subtitle synthesis plug-in has been issued; if the start instruction is detected, the first terminal starts the subtitle synthesis plug-in and turns on the subtitle synthesis function.
In one embodiment, the first terminal has an icon for turning on the subtitle synthesis plug-in, and the target member can actively tap the plug-in icon before or during the video call to turn on the subtitle synthesis function.
In one embodiment, when the first terminal detects that the target member has started a video call, the first terminal automatically calls the start interface of the subtitle synthesis plug-in to start the subtitle synthesis function.
Further, the subtitle synthesis plug-in sends an image reading instruction to the image collection component and an audio reading instruction to the audio collection component, so as to read the source video frames collected by the image collection component and the first voice collected by the audio collection component.
In one embodiment, before sending the image reading instruction to the image collection component, the subtitle synthesis plug-in may determine whether the target member has granted the image collection component permission to collect the target member's image information. If the permission has not been granted, the subtitle synthesis plug-in automatically replaces the source video frame with a preset picture. For example, when the target member has not granted the corresponding collection permission, the subtitle synthesis plug-in may subsequently use a preset pure black image as the source video frame.
In the above embodiment, by setting a preset picture in advance, even when the image collection component fails to collect source video frames, the subtitle synthesis plug-in can still perform the target video frame synthesis process normally based on the preset picture, so that the counterpart members can still communicate smoothly with the target member based on the subtitle content in the target video frames.
S204: Convert the first voice according to the preset target languages respectively indicated by the counterpart members participating in the video call to obtain first texts.
Specifically, Fig. 3 is a schematic diagram of a language configuration page in an embodiment. When the subtitle synthesis function is started, the first terminal can obtain the member identifier of each counterpart member participating in the video call, and generate the language configuration page shown in Fig. 3 based on the member identifiers. On this page, the target member can select the source language corresponding to the first voice to be recognized (recorded as the first target language) and the target language corresponding to each counterpart member (recorded as the second target language). For example, if Chinese is selected as the first target language and English as the second target language, the terminal converts the first voice in Chinese into the corresponding English text when translating.
Further, the subtitle synthesis plug-in recognizes the first voice according to the first target language, and converts the first voice into the first text corresponding to the first target language according to the recognition result. The subtitle synthesis plug-in then checks whether each second target language is the same as the first target language; if not, the subtitle synthesis plug-in counts the distinct language versions among the second target languages and, for each distinct second target language, translates the first text corresponding to the first target language to obtain the first text corresponding to that second target language.
In one embodiment, after setting the corresponding target language for each counterpart member, the first terminal may send the language configuration information to the second terminal, so that the second terminal displays the language configuration information correspondingly. When a counterpart member finds that the second target language set by the target member is wrong, the counterpart member can simply prompt the target member through the instant messaging application; the target member can then trigger a target language change operation according to the prompt. The subtitle synthesis plug-in continuously monitors the user's operations; when the target language change operation is triggered, it displays a language change page, on which the target member can reselect the second target language corresponding to each counterpart member. The subtitle synthesis plug-in then converts the first voice according to the reselected second target languages to obtain the corresponding first texts.
In the above embodiment, by displaying the language configuration information configured by the target member on the counterpart's terminal, the target member can correct the language configuration information in time when an error is found, thereby improving the efficiency of the video call.
In one embodiment, the subtitle synthesis plug-in recognizes the first voice based on the first target language, and directly converts the recognized first voice into the corresponding first text according to the second target language.
In one embodiment, after collecting a segment of the first voice, the subtitle synthesis plug-in caches the current segment. The plug-in records the input time at which the current segment was received, and determines whether a new segment is received within a preset duration counted from that input time. If so, it caches the new segment as well; if not, it splices the at least one cached segment to obtain the spliced first voice, and recognizes the spliced first voice based on the first target language.
By judging whether new speech input is received within the preset duration, the plug-in can determine whether the target member has finished the current round of voice input, so that translation is performed only after the round is complete, making the sentences in the first text as complete as possible.
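One way to read this buffering rule is as a debounce on incoming speech segments: keep caching while new input keeps arriving, and splice and hand off for recognition once the preset duration passes with no new input. The sketch below is an assumed implementation; `VoiceBuffer` and the one-second default are hypothetical:

```python
import time

class VoiceBuffer:
    """Buffers incoming voice segments; if no new segment arrives within
    the preset duration, the buffered segments are spliced for recognition."""

    def __init__(self, preset=1.0):
        self.preset = preset      # idle duration that ends a round of input
        self.segments = []        # cached voice segments for this round
        self.last_input = None    # timestamp of the most recent segment

    def add(self, segment, now=None):
        self.segments.append(segment)
        self.last_input = now if now is not None else time.time()

    def flush_if_idle(self, now=None):
        """Return the spliced voice once the preset duration has elapsed
        since the last input; otherwise keep buffering and return None."""
        now = now if now is not None else time.time()
        if self.segments and now - self.last_input >= self.preset:
            spliced = b"".join(self.segments)
            self.segments = []
            return spliced
        return None
```

A caller would poll `flush_if_idle` periodically and pass any returned bytes to the recognizer for the first target language.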
In one embodiment, the first terminal may also send the first voice and the language configuration information to the server, so that the server recognizes and translates the first voice according to the language configuration information.
S206: Synthesize the source video frames with the first text corresponding to each target language to obtain a target video frame corresponding to each target language.
S208: Send the obtained target video frame of each target language to the corresponding counterpart member.
Specifically, after the first terminal obtains the source video frames and the first text corresponding to each second target language, the subtitle synthesis plug-in obtains the image width of the source video frame, and determines the size of the background image corresponding to each target language based on that image width and the number of characters in the first text corresponding to that second target language. The subtitle synthesis plug-in obtains a preset background image generation format, such as the RGBA format, and generates the corresponding background image according to the preset format and the size information. The subtitle synthesis plug-in then reads the text content of the first text corresponding to each target language, and adds it as subtitle content to the corresponding background image to obtain the subtitle image corresponding to each target language.
Further, the subtitle synthesis plug-in can uniformly adjust the subtitle image according to a preset background color and character color, where "characters" refers to the text content of the first text displayed in the subtitle image. For example, the background color may be uniformly adjusted to black and the character color to white. The subtitle synthesis plug-in then obtains the element array of the subtitle image and sets the values of the elements representing the background color to zero, so as to remove the background color and obtain a subtitle image with a transparent background and white subtitles. The element array of a subtitle image is a string that records the three primary color values and the transparency of each pixel in the image; based on the element array, the primary colors and transparency of the image can be adjusted dynamically.
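The background-removal step can be illustrated on a toy RGBA pixel list: every pixel matching the preset background color has its values zeroed, which makes it fully transparent, while the white subtitle characters are kept. This is a simplified sketch (a real implementation would operate on the flat element array of the image), and `make_transparent` is a hypothetical name:

```python
def make_transparent(pixels, background=(0, 0, 0)):
    """Zero out the background-colored elements of an RGBA pixel list so
    only the subtitle characters remain visible."""
    out = []
    for (r, g, b, a) in pixels:
        if (r, g, b) == background:
            out.append((0, 0, 0, 0))   # fully transparent background pixel
        else:
            out.append((r, g, b, a))   # keep subtitle character pixels
    return out
```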
Further, Fig. 4 is a schematic diagram of a target video frame in an embodiment. The subtitle synthesis plug-in converts the source video frame according to the background image format to generate a video frame image in the same format as the background image. The subtitle synthesis plug-in obtains preset synthesis position information, and superimposes, pixel by pixel, the video frame image with the subtitle image corresponding to each target language according to the synthesis position information, obtaining at least one target video frame as shown in Fig. 4. For example, the developer of the subtitle synthesis plug-in can preset a synthesis starting point, so that the plug-in can, starting from that point, linearly superimpose the element values corresponding to the pixels at the matching positions of the video frame image and the subtitle image.
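The pixel superposition from a preset synthesis starting point might look like the following toy sketch, where both images are nested lists of RGBA tuples. The simple overwrite of opaque subtitle pixels stands in for the linear superposition of element values described above, and `overlay_subtitle` is a hypothetical name:

```python
def overlay_subtitle(frame, subtitle, origin):
    """Superimpose an RGBA subtitle image onto a video frame image at a
    preset synthesis position. Transparent subtitle pixels leave the
    underlying frame pixels untouched."""
    ox, oy = origin
    for y, row in enumerate(subtitle):
        for x, (r, g, b, a) in enumerate(row):
            if a > 0:                       # only opaque subtitle pixels
                frame[oy + y][ox + x] = (r, g, b, a)
    return frame
```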
Further, the subtitle synthesis plug-in converts the format of the pixel-superimposed composite image to obtain, for each target language, a target video frame in the same format as the source video frame, and sends each target video frame to the corresponding counterpart member according to the correspondence between member identifiers and second target languages. For example, when A makes a video call with B and C, the subtitle synthesis plug-in on A's terminal determines, according to A's language configuration operation, that the second target language corresponding to B is English and the second target language corresponding to C is Japanese; the subtitle synthesis plug-in then sends the target video frames with embedded English subtitles to B and the target video frames with embedded Japanese subtitles to C.
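The A/B/C routing in this example reduces to a lookup from the member-to-language mapping into the per-language frames; a minimal assumed sketch (`route_target_frames` is a hypothetical name):

```python
def route_target_frames(frames_by_lang, lang_by_member):
    """Send each synthesized target video frame to the members whose
    configured second target language matches it (e.g. English -> B,
    Japanese -> C in the example above)."""
    return {member: frames_by_lang[lang]
            for member, lang in lang_by_member.items()}
```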
In the above video call method, the first voice generated by the target member during the video call is translated into first texts in multiple language versions according to the target language familiar to each member participating in the call. After the different versions of the first text are synthesized, as voice translation subtitles, with the source video frames generated by the target member, target video frames carrying the voice translation subtitles are formed. By displaying the target video frame on the target member's video call page and sending each counterpart member the target video frames carrying subtitles in that member's required language, every member participating in the call can understand what the target member is saying in a familiar language without leaving the instant messaging client, which improves the efficiency of the video call and in turn saves video call link resources.
In addition, since the first voice is translated into one version of the first text per target language rather than per call member, members who use the same target language effectively reuse the same first text, reducing the amount of data processing needed to synthesize the source video frames with the different versions of the first text and thereby saving the terminal's data processing resources.
In one embodiment, the above video call method further includes: displaying a language configuration page when a target-language configuration operation is triggered; acquiring the language configuration information set on that page, the language configuration information including the candidate languages for the target member and for each counterpart member participating in the video call; and storing the target member's member ID in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's ID also exists, the server takes the candidate language corresponding to the member ID associated with each piece of language configuration information as that member's target language.
Specifically, when the subtitle synthesis plug-in is installed on both the first terminal and the second terminal, both the target member and the counterpart member can trigger the target-language configuration operation. The terminal then displays the language configuration page in response to the member's operation, generates language configuration information from the page, and sends it to the server, which stores the configuration information in association with the member ID of the sending terminal. For example, when A and B are in a video call, A can set the candidate language associated with A to English and the candidate language associated with B to Chinese; B can likewise set B's candidate language to Chinese and A's to English. The server then stores the configuration information sent by A and by B under their respective member IDs.
Further, the server takes the candidate language corresponding to the member ID associated with each piece of language configuration information as that member's target language, thereby filtering the multiple pieces of configuration information into a single unified language configuration. In the example above, the server extracts the candidate language "English" associated with A's ID from the configuration information sent by A and determines "English" as A's target language, and extracts the candidate language "Chinese" associated with B's ID from the configuration information sent by B and determines "Chinese" as B's target language.
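The server-side filtering just described can be sketched as follows; the data shapes, and the rule that each member's own submission is authoritative for that member, are assumptions made for illustration:

```python
def unify_language_configs(configs_by_sender):
    """Build one unified {member_id: target_language} mapping.

    configs_by_sender: {sender_id: {member_id: candidate_language}} — each
    sender's submitted language configuration information.
    """
    unified = {}
    for sender_id, config in configs_by_sender.items():
        # The candidate language a member associated with themselves is taken
        # as that member's target language.
        if sender_id in config:
            unified[sender_id] = config[sender_id]
    return unified

# A configured {A: English, B: Chinese}; B configured {B: Chinese, A: English}.
configs = {"A": {"A": "en", "B": "zh"}, "B": {"B": "zh", "A": "en"}}
print(unify_language_configs(configs))
```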
In the above embodiment, when multiple pieces of configuration information exist, filtering them by member ID yields a single unified language configuration, so that subsequent terminals or servers can generate the corresponding text from it. Taking the candidate language corresponding to the member ID associated with each piece of configuration information as that member's target language improves the accuracy of the language configuration information and reduces cases where, due to erroneous configuration, a counterpart member receives target video frames whose subtitle language is not one they are familiar with.
In one embodiment, the above video call method further includes: sending the first voice to the server; and receiving the first text returned by the server, obtained by converting the first voice according to each counterpart member's target language.
Specifically, after acquiring the first voice, the subtitle synthesis plug-in sends it to the server, so that the server recognizes and translates the first voice according to the target member's member ID and the unified language configuration information, generates the first text corresponding to each second target language, and returns the first text to the first terminal.
It is easy to understand that the recognition and translation of the first voice in the above video call method can be completed either on the first terminal or on the server. When performed on the first terminal, the terminal can recognize and translate the first voice using the language configuration information stored locally, or pull the unified language configuration information from the server and recognize and translate the first voice according to it. When performed on the server, the server can pull the corresponding language configuration information from the first terminal and use it, or recognize and translate the first voice using the unified language configuration information stored on the server.
In the above embodiment, having the server convert the first voice into the corresponding first text reduces the terminal resources that the terminal would otherwise consume performing the conversion.
In one embodiment, the above video call method further includes: generating a corresponding subtitle image from each version of the first text and caching the subtitle image. Compositing the source video frames with the first text of each target language then includes: querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; if so, compositing the updated subtitle image with each source video frame produced by the target member within a second preset duration before the current time, and deleting the composited subtitle image from the cache. The second preset duration is shorter than the first preset duration.
The first preset duration is set by the developer of the subtitle synthesis plug-in according to the frame rate of the played video. For example, an instant messaging application typically plays video at a rate of 30 frames per second, in which case the developer may set the first preset duration to 30 milliseconds. The second preset duration is the interval at which the subtitle synthesis plug-in reads source video frames from the image capture component. If it is too long, the target video frames received by the counterpart member are delayed too much; if it is too short, the counterpart member receives too few subtitle-embedded target video frames to read the subtitle content. It therefore needs to be set reasonably, for example to 3 seconds.
Specifically, when the video call starts, the image capture component in the terminal collects the target member's image information in real time and caches the image information together with its capture time in an image buffer.
Further, after the subtitle synthesis plug-in generates a subtitle image, it checks whether the preset subtitle buffer already holds a cached subtitle image; if so, it clears the subtitle buffer and then caches the newly generated subtitle image there.
Further, the subtitle synthesis plug-in checks the subtitle buffer for an updated subtitle image at intervals of the first preset duration. When an updated subtitle image is present, the plug-in reads from the image buffer at least one source video frame captured by the image capture component within the second preset duration before the current time, and then deletes the read frames from the image buffer. If no updated subtitle image appears in the subtitle buffer within the second preset duration from the current time, the plug-in sends the source video frames from that interval directly to the counterpart member and deletes the sent frames from the image buffer.
Further, the subtitle synthesis plug-in composites the subtitle image corresponding to each second target language with each source video frame read from the image buffer to obtain the corresponding target video frames, and deletes the composited subtitle image from the subtitle buffer.
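A minimal single-tick sketch of the buffering scheme described above. The function name, the `(timestamp, frame)` list shape, and the string concatenation standing in for pixel compositing are all illustrative assumptions, not the patent's implementation:

```python
def poll_once(subtitle_buffer, frame_buffer, now, second_preset):
    """One polling tick of the subtitle synthesis plug-in.

    subtitle_buffer: list holding at most one cached subtitle image.
    frame_buffer: list of (capture_time, frame) pairs from the image capture
    component. Frames within `second_preset` seconds before `now` are
    composited with the updated subtitle image if one exists, or sent raw
    otherwise; consumed frames are deleted from the buffer.
    """
    recent = [f for t, f in frame_buffer if now - t <= second_preset]
    if subtitle_buffer:
        subtitle = subtitle_buffer.pop()                 # consume cached image
        out = [f"{frame}+{subtitle}" for frame in recent]
    else:
        out = list(recent)                               # no update: send raw
    # delete the frames that have been read/sent from the image buffer
    frame_buffer[:] = [(t, f) for t, f in frame_buffer if now - t > second_preset]
    return out
```

A caller would invoke this every first preset duration (e.g. every 30 ms), while the capture side appends `(time, frame)` pairs and the text side replaces the subtitle buffer's contents whenever a new first text arrives.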
In the above embodiment, querying the subtitle buffer for updated subtitle images at regular intervals allows the latest subtitle image to be obtained promptly, so that the composited target video frames can be sent to the counterpart member in time. Compositing the latest subtitle image with multiple source video frames allows the counterpart member to read the subtitle content across multiple target video frames.
In one embodiment, generating a corresponding subtitle image from each version of the first text includes: determining the subtitle width according to the image width of the source video frame; converting the subtitle width into a character-count threshold for each target language; splitting the corresponding first text into multiple sub-texts according to each threshold; determining the subtitle height of the corresponding first text according to the number of its sub-texts; and adding the first text as subtitle content to a background image generated from the subtitle width and subtitle height, obtaining the subtitle image.
The character-count threshold is the maximum number of characters that a single subtitle line can display.
Specifically, the subtitle synthesis plug-in determines the image width of the source video frame and derives the subtitle width from a preset width ratio. For example, if the preset subtitle width ratio is two thirds, the plug-in sets the subtitle width to two thirds of the image width of the source video frame.
Further, the terminal stores, for each target language, the correspondence between the width of a single character and the spacing between characters. The subtitle synthesis plug-in obtains the second target language corresponding to each first text and, according to that language's information, determines the corresponding single-character width and inter-character spacing. From the obtained subtitle width, single-character width, and inter-character spacing it computes the character-count threshold for the second target language, i.e. the number of characters a single subtitle line can display.
Further, the subtitle synthesis plug-in counts the characters in the first text to obtain the total character count, divides the total by the character-count threshold to obtain the number of sub-texts, and creates that many sub-texts. Starting from the first character of the first text, the plug-in reads up to the threshold number of characters and stores them in a sub-text. It then deletes the read characters from the first text and continues reading threshold-sized runs of characters into sub-texts that have not yet been filled, until all characters of the first text have been consumed.
Further, the subtitle synthesis plug-in counts the number of sub-texts of the first text and determines the number of subtitle lines in the subtitle image from that count. For example, with three sub-texts the plug-in treats the subtitle image to be generated as having three subtitle lines, and computes the subtitle height of the first text from the preset single-line subtitle height and the total number of lines.
Further, the subtitle synthesis plug-in generates a background image of the corresponding size from the subtitle width and subtitle height, and adds the characters of each sub-text to the background image as subtitle content.
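The layout computation described in this embodiment can be sketched as follows. The width ratio, character metrics, and line height are illustrative assumptions, and strings stand in for rendered images:

```python
def layout_subtitle(image_width, text, char_width, char_gap,
                    width_ratio=2 / 3, line_height=20):
    """Return (subtitle_width, subtitle_height, subtitle_lines).

    Follows the steps above: width from the source frame's image width and a
    preset ratio; a per-language character-count threshold from character
    width plus spacing; the text split into threshold-sized sub-texts; and
    height from the sub-text count times a single-line height.
    """
    subtitle_width = int(image_width * width_ratio)
    # maximum characters a single subtitle line can display
    max_chars = max(1, subtitle_width // (char_width + char_gap))
    # split the first text into sub-texts, one per subtitle line
    lines = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    subtitle_height = len(lines) * line_height
    return subtitle_width, subtitle_height, lines

w, h, lines = layout_subtitle(image_width=720, text="A" * 40,
                              char_width=14, char_gap=2)
print(w, h, [len(line) for line in lines])
```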
In the above embodiment, determining the subtitle width from the image width of the source video frame reduces the probability that the generated subtitle image is wider than the source video frame and the subtitles spill outside the video picture; determining the background image height from the number of sub-texts avoids generating unnecessary background image area.
In one embodiment, the above video call method further includes: capturing the second voice produced by the counterpart member during the video call; obtaining the second text produced by converting the second voice according to the target member's target language; and displaying the second text.
Specifically, during a video call, the second voice of the counterpart member can be sent to the first terminal through the instant messaging application on the second terminal. The instant messaging application on the first terminal receives the second voice and passes it to the audio playback component. The subtitle synthesis plug-in on the first terminal monitors whether the audio playback component has received second voice; when it has, the plug-in acquires the second voice and recognizes and translates it according to the first target language corresponding to the target member in the language configuration information, obtaining the second text.
Further, the subtitle synthesis plug-in displays the generated second text on the screen of the first terminal.
FIG. 5 is a schematic diagram of displaying the second text in a pop-up window in an embodiment. The first terminal may display the second text in the form of a pop-up window, or in the form of a prompt message as shown in FIG. 6, which is a schematic diagram of displaying the second text in the form of a prompt message in an embodiment.
In the above embodiment, because the second text can be displayed in multiple forms, the target member can freely choose a suitable display form based on actual needs, which greatly improves the user experience.
In one embodiment, from the moment the second text is displayed, the first terminal determines whether the target member actively closes it within a preset duration. If the target member does not, the first terminal can generate a close instruction for the second text to close the displayed text automatically, so that once the target member has finished reading the second text it closes on its own, saving the display resources the terminal consumes displaying it.
In one embodiment, the target member can manually close the displayed second text, for example by tapping a close control or by a sliding operation on the screen.
In one embodiment, even when the target member minimizes the instant messaging application, the first terminal can still display the second text as a pop-up window or prompt message.
In the above embodiment, displaying the second text as a pop-up window or prompt message decouples its display from the video call page, so that even when the instant messaging application moves to the background, the target member can still communicate smoothly with the counterpart member based on the content of the second text.
In one embodiment, in a multi-party video call, the second voice collected via the audio playback component may mix the voices of several counterpart members. In that case the subtitle synthesis plug-in extracts timbre information from the second voice, divides the second voice into multiple second sub-voices according to the timbre information, and converts the second sub-voices according to the target member's target language to obtain multiple second texts, which the first terminal then displays correspondingly. Dividing the second voice by timbre lets the plug-in distinguish the second sub-voices of different counterpart members, so that in a multi-party call scene, displaying multiple second texts helps the target member distinguish what different counterpart members are saying, improving the communication efficiency of multi-party video calls.
In the above embodiment, displaying the second text on the terminal lets the target member understand what the counterpart member is saying even when the subtitle synthesis plug-in is not installed on the second terminal, so that the video call can proceed smoothly.
In one embodiment, the video call page includes a video frame display area for the target member and for each counterpart member. The above video call method further includes: displaying, in the target member's display area, the composited target video frame in the target member's own target language, denoted the first target video frame; obtaining a second target video frame from a counterpart member, the second target video frame being obtained by converting the second voice produced by the counterpart member during the call into second text according to the target member's target language and compositing the converted second text with the source video frames produced by the counterpart member during the call; and displaying the second target video frame in the counterpart member's display area.
Specifically, the subtitle synthesis plug-in can convert the first voice according to the first target language corresponding to the target member to obtain the corresponding first text, and composite the first text with the source video frames to obtain the first target video frame in the target member's target language.
Further, when the subtitle synthesis plug-in is installed on the second terminal, the second terminal can convert the second voice produced by the counterpart member during the video call into second text according to the target member's target language, composite the converted second text with the source video frames produced by the counterpart member during the call to obtain the second target video frame, and send the composited second target video frame to the first terminal.
Further, after the first terminal obtains the first target video frame and the second target video frame, it obtains the page size of the video call page and, according to the page size, divides the page into a video frame display area for the target member and display areas for the counterpart members. For example, the first terminal counts the total number of members participating in the video call, divides the page evenly into that many video frame display areas, and by convention assigns the first area to the target member.
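The even division of the call page can be sketched as below, assuming (for illustration only) a horizontal-strip layout with the first strip assigned by convention to the target member:

```python
def divide_display_areas(page_width, page_height, member_ids):
    """Split the page into equal horizontal strips, one per call member.

    Returns {member_id: (x, y, width, height)}; member_ids[0] is the target
    (local) member, matching the convention described above.
    """
    strip_height = page_height // len(member_ids)
    return {
        member: (0, i * strip_height, page_width, strip_height)
        for i, member in enumerate(member_ids)
    }

# Target member in a call with B and C on a 1080x1920 page.
areas = divide_display_areas(1080, 1920, ["target", "B", "C"])
print(areas)
```

Each target video frame would then be resized to its member's area so the frame is displayed in full.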
Further, FIG. 7 is a schematic diagram of video frame display areas in an embodiment. The first terminal obtains the sizes of the display areas corresponding to the target member and the counterpart members, and resizes the first and second target video frames according to those sizes, so that the display areas shown in FIG. 7 can present the first and second target video frames in full.
In one embodiment, the target member can resize the video frame display areas as needed. For example, when the target member is in a video call with B and C, the target member can enlarge the display area corresponding to B, whereupon the display areas corresponding to the target member and to C shrink correspondingly, making the video call better match the target member's actual needs.
In one embodiment, when the target member finds an error in the subtitles of the displayed first target video frame, the target member can mark the erroneous characters; the subtitle synthesis plug-in then generates a correction page in response to the target member's marking operation. On the correction page, the target member can enter the characters that should be displayed in place of the erroneous ones.
Further, the subtitle synthesis plug-in stores the erroneous characters and the characters to be displayed in a character library. When the plug-in recognizes the same erroneous characters again, it can decide, based on the intended characters in the library, whether to correct them.
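A minimal sketch of the correction library just described; the dictionary shape and simple substring replacement are assumptions made for illustration:

```python
def apply_corrections(text, corrections):
    """Replace previously marked erroneous characters with their intended
    replacements, as stored in the character library.

    corrections: {erroneous_string: intended_string}
    """
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text

library = {"recieve": "receive"}
print(apply_corrections("please recieve the call", library))
```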
In the above embodiment, displaying the first target video frame in the corresponding display area lets the target user check in real time whether the subtitle content shown in the first target video frame is correct, so erroneous characters can be marked promptly when found, which in turn improves the accuracy of the subtitle synthesis plug-in's speech translation.
In one embodiment, the above video call method further includes: capturing the second voice produced by the counterpart member during the video call; obtaining the second text produced by converting the second voice according to the target member's target language; determining a display style for the obtained second text according to the size of each counterpart member's video frame display area; and displaying the obtained second text in a pop-up window on the video call page according to that display style.
The display style of the second text includes the transparency, size, and color of its characters.
Specifically, the subtitle synthesis plug-in obtains the second voice produced during the video call from the audio playback component and converts it according to the target member's target language to obtain the second text. The plug-in obtains the size of each counterpart member's video frame display area; when every counterpart display area is smaller than an area threshold, it can be assumed that the target member cannot clearly read the subtitles shown in those areas, so the plug-in, based on a preset configuration file, correspondingly lowers the character transparency, increases the character size, and changes the character color to a more conspicuous one.
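The style rule in this embodiment can be sketched as follows; the area threshold and the concrete style values are illustrative assumptions, not values from the patent:

```python
def pick_text_style(area_sizes, area_threshold=100_000):
    """Choose a pop-up text style from the counterpart display-area sizes.

    area_sizes: list of (width, height) for each counterpart member's video
    frame display area. When every area is below the threshold, return a more
    prominent style (opaque, larger, conspicuous color); otherwise a subdued
    one, reducing interference when the subtitles are already readable.
    """
    all_small = all(w * h < area_threshold for w, h in area_sizes)
    if all_small:
        return {"opacity": 1.0, "font_size": 24, "color": "yellow"}
    return {"opacity": 0.6, "font_size": 16, "color": "white"}

print(pick_text_style([(200, 300)]))   # small areas -> prominent style
print(pick_text_style([(500, 400)]))   # large area  -> subdued style
```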
In one embodiment, the subtitle synthesis plug-in can generate a style adjustment control on the terminal, through which the target member can adjust the style of the second text.
In the above embodiment, providing a style adjustment control on the terminal lets the target member adjust the display style of the second text independently, improving the user experience.
In the above embodiment, adjusting the style of the second text in real time according to the size of the counterpart members' display areas not only reduces cases where the target member cannot clearly read the subtitle content because the display areas are too small, but also, when the display areas are large enough, lowers the prominence of the second text and reduces the interference to the target member caused by redundantly displaying the counterpart members' voice information.
In one embodiment, the terminal includes an audio collection component and an audio playback component; in the above video call method, the first voice is generated based on the audio collection component, and the second voice is obtained based on the audio playback component.
Specifically, during a video call, the audio collection component in the first terminal, such as a microphone, records the first voice of the target member in real time and transmits it as a voice stream to the subtitle synthesis plug-in, which generates the corresponding first text.
The audio collection component in the second terminal likewise collects the second voice of the counterpart member in real time and sends it to the first terminal through the instant messaging application. The instant messaging application in the first terminal receives the second voice and forwards it to the audio playback component. The subtitle synthesis plug-in in the first terminal monitors whether the audio playback component has received the second voice; when it has, the plug-in obtains the second voice and, according to the first target language corresponding to the target member in the language configuration information, recognizes and translates it to obtain the second text.
In the foregoing embodiment, by separately reading the voice collected by the audio collection component and the voice received by the audio playback component, the subtitle synthesis plug-in can clearly distinguish the voice produced by the target member from the voice produced by the counterpart member, so that the first text and the second text can subsequently be generated from the respective voices.
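The routing of the two audio sources can be illustrated with a minimal sketch. The function and source labels below are assumptions for illustration; the patent does not prescribe this interface.

```python
def route_voice(source, voice):
    """Tag a voice stream by the component it was read from.

    source: 'capture' for the audio collection component (target member),
            'playback' for the audio playback component (counterpart member).
    Returns which text the voice should be converted into, paired with the voice.
    """
    if source == "capture":
        return ("first_text", voice)   # target member's speech -> first text
    if source == "playback":
        return ("second_text", voice)  # counterpart member's speech -> second text
    raise ValueError("unknown audio source: %r" % source)
```

Because the source component is known at read time, no speaker identification is needed to decide whether a given utterance becomes the first text or the second text.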
It should be understood that although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time, but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least some of the other steps, or of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 8, a video call apparatus 800 is provided, including a first text generation module 802, a target video frame synthesis module 804, and a page display module 806, wherein:
The first text generation module 802 is configured to collect the first voice and the source video frames generated by the target member in the video call, and to convert the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text.
The target video frame synthesis module 804 is configured to synthesize the source video frames with the first text corresponding to each target language, to obtain a target video frame corresponding to each target language.
The page display module 806 is configured to send the obtained target video frames of each target language to the corresponding counterpart members.
In one embodiment, as shown in FIG. 9, the above video call apparatus 800 further includes a language configuration module 808, configured to display a language configuration page when a configuration operation for the target language is triggered; obtain the language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and store the member identifier of the target member in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's member identifier exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
In one embodiment, the language configuration module 808 is further configured to send the first voice to the server, and to receive the first text returned by the server, obtained by converting the first voice according to the target language of each counterpart member.
In one embodiment, the target video frame synthesis module 804 is further configured to generate a corresponding subtitle image based on each kind of first text and cache the subtitle images; query, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and if so, synthesize the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and delete the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
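The timing logic of this embodiment can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: `POLL_INTERVAL` and `FRAME_WINDOW` stand in for the first and second preset durations, and the cache is modeled as a plain dictionary keyed by target language.

```python
POLL_INTERVAL = 1.0   # first preset duration (seconds); hypothetical value
FRAME_WINDOW = 0.5    # second preset duration; must be < POLL_INTERVAL

def composite_pending(cache, frames, now):
    """Run one polling pass over the subtitle cache.

    cache:  dict mapping target language -> updated subtitle image
    frames: list of (timestamp, source_frame) produced by the target member
    now:    current time in seconds

    Composites each cached subtitle image onto every source frame produced
    within FRAME_WINDOW before `now`, then removes the image from the cache.
    Returns the list of (source_frame, subtitle_image) pairs composited.
    """
    composited = []
    for language, subtitle in list(cache.items()):
        recent = [f for t, f in frames if now - FRAME_WINDOW <= t <= now]
        for frame in recent:
            composited.append((frame, subtitle))
        del cache[language]  # a subtitle image leaves the cache once synthesized
    return composited
```

Because the frame window is shorter than the polling interval, each subtitle image is composited onto only the most recent frames, keeping the displayed text aligned with current speech.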
In one embodiment, the target video frame synthesis module 804 is further configured to determine a subtitle width according to the image width of the source video frame; convert the subtitle width into a character count threshold corresponding to each target language; split the corresponding first text into multiple sub-texts according to the different character count thresholds; determine the subtitle height of the corresponding first text according to the number of its sub-texts; and add the first text as subtitle content to a background image generated according to the subtitle width and subtitle height, to obtain the subtitle image.
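The layout steps above (width from the frame, per-language character threshold, splitting into sub-texts, height from the line count) can be sketched as follows. The constants `CHAR_WIDTH` and `LINE_HEIGHT`, and the 0.9 width ratio, are hypothetical rendering parameters, not values from the patent.

```python
CHAR_WIDTH = {"en": 10, "zh": 20}   # assumed average pixel width per character
LINE_HEIGHT = 24                    # assumed pixel height per subtitle line

def layout_subtitle(frame_width, language, text):
    """Return (subtitle_width, subtitle_height, sub_texts) for one first text."""
    subtitle_width = int(frame_width * 0.9)              # width from frame width
    chars_per_line = subtitle_width // CHAR_WIDTH[language]  # character threshold
    # Split the first text into sub-texts according to the threshold.
    sub_texts = [text[i:i + chars_per_line]
                 for i in range(0, len(text), chars_per_line)]
    subtitle_height = LINE_HEIGHT * len(sub_texts)       # height from line count
    return subtitle_width, subtitle_height, sub_texts
```

Note that because the character threshold depends on the language (wider glyphs give a lower threshold), the same sentence may occupy a different number of subtitle lines in each target language.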
In one embodiment, the video call apparatus 800 further includes a second text generation module 810, configured to collect the second voice generated by the counterpart member in the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; and display the second text.
In one embodiment, the video call apparatus 800 further includes a video frame display area determination module 812, configured to display, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as the first target video frame; obtain a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, the second voice generated by the counterpart member in the video call into the second text, and synthesizing the converted second text with the source video frames generated by the counterpart member in the video call; and display the second target video frame in the video frame display area corresponding to the counterpart member.
In one embodiment, the video frame display area determination module 812 is further configured to collect the second voice generated by the counterpart member in the video call; obtain the second text obtained by converting the second voice according to the target language corresponding to the target member; determine the display style of the obtained second text according to the size of the video frame display area corresponding to each counterpart member; and display the obtained second text in a pop-up window on the video call page according to the display style.
In one embodiment, the video call apparatus 800 further includes a voice acquisition module 814, configured to collect the first voice based on the audio collection component and to collect the second voice based on the audio playback component.
For the specific limitations of the video call apparatus, reference may be made to the limitations of the video call method above, which will not be repeated here. The modules in the above video call apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be the first terminal, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, a display screen, an audio collection device, an audio playback device, an image collection device, and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a video call method. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements: collecting the first voice and the source video frames generated by the target member in a video call; converting the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text; synthesizing the source video frames with the first text corresponding to each target language, to obtain target video frames corresponding to each target language; and sending the obtained target video frames of each target language to the corresponding counterpart members.
In one embodiment, when executing the computer program, the processor further implements: displaying a language configuration page when a configuration operation for the target language is triggered; obtaining the language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and storing the member identifier of the target member in association with the language configuration information on the server, so that when language configuration information associated with a counterpart member's member identifier exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
In one embodiment, when executing the computer program, the processor further implements: sending the first voice to the server, and receiving the first text returned by the server, obtained by converting the first voice according to the target language of each counterpart member.
In one embodiment, when executing the computer program, the processor further implements: generating a corresponding subtitle image based on each kind of first text, and caching the subtitle images. Synthesizing the source video frames with the first text corresponding to each target language includes: querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
In one embodiment, when executing the computer program, the processor further implements: determining a subtitle width according to the image width of the source video frame; converting the subtitle width into a character count threshold corresponding to each target language; splitting the corresponding first text into multiple sub-texts according to the different character count thresholds; determining the subtitle height of the corresponding first text according to the number of its sub-texts; and adding the first text as subtitle content to a background image generated according to the subtitle width and subtitle height, to obtain the subtitle image.
In one embodiment, when executing the computer program, the processor further implements: collecting the second voice generated by the counterpart member in the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; and displaying the second text.
In one embodiment, the video call page includes video frame display areas respectively corresponding to the target member and each counterpart member. When executing the computer program, the processor further implements: displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as the first target video frame; obtaining a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, the second voice generated by the counterpart member in the video call into the second text, and synthesizing the converted second text with the source video frames generated by the counterpart member in the video call; and displaying the second target video frame in the video frame display area corresponding to the counterpart member.
In one embodiment, when executing the computer program, the processor further implements: collecting the second voice generated by the counterpart member in the video call; obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member; determining the display style of the obtained second text according to the size of the video frame display area corresponding to each counterpart member; and displaying the obtained second text in a pop-up window on the video call page according to the display style.
In one embodiment, the terminal includes an audio collection component and an audio playback component. When executing the computer program, the processor further implements the following: the first voice is generated based on the audio collection component, and the second voice is generated based on the audio playback component.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements: collecting the first voice and the source video frames generated by the target member in a video call; converting the first voice according to the preset target languages respectively pointed to by the counterpart members participating in the video call, to obtain the first text; synthesizing the source video frames with the first text corresponding to each target language, to obtain target video frames corresponding to each target language; and sending the obtained target video frames of each target language to the corresponding counterpart members.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (19)

  1. A video call method, comprising:
    collecting a first voice and source video frames generated by a target member in a video call;
    converting the first voice according to preset target languages respectively pointed to by counterpart members participating in the video call, to obtain a first text;
    synthesizing the source video frames with the first text corresponding to each target language, to obtain a target video frame corresponding to each target language; and
    sending the obtained target video frame of each target language to the corresponding counterpart member.
  2. The method according to claim 1, further comprising:
    displaying a language configuration page when a configuration operation for the target languages is triggered;
    obtaining language configuration information configured on the language configuration page, the language configuration information including candidate languages respectively corresponding to the target member and the counterpart members participating in the video call; and
    storing a member identifier of the target member in association with the language configuration information on a server, so that when language configuration information associated with a member identifier of a counterpart member exists, the server takes the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
  3. The method according to claim 2, further comprising:
    sending the first voice to a server; and
    receiving, from the server, the first text obtained by converting the first voice according to the target language of each counterpart member.
  4. The method according to claim 1, further comprising:
    generating a corresponding subtitle image based on each kind of first text, and caching the subtitle images;
    wherein synthesizing the source video frames with the first text corresponding to each target language comprises:
    querying, at intervals of a first preset duration, whether an updated subtitle image exists in the cache; and
    if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the cache, the second preset duration being shorter than the first preset duration.
  5. The method according to claim 4, wherein generating the corresponding subtitle image based on each kind of first text comprises:
    determining a subtitle width according to an image width of the source video frame;
    converting the subtitle width into a character count threshold corresponding to each target language;
    splitting the corresponding first text into multiple sub-texts according to the different character count thresholds;
    determining a subtitle height of the corresponding first text according to the number of its sub-texts; and
    adding the first text as subtitle content to a background image generated according to the subtitle width and the subtitle height, to obtain the subtitle image.
  6. The method according to claim 1, further comprising:
    collecting a second voice generated by the counterpart member in the video call;
    obtaining a second text obtained by converting the second voice according to the target language corresponding to the target member; and
    displaying the second text.
  7. The method according to claim 1, wherein a page of the video call includes video frame display areas respectively corresponding to the target member and each counterpart member; the method further comprising:
    displaying, in the video frame display area corresponding to the target member, the synthesized target video frame corresponding to the target member's target language, denoted as a first target video frame;
    obtaining a second target video frame from the counterpart member, the second target video frame being obtained by converting, according to the target language corresponding to the target member, a second voice generated by the counterpart member in the video call into a second text, and synthesizing the converted second text with source video frames generated by the counterpart member in the video call; and
    displaying the second target video frame in the video frame display area corresponding to the counterpart member.
  8. The method according to claim 7, further comprising:
    collecting the second voice generated by the counterpart member in the video call;
    obtaining the second text obtained by converting the second voice according to the target language corresponding to the target member;
    determining a display style of the obtained second text according to a size of the video frame display area corresponding to each counterpart member; and
    displaying the obtained second text in a pop-up window on the page of the video call according to the display style.
  9. The method according to any one of claims 6 to 8, wherein the terminal includes an audio collection component and an audio playback component; the first voice is generated based on the audio collection component, and the second voice is generated based on the audio playback component.
  10. The method according to claim 1, after collecting the first voice and the source video frames generated by the target member in the video call, further comprising:
    buffering the collected first voice and determining an input time of the first voice;
    determining whether a new first voice is received within a preset duration from the input time;
    in response to receiving a new first voice, continuing to buffer the new first voice; and
    in response to no new first voice being received, splicing the buffered first voices to obtain a spliced first voice.
  11. The method according to claim 1, wherein said converting the first voice according to the preset target languages respectively specified by the counterpart members participating in the video call to obtain the first text comprises:
    recognizing the first voice according to a first target language, and converting the first voice into the first text corresponding to the first target language according to the recognition result;
    checking whether a second target language is the same as the first target language; and
    in response to the second target language being different from the first target language, counting the language version kinds of the second target language, and translating the first text corresponding to the first target language according to each different language version kind of the second target language, to obtain the first text corresponding to the second target language.
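The recognize-once, translate-per-distinct-language structure of claim 11 could be sketched like this. The function names and the injected `recognize`/`translate` callables are placeholders for real speech-recognition and translation services, not anything named in the patent.

```python
def convert_voice(first_voice, first_target_language, second_target_languages,
                  recognize, translate):
    """Sketch of claim 11: recognize the voice in the first target language,
    then translate only for the distinct second target languages that differ
    from it, so identical languages share one translation."""
    # Recognize the first voice according to the first target language.
    texts = {first_target_language: recognize(first_voice, first_target_language)}
    # Count the distinct second target languages and translate once per kind.
    for lang in set(second_target_languages):
        if lang != first_target_language:  # same language needs no translation
            texts[lang] = translate(texts[first_target_language], lang)
    return texts
```

Deduplicating with `set()` mirrors the claim's "counting the language version kinds": with many listeners sharing a language, each language is translated exactly once.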
  12. The method according to claim 2, further comprising:
    sending the language configuration information to the counterpart member, and monitoring the operation behavior of the counterpart member;
    when a target language change operation is triggered, displaying a language change page;
    re-determining the target language corresponding to the counterpart member on the language change page; and
    converting the first voice according to the re-determined target language to obtain the corresponding first text.
  13. A video call method, applied among multiple members, comprising:
    acquiring a first voice and a source video frame generated by a target member in a video call;
    converting the first voice according to preset target languages respectively specified by the other members participating in the video call to obtain a first text;
    synthesizing the source video frame with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and
    sending the obtained target video frame of each target language to the other members.
  14. The method according to claim 13, further comprising:
    acquiring a member identifier and language configuration information of the target member, storing the member identifier in association with the language configuration information, and using the candidate language corresponding to the member identifier associated with each piece of language configuration information as the target language of the corresponding member.
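The member-to-language association of claim 14 amounts to a keyed store. A minimal sketch, in which the class name, method names, and the default-language fallback are illustrative assumptions:

```python
class LanguageRegistry:
    """Sketch of claim 14: store each member identifier in association with
    its language configuration, and expose the associated candidate language
    as that member's target language."""

    def __init__(self, default_language="en"):
        self._config = {}  # member identifier -> candidate language
        self._default = default_language

    def register(self, member_id, candidate_language):
        # Store the member identifier in association with the language config.
        self._config[member_id] = candidate_language

    def target_language(self, member_id):
        # The candidate language associated with the member identifier
        # becomes that member's target language.
        return self._config.get(member_id, self._default)
```

The fallback default is an addition for robustness; the claim itself only describes the association step.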
  15. The method according to claim 13, further comprising:
    generating a corresponding subtitle image based on each first text, and buffering the subtitle image;
    wherein said synthesizing the source video frame with the first text corresponding to each target language comprises:
    querying, every first preset duration, whether an updated subtitle image exists in the buffer; and
    if so, synthesizing the updated subtitle image with each source video frame generated by the target member within a second preset duration before the current time, and deleting the synthesized subtitle image from the buffer; the second preset duration being shorter than the first preset duration.
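One iteration of the claim-15 query-and-composite loop (run every first preset duration) could be sketched as below. The function name and the injected `overlay` callable, which stands in for the actual image compositing routine, are illustrative assumptions.

```python
def composite_step(subtitle_cache, recent_frames, overlay):
    """Sketch of one claim-15 iteration: if an updated subtitle image exists
    in the cache, composite it onto every source frame produced within the
    second preset duration, then delete the consumed image from the cache."""
    if not subtitle_cache:
        return recent_frames  # no updated subtitle: frames pass through as-is
    subtitle = subtitle_cache.pop(0)  # take the updated image and evict it
    return [overlay(frame, subtitle) for frame in recent_frames]
```

Keeping the second duration shorter than the first, as the claim requires, ensures each subtitle image is composited onto frames from only one polling window.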
  16. The method according to claim 15, wherein said generating a corresponding subtitle image based on each first text comprises:
    determining a subtitle width according to the image width of the source video frame;
    converting the subtitle width into a character count threshold corresponding to each target language;
    splitting the corresponding first text into multiple sub-texts according to the different character count thresholds;
    determining the subtitle height of the corresponding first text according to the number of sub-texts corresponding to the first text; and
    adding the first text as subtitle content to a background image generated according to the subtitle width and the subtitle height, obtaining the subtitle image, and sending it to the corresponding member.
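The layout arithmetic in claim 16 could be sketched as follows. All constants (margin, average character width, line height) and the function name are illustrative assumptions; a real implementation would derive the character threshold from font metrics per target language.

```python
import textwrap

def subtitle_layout(frame_width, first_text, avg_char_width=16,
                    line_height=24, margin=40):
    """Sketch of claim 16: derive the subtitle width from the frame width,
    convert it to a character count threshold, split the first text into
    sub-texts, and size the subtitle height from the sub-text count."""
    subtitle_width = frame_width - 2 * margin                    # from image width
    chars_per_line = max(1, subtitle_width // avg_char_width)    # char threshold
    sub_texts = textwrap.wrap(first_text, chars_per_line)        # split sub-texts
    subtitle_height = len(sub_texts) * line_height               # from count
    return subtitle_width, subtitle_height, sub_texts
```

The returned width and height would size the background image onto which the sub-texts are drawn line by line.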
  17. A video call apparatus, the apparatus comprising:
    a first text generation module, configured to collect a first voice and a source video frame generated by a target member in a video call, and to convert the first voice according to preset target languages respectively specified by the counterpart members participating in the video call to obtain a first text;
    a target video frame synthesis module, configured to synthesize the source video frame with the first text corresponding to each target language to obtain a target video frame corresponding to each target language; and
    a page display module, configured to send the obtained target video frame of each target language to the corresponding counterpart member.
  18. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 12.
  19. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
PCT/CN2020/118049 2019-09-27 2020-09-27 Video call method and apparatus, computer device and storage medium WO2021057957A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910925194.9A CN112584078B (en) 2019-09-27 2019-09-27 Video call method, video call device, computer equipment and storage medium
CN201910925194.9 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021057957A1 2021-04-01

Family

ID=75110185

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118049 WO2021057957A1 (en) 2019-09-27 2020-09-27 Video call method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112584078B (en)
WO (1) WO2021057957A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627301A (en) * 2021-08-02 2021-11-09 科大讯飞股份有限公司 Real-time video information extraction method, device and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113225614A (en) * 2021-04-20 2021-08-06 深圳市九洲电器有限公司 Video playing method, device, server and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002163400A (en) * 2000-11-28 2002-06-07 Mitsuaki Arita Language conversion mediating method, language conversion mediation processor and computer readable recording medium
CN101542462A (en) * 2007-05-16 2009-09-23 莫卡有限公司 Establishing and translating within multilingual group messaging sessions using multiple messaging protocols
US20140157113A1 (en) * 2012-11-30 2014-06-05 Ricoh Co., Ltd. System and Method for Translating Content between Devices
CN104219459A (en) * 2014-09-30 2014-12-17 上海摩软通讯技术有限公司 Video language translation method and system and intelligent display device
CN104780335A (en) * 2015-03-26 2015-07-15 中兴通讯股份有限公司 Method and device for WebRTC P2P (web real-time communication peer-to-peer) audio and video call
CN106464768A (en) * 2014-05-27 2017-02-22 微软技术许可有限责任公司 In-call translation
CN106462573A (en) * 2014-05-27 2017-02-22 微软技术许可有限责任公司 In-call translation
CN109688363A (en) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 The method and system of private chat in the multilingual real-time video group in multiple terminals
CN109688367A (en) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 The method and system of the multilingual real-time video group chat in multiple terminals

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004215126A (en) * 2003-01-08 2004-07-29 Cyber Business Corp Multilanguage adaptive moving picture delivery system
CN101262611B (en) * 2005-05-01 2010-10-13 腾讯科技(深圳)有限公司 A stream media player
US8260604B2 (en) * 2008-10-29 2012-09-04 Google Inc. System and method for translating timed text in web video
US8913188B2 (en) * 2008-11-12 2014-12-16 Cisco Technology, Inc. Closed caption translation apparatus and method of translating closed captioning
CN105959772B (en) * 2015-12-22 2019-04-23 合一网络技术(北京)有限公司 Streaming Media and the instant simultaneous display of subtitle, matched processing method, apparatus and system
CN107690089A (en) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 Data processing method, live broadcasting method and device
CN106782545B (en) * 2016-12-16 2019-07-16 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is converted to writing record
CN109246472A (en) * 2018-08-01 2019-01-18 平安科技(深圳)有限公司 Video broadcasting method, device, terminal device and storage medium
CN109274831B (en) * 2018-11-01 2021-08-13 科大讯飞股份有限公司 Voice call method, device, equipment and readable storage medium



Also Published As

Publication number Publication date
CN112584078B (en) 2022-03-18
CN112584078A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN108615527B (en) Data processing method, device and storage medium based on simultaneous interpretation
US10176366B1 (en) Video relay service, communication system, and related methods for performing artificial intelligence sign language translation services in a video relay service environment
US20220239882A1 (en) Interactive information processing method, device and medium
CN110035326A (en) Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
CN110691204B (en) Audio and video processing method and device, electronic equipment and storage medium
WO2017072534A2 (en) Communication system and method
WO2021057957A1 (en) Video call method and apparatus, computer device and storage medium
RU2500081C2 (en) Information processing device, information processing method and recording medium on which computer programme is stored
CN109782997B (en) Data processing method, device and storage medium
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
EP4246425A1 (en) Animal face style image generation method and apparatus, model training method and apparatus, and device
CN111107283B (en) Information display method, electronic equipment and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN110768896B (en) Session information processing method and device, readable storage medium and computer equipment
US10504519B1 (en) Transcription of communications
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
TWM574267U (en) Live broadcast system of synchronous and automatic translation of real-time voice and subtitle
TWI769520B (en) Multi-language speech recognition and translation method and system
CN113709521B (en) System for automatically matching background according to video content
CN114373464A (en) Text display method and device, electronic equipment and storage medium
US20200184973A1 (en) Transcription of communications
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
WO2021092733A1 (en) Subtitle display method and apparatus, electronic device and storage medium
TW202009750A (en) Live broadcast system with instant voice and automatic synchronous translation subtitle and the method of the same enables the other party to directly play the original video information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20868986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/08/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20868986

Country of ref document: EP

Kind code of ref document: A1