WO2019184499A1 - Video call method and device, and computer storage medium - Google Patents

Video call method and device, and computer storage medium Download PDF

Info

Publication number
WO2019184499A1
WO2019184499A1 (PCT/CN2018/124933)
Authority
WO
WIPO (PCT)
Prior art keywords
user
portrait
video frame
terminal
video
Prior art date
Application number
PCT/CN2018/124933
Other languages
French (fr)
Chinese (zh)
Inventor
Xiao Shushan (肖树山)
Ma Xiaojie (马小捷)
Shi Fanpan (石范潘)
Li Sinan (李斯楠)
Xia Yin (夏吟)
Original Assignee
Shanghai Zhangmen Technology Co., Ltd. (上海掌门科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhangmen Technology Co., Ltd. (上海掌门科技有限公司)
Publication of WO2019184499A1 publication Critical patent/WO2019184499A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • the present application relates to Internet application technologies, and in particular, to a video call method, device, and computer storage medium.
  • a video image recorded by the opposite camera is generally displayed in the video call interface; in some cases, a video image recorded by the local camera is also displayed.
  • During a video call between user A and user B, the video image recorded by user B's camera is displayed in user A's video call interface, and the video image recorded by user A's camera is displayed in user B's video call interface; therefore, both parties can intuitively see the video transmitted from the opposite end of the call.
  • This type of call has become so habitual in the field of video calling that those skilled in the art have not recognized that such a video call fails to create a more realistic call scene.
  • the present application provides a method, device, and computer storage medium for video calling.
  • Some embodiments of the present application provide a video call method performed at a terminal. The method includes: collecting, through a first camera, a video frame of the local scene of a first user; acquiring, based on data from a server end, a portrait of a second user who is in a video call with the first user; synthesizing the portrait of the second user with the video frame of the first user's local scene; and displaying the synthesized video frame on the video call interface.
  • Some embodiments of the present application provide a video call method performed at a server end. The method includes: receiving, by the server end, a video frame of a second user sent by the second user's terminal; and sending, according to that video frame, a portrait of the second user to the terminal of a first user who is in a video call with the second user, so that the first user's terminal synthesizes the portrait of the second user with a video frame of the first user's local scene and displays the synthesized video frame on the video call interface.
  • An apparatus comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the claims.
  • A storage medium comprising computer-executable instructions which, when executed by a computer processor, perform the method of any of the claims.
  • The above embodiments of the present application composite the portrait of one party of a video call into the video frame of the other party's local scene. Compared with the prior-art manner, in which the video image recorded by one party's camera is displayed unaltered in the other party's video call interface, the above embodiments can create a more realistic call environment for both parties and thereby improve the video call experience.
  • FIG. 1 is a structural diagram of a video call provided by some embodiments of the present application.
  • FIG. 2 is a flowchart of a method for performing a video call by a terminal according to some embodiments of the present disclosure
  • FIG. 3 is a flowchart of a method for performing a video call by a server according to some embodiments of the present disclosure
  • FIG. 4 is an interaction diagram of a video call provided by a system including a terminal and a server according to some embodiments of the present application;
  • FIG. 5 is a block diagram of a computer system/server provided by some embodiments of the present application.
  • The word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".
  • The core idea of some embodiments of the present application is as follows: when a user makes a video call, what is displayed in that user's video call interface is a video frame in which the other party's portrait has been synthesized into the user's local scene, and the video call proceeds on the basis of the synthesized video frames. In this manner, some embodiments of the present application can provide a more realistic call environment and improve the user's video call experience.
  • Some embodiments of the present application may implement a video call based on the following architecture. As shown in FIG. 1, the architecture includes a server, a first terminal, a second terminal, ..., and an nth terminal.
  • The terminal may include user equipment, such as mobile user equipment (e.g., a mobile phone or a tablet computer) and fixed user equipment (e.g., a desktop computer). In some embodiments, the terminal may include software running on the user equipment, such as a third-party application, a program or application that comes with the user equipment's system, a plug-in within an application, or a functional unit such as a Software Development Kit (SDK).
  • the server may include a centralized server, and may also include a distributed server; in some embodiments of the present application, the server includes a service server that serves video call services.
  • The number of terminals is not limited in this application; that is, some embodiments of the present application can implement a two-person or a multi-person video call.
  • In the following, a two-person video call is taken as an example for description.
  • FIG. 2 is a flowchart of a method for a video call performed by a terminal according to some embodiments of the present disclosure. As shown in FIG. 2, the method includes:
  • the terminal collects a video frame of the first user local scene through the first camera, and acquires a portrait of the second user who performs a video call with the first user based on data from the server end.
  • Suppose user A and user B make a video call. For user A, user A is the first user and user B is the second user; likewise, for user B, user B is the first user and user A is the second user.
  • In some embodiments, the terminal collects video frames of the first user's local scene through the first camera of the terminal device. That is, for user A, user A's terminal collects video frames of user A's local scene through the first camera of user A's terminal device; for user B, user B's terminal collects video frames of user B's local scene through the first camera of user B's terminal device.
  • the terminal acquires a portrait of the second user who makes a video call with the first user based on the data from the server side.
  • the data from the server side can be directly the portrait of the second user. That is to say, for the user A, the terminal of the user A receives the portrait of the user B from the server side; and for the user B, the terminal of the user B receives the portrait of the user A from the server side.
  • the terminal collects the portrait of the first user through the second camera and sends it to the server, and then the server sends the portrait of the second user who makes a video call with the first user to the terminal.
  • the data from the server side can also be the video frame of the second user.
  • In this case, user A's terminal receives user B's video frame from the server end and extracts user B's portrait from it; likewise, user B's terminal receives user A's video frame from the server end and extracts user A's portrait from it.
  • the first camera may be a rear camera of the terminal device, and the second camera may be a front camera of the terminal device. That is, in some embodiments of the present application, the video frame of the first user is captured by the front camera of the terminal device, and the video frame of the local scene of the first user is collected by the rear camera of the terminal device.
  • In some embodiments, when the terminal sends the portrait of the first user to the server end, the terminal may extract the portrait of the first user from the video frame of the first user collected by the second camera, and then send the extracted portrait to the server end.
  • Alternatively, the video frame of the first user collected by the second camera may be sent directly by the terminal to the server end, and the server end extracts the portrait of the first user from that video frame.
  • the portrait of the second user is combined with the video frame of the first user local scene, and the synthesized video frame is displayed on the video call interface.
  • That is, according to the video frame of the first user's local scene collected by the terminal and the portrait of the second user acquired based on data from the server end, the portrait of the second user is synthesized with the video frame of the local scene, so that the user makes the video call based on the synthesized video frames.
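The synthesis step above can be sketched as a simple mask-based overlay. The following Python sketch is illustrative only (the function name and the binary-mask representation of the extracted portrait are assumptions, not from the application):

```python
def composite_portrait(background, portrait, mask, top, left):
    """Overlay `portrait` onto `background` where `mask` is 1.

    background: H x W list of pixel values (e.g., RGB tuples or ints)
    portrait:   h x w list of pixel values (the second user's portrait)
    mask:       h x w list of 0/1 flags (1 = portrait pixel, 0 = transparent)
    (top, left): position of the portrait's top-left corner in the background
    """
    frame = [row[:] for row in background]  # copy so the input is untouched
    for y, (prow, mrow) in enumerate(zip(portrait, mask)):
        for x, (pix, m) in enumerate(zip(prow, mrow)):
            by, bx = top + y, left + x
            if m and 0 <= by < len(frame) and 0 <= bx < len(frame[0]):
                frame[by][bx] = pix
    return frame
```

A real implementation would operate on camera buffers (e.g., via an image library), but the per-pixel masking logic is the same.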
  • In some embodiments, when synthesizing the portrait of the second user with the video frame of the first user's local scene, the following manner may be adopted: determine the display size of the second user's portrait according to the eye distance of that portrait, and synthesize the portrait with the video frame of the local scene according to the determined display size.
  • The eye distance of the second user's portrait may be sent by the server end to the terminal. It can be understood that the size of the second user's portrait need not be adjusted based on the eye distance; the portrait may instead be scaled directly according to a preset ratio.
  • In some embodiments, the terminal acquires the screen size of the terminal device and determines the display size of the second user's portrait according to the relationship between the portrait's eye distance and the screen size.
  • The terminal may determine the screen size of the terminal device according to attribute information of the device, for example, its model information. For example, if the terminal device is an iPhone 7, it can be determined that the screen size is 4.7 inches.
  • The unit of the acquired eye distance may be kept consistent with the default unit of the screen size (for example, both in inches); alternatively, the units may be converted into one another: for example, if the eye distance is in centimeters and the screen size is in inches, centimeters can be converted to inches or inches to centimeters.
  • In some embodiments, the display size of the second user's portrait and the portion of the portrait to be displayed may be determined based on the relationship between the eye distance and the screen size. For example, if the eye distance is greater than E% (e.g., 20%) of the screen size, only the head and a body portion of F times (e.g., 3 times) the eye-distance length below the head are displayed, and the excess is not displayed; if the eye distance is less than or equal to E% (e.g., 20%) of the screen size, only the head and a body portion of G times (e.g., 4 times) the eye-distance length below the head are displayed, and the excess is not displayed.
  • E, F, and G are preset values; preferably, the value of F is smaller than the value of G.
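The E/F/G rule above can be expressed as a small helper. This is a hedged sketch: the function name is invented here, and the defaults E=20, F=3, G=4 merely echo the example figures in the text:

```python
def visible_body_multiple(eye_distance, screen_size, e=20, f=3, g=4):
    """Return how many eye-distance lengths of body (below the head) to show.

    Per the rule above: if the portrait's eye distance exceeds E% of the
    screen size, show the smaller multiple F; otherwise show the larger
    multiple G (F < G, since a larger face leaves less room for the body).
    """
    if eye_distance > screen_size * e / 100.0:
        return f
    return g
```

Both arguments are assumed to be in the same unit (e.g., inches), as the preceding paragraph on unit conversion requires.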
  • The portrait of the second user may also be scaled so that it can be displayed normally on the first user's terminal screen. For example, if the eye distance is so large that it exceeds the screen size (say, twice the screen size) and the screen cannot display the portrait, the portrait is scaled down, for example to half its original size, before being displayed. If the eye distance is too small, the portrait can be scaled up, for example to twice its original size, before being displayed.
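The scaling rule in the preceding paragraph might be sketched as follows. The application does not specify a threshold for "too small", so `min_ratio` below is an assumed illustrative value:

```python
def portrait_scale(eye_distance, screen_size, min_ratio=0.05):
    """Return a scale factor so the portrait displays normally on screen.

    min_ratio is an assumed cutoff (not from the application) below which
    the eye distance counts as "too small".
    """
    if eye_distance > screen_size:
        # Portrait too large to fit: shrink proportionally, e.g., an eye
        # distance of twice the screen size yields a factor of 0.5.
        return screen_size / eye_distance
    if eye_distance < screen_size * min_ratio:
        return 2.0  # too small: enlarge, per the example in the text
    return 1.0
```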
  • In some embodiments, when synthesizing the portrait of the second user with the video frame of the first user's local scene, the following manner may also be adopted. First, select N pixels from the portrait of the second user and M pixels from the video frame of the local scene, where N and M are positive integers greater than 0; the pixels may be selected randomly or according to preset positions. Then compute an intermediate color between the selected N pixels and M pixels, and stroke (outline) the second user's portrait with the computed intermediate color, so that the portrait blends more naturally into the first user's local scene.
  • Finally, the stroked portrait of the second user is synthesized with the video frame of the first user's local scene.
  • It can be understood that the portrait of the second user may also be superimposed directly into the video frame of the first user's local scene without stroking. In summary, in the stroked manner, the intermediate color corresponding to the portrait of the second user and the video frame of the local scene is acquired, the portrait is stroked with that intermediate color, and the stroked portrait is then synthesized with the video frame of the local scene.
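The intermediate-color computation described above (sample N portrait pixels and M scene pixels, then derive a color between them) can be sketched in Python. The equal-weight per-channel average is an assumption, since the application does not define how the intermediate color is computed:

```python
import random

def intermediate_color(portrait_pixels, scene_pixels, n=16, m=16, seed=0):
    """Average n random portrait pixels and m random scene pixels per RGB
    channel, yielding a stroke color between the two palettes."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    samples = (rng.sample(portrait_pixels, min(n, len(portrait_pixels)))
               + rng.sample(scene_pixels, min(m, len(scene_pixels))))
    return tuple(sum(p[i] for p in samples) // len(samples) for i in range(3))
```

The resulting color would then be drawn as an outline around the portrait before compositing.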
  • In some embodiments, the portrait of the second user may be synthesized with the video frame of the first user's local scene at a preset position. That is, in this embodiment the portrait of the second user is placed at a suitable position in the video frame of the local scene, rather than being placed at random.
  • For example, if the preset position is the middle of the bottom edge of the local scene video frame, the portrait of the second user is centered on the bottom edge of the video frame. The preset position may also be the left end, the right end, or another position in the video frame; the superimposition position of the second user's portrait is not limited in this application.
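Placing the portrait at the preset position (e.g., centered on the bottom edge) reduces to computing a top-left offset for the overlay; a minimal sketch, with an invented function name:

```python
def bottom_center_position(frame_w, frame_h, portrait_w, portrait_h):
    """Top-left corner (row, col) that centers the portrait on the frame's
    bottom edge, i.e., the example preset position described above."""
    top = frame_h - portrait_h          # flush with the bottom edge
    left = (frame_w - portrait_w) // 2  # horizontally centered
    return top, left
```

Other preset positions (left end, right end) would simply use different `left` formulas.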
  • FIG. 3 is a flowchart of a method for a video call performed by a server end according to an embodiment of the present disclosure. As shown in FIG. 3, the method includes:
  • the server receives the video frame of the second user sent by the terminal of the second user.
  • For example, user A and user B make a video call. From the perspective of user A's terminal, the video frame of the second user received by the server end is user B's video frame; likewise, from the perspective of user B's terminal, the video frame of the second user received by the server end is user A's video frame.
  • It can be understood that what the server end receives from the terminal of the second user may be the portrait of the second user already extracted by that terminal, or may be the raw video frame of the second user sent by that terminal.
  • According to the video frame of the second user, the portrait of the second user is sent to the terminal of the first user who is in a video call with the second user, so that the first user's terminal synthesizes the portrait of the second user with the video frame of the first user's local scene and displays the synthesized video frame on the video call interface.
  • If the server end receives the raw video frame of the second user, it first extracts the portrait of the second user from that video frame, and then sends the extracted portrait to the terminal of the first user who is in a video call with the second user.
  • In some embodiments, after acquiring the portrait of the second user, the server end may further detect the eye distance of that portrait and provide the detected eye-distance information to the terminal of the first user, so that the first user's terminal adjusts the display size of the second user's portrait based on the eye distance.
  • In some embodiments, if the server end cannot detect the eye distance of the second user's portrait, this indicates that the second user's terminal failed to accurately capture the second user's image, for example because the second user is not facing the camera or the camera is blocked. In that case, the server end returns a prompt message to the second user's terminal, prompting the user to re-collect the portrait.
  • In some embodiments, after eye distances meeting a preset requirement are determined, the determined eye distance and the portrait of the second user corresponding to it may be sent to the terminal of the first user. For example, among a plurality of detected eye distances, the largest eye distance and the portrait of the second user corresponding to it are selected and transmitted to the first user's terminal.
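The server-side selection of the largest qualifying eye distance might look like the following sketch; the function name and the convention of returning `None` to trigger a re-collection prompt are assumptions:

```python
def pick_best_portrait(candidates, min_eye_distance=0.0):
    """candidates: (eye_distance, portrait) pairs detected by the server end.

    Returns the pair with the largest eye distance above the preset minimum,
    or None if no portrait qualifies (in which case the caller would prompt
    the second user's terminal to re-collect the portrait).
    """
    qualifying = [c for c in candidates if c[0] > min_eye_distance]
    if not qualifying:
        return None
    return max(qualifying, key=lambda c: c[0])
```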
  • the terminal of the first user synthesizes the received portrait of the second user with the video frame of the first user local scene.
  • For example, user A and user B make a video call. For user A, user A is the first user and user B is the second user; for user B, user B is the first user and user A is the second user. Suppose the terminal corresponding to user A is terminal UA and the terminal corresponding to user B is terminal UB. Terminal UA sends user A's user image IA to the server end, and terminal UB sends user B's user image IB to the server end.
  • After acquiring user image IA and user image IB, the server end extracts user portrait Ia from user image IA and user portrait Ib from user image IB, then sends portrait Ia to terminal UB and portrait Ib to terminal UA. Terminal UA performs synthesis based on the received portrait Ib, and terminal UB performs synthesis based on the received portrait Ia.
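The cross-routing in this example (Ia goes to UB, Ib goes to UA) generalizes to any set of call pairs. A minimal sketch of the server-side routing, with illustrative names:

```python
def route_portraits(call_pairs, portraits):
    """For each (a, b) call pair, deliver a's portrait to b's terminal and
    b's portrait to a's terminal; returns {terminal: portrait_to_display}."""
    out = {}
    for a, b in call_pairs:
        out[b] = portraits[a]  # e.g., terminal UB receives user A's portrait Ia
        out[a] = portraits[b]  # e.g., terminal UA receives user B's portrait Ib
    return out
```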
  • FIG. 4 is a flow chart of interaction of a video call according to an embodiment of the present application.
  • Suppose user A performs a video call with user B, the terminal corresponding to user A is terminal UA, and the terminal corresponding to user B is terminal UB.
  • Terminal UA can collect user A's video frames with the front camera of user A's terminal device and user A's local scene video frames with the rear camera; likewise, terminal UB can collect user B's video frames with the front camera of user B's terminal device and user B's local scene video frames with the rear camera. Terminal UA and terminal UB then send the recorded video frames of user A and user B, respectively, to the server end. The server end processes the received video frames of user A and user B, obtaining user A's portrait Ia and user B's portrait Ib; it sends the obtained portrait Ia to terminal UB and the obtained portrait Ib to terminal UA. Terminal UA synthesizes the portrait Ib sent by the server end with user A's local scene video frames, and terminal UB synthesizes the portrait Ia sent by the server end with user B's local scene video frames. User A and user B thus each make the video call based on synthesized images, which makes the video call more realistic.
  • FIG. 5 illustrates a block diagram of an exemplary computer system/server 012 suitable for implementing some embodiments of the present application.
  • the computer system/server 012 shown in FIG. 5 is merely an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
  • computer system/server 012 is represented in the form of a general purpose computing device.
  • Components of computer system/server 012 may include, but are not limited to, one or more processors or processing units 016, system memory 028, and bus 018 that connects different system components, including system memory 028 and processing unit 016.
  • Bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
  • By way of example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • Computer system/server 012 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer system/server 012, including volatile and non-volatile media, removable and non-removable media.
  • System memory 028 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032.
  • Computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 034 can be used to read and write non-removable, non-volatile magnetic media (not shown in Figure 5, commonly referred to as a "hard disk drive").
  • Although not shown in FIG. 5, a disk drive for reading and writing a removable non-volatile magnetic disk (e.g., a "floppy disk") may be provided, as well as an optical drive for reading and writing a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media).
  • each drive can be coupled to bus 018 via one or more data medium interfaces.
  • Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the various embodiments of the present application.
  • Program/utility 040, having a set (at least one) of program modules 042, may be stored, for example, in memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more applications, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • Program module 042 typically performs the functions and/or methods of the embodiments described herein.
  • Computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, pointing device, or display 024); in some embodiments of the present application, computer system/server 012 communicates with an external radar device. It may further communicate with one or more devices that enable a user to interact with computer system/server 012, and/or with any device (e.g., a network card or modem) that enables computer system/server 012 to communicate with one or more other computing devices. Such communication can take place via an input/output (I/O) interface 022.
  • Computer system/server 012 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 020. As shown, network adapter 020 communicates with other modules of computer system/server 012 via bus 018. It should be understood that although not shown in the figures, other hardware and/or software modules may be utilized in connection with computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • The processing unit 016, by executing programs stored in the system memory 028, performs various functional applications and data processing, for example, implementing a video call method, which may include:
  • the terminal collects a video frame of the first user local scene through the first camera, and acquires a portrait of the second user that performs a video call with the first user based on data from the server end;
  • a method of video calling can also be implemented, including:
  • the server receives the video frame of the second user sent by the terminal of the second user;
  • The computer program described above may be provided in a computer storage medium encoded with a computer program that, when executed by one or more computers, causes the one or more computers to perform the method flows and/or device operations described in the above embodiments of the present application, which may include:
  • the terminal collects a video frame of the first user local scene through the first camera, and acquires a portrait of the second user that performs a video call with the first user based on data from the server end;
  • the server receives the video frame of the second user sent by the terminal of the second user;
  • The distribution of computer programs is no longer limited to tangible media; programs can also be downloaded directly from a network. Any combination of one or more computer readable media can be utilized.
  • the computer readable medium can be a computer readable signal medium or a computer readable storage medium.
  • the computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • a computer readable storage medium can be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus or device.
  • a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, carrying computer readable program code. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • The computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can transmit, propagate, or transport a program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer, partly on the remote computer, or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
  • the software functional unit described above is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods described in the various embodiments of the present application.
  • the foregoing storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Circuits (AREA)

Abstract

Provided is a video call method executed at a terminal. The method comprises: a terminal collecting a video frame of a local scene of a first user via a first camera, and acquiring, based on data from a server end, a portrait of a second user who is in a video call with the first user; and superimposing the portrait of the second user on the video frame of the local scene of the first user for synthesis, and displaying the synthesized video frame on a video call interface. Also provided is a video call method executed at a server end. The method comprises: a server end receiving a video frame of a second user sent by a terminal of the second user; and sending, according to the video frame of the second user, a portrait of the second user to a terminal of a first user who is in a video call with the second user, so that the terminal of the first user synthesizes the portrait of the second user with a video frame of the local scene of the first user and displays the synthesized video frame on a video call interface. The present application can improve the call experience of a video call.

Description

Method, Device and Computer Storage Medium for Video Call

[Technical Field]
The present application relates to Internet application technologies, and in particular, to a video call method, device, and computer storage medium.
[Background Art]
In the prior art, when a video call is made, the video image recorded by the camera at the opposite end is generally displayed in the video call interface; in some cases, the video image recorded by the local camera is also displayed. For example, if user A and user B make a video call, the video image recorded by user B's camera is displayed in user A's video call interface, and the video image recorded by user A's camera is displayed in user B's video call interface. In this way, both parties can intuitively see the video transmitted from the opposite end of the call, which facilitates communication. This mode of calling has become such a habit in the field of video calling that those skilled in the art have not noticed that it fails to create a more realistic call scene.
[Summary of the Invention]
In view of this, the present application provides a video call method, device, and computer storage medium.
Some embodiments of the present application provide a video call method, the method comprising: a terminal collecting, through a first camera, a video frame of the local scene of a first user, and acquiring, based on data from a server end, a portrait of a second user who is in a video call with the first user; and synthesizing the portrait of the second user with the video frame of the local scene of the first user, and displaying the synthesized video frame on the video call interface.
Some embodiments of the present application provide a video call method, the method comprising: a server end receiving a video frame of a second user sent by a terminal of the second user; and sending, according to the video frame of the second user, a portrait of the second user to a terminal of a first user who is in a video call with the second user, so that the terminal of the first user synthesizes the portrait of the second user with a video frame of the local scene of the first user and displays the synthesized video frame on the video call interface.
A device, comprising: one or more processors; and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the claims.
A storage medium containing computer executable instructions which, when executed by a computer processor, perform the method of any of the claims.
As can be seen from the above technical solutions, the above embodiments of the present application synthesize the portrait of either user in a video call into the video frame of the other user's local scene. Compared with the prior-art approach of displaying the video image recorded by one party's camera in its entirety in the other party's video call interface, the above embodiments of the present application can create a more realistic call environment for both parties of the video call and improve the video call experience.
[Description of the Drawings]
FIG. 1 is an architecture diagram of a video call provided by some embodiments of the present application;
FIG. 2 is a flowchart of a method for performing a video call by a terminal according to some embodiments of the present application;
FIG. 3 is a flowchart of a method for performing a video call by a server according to some embodiments of the present application;
FIG. 4 is an interaction diagram of a video call provided by a system including a terminal and a server according to some embodiments of the present application;
FIG. 5 is a block diagram of a computer system/server provided by some embodiments of the present application.
[Detailed Description]
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
The terms used in the following embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "an", "the", and "said" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may indicate three cases: A exists alone, A and B both exist, and B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
The core idea of some embodiments of the present application includes: when a user makes a video call, what is displayed in the user's video call interface is a video frame in which the other party's portrait has been synthesized into this user's local scene; both parties of the video call converse on the basis of such synthesized video frames. By adopting this manner, some embodiments of the present application can provide a more realistic call environment and improve the user's video call experience. Some embodiments of the present application may implement a video call based on the following architecture. As shown in FIG. 1, the architecture includes a server end and a first terminal, a second terminal, ..., an nth terminal. In some embodiments of the present application, a terminal may include user equipment, such as mobile user equipment (e.g. a mobile phone or a tablet computer) or fixed user equipment (e.g. a desktop computer); in some embodiments, a terminal may include software or a client running on the user equipment, such as a third-party application, a program or application that comes with the user equipment's system, or a functional unit such as a plug-in within an application or a software development kit (SDK). The server may be a centralized server or a distributed server; in some embodiments of the present application, the server includes a service server that serves the video call service.

It can be understood that the number of terminals is not limited in this application; that is, some embodiments of the present application can implement either a two-person video call or a multi-person video call. In some of the following embodiments of the present application, a two-person video call is taken as an example for description.
FIG. 2 is a flowchart of a method for a video call performed by a terminal according to some embodiments of the present application. As shown in FIG. 2, the method includes:
In 201, the terminal collects a video frame of the first user's local scene through a first camera, and acquires, based on data from the server end, a portrait of a second user who is in a video call with the first user.
It can be understood that if user A and user B make a video call, then for user A, user A is the first user and user B is the second user; likewise, for user B, user B is the first user and user A is the second user.
In this step, the terminal collects the video frame of the first user's local scene through the first camera of the terminal device. That is, for user A, user A's terminal collects the video frame of user A's local scene through the first camera of user A's terminal device; and for user B, user B's terminal collects the video frame of user B's local scene through the first camera of user B's terminal device.
Meanwhile, in this step, the terminal acquires, based on data from the server end, the portrait of the second user who is in a video call with the first user. The data from the server end may directly be the second user's portrait. That is, for user A, user A's terminal receives user B's portrait from the server end; and for user B, user B's terminal receives user A's portrait from the server end. It can be understood that the terminal collects the first user's portrait through a second camera and sends it to the server end, and the server end then sends to the terminal the portrait of the second user who is in a video call with the first user. The data from the server end may also be a video frame of the second user. That is, for user A, user A's terminal receives user B's video frame from the server end, and user A's terminal extracts user B's portrait from the received video frame; and for user B, user B's terminal receives user A's video frame from the server end, and user B's terminal extracts user A's portrait from the received video frame.
The first camera may be the rear camera of the terminal device, and the second camera may be the front camera of the terminal device. That is, in some embodiments of the present application, the front camera of the terminal device collects the video frame of the first user, and the rear camera of the terminal device collects the video frame of the first user's local scene.
Optionally, when the terminal sends the first user's portrait to the server end, the terminal may extract the first user's portrait from the first user's video frame collected by the second camera, and then send the extracted portrait to the server end. Alternatively, the terminal may directly send the first user's video frame collected by the second camera to the server end, for the server end to extract the first user's portrait from the video frame.
In 202, the second user's portrait is synthesized with the video frame of the first user's local scene, and the synthesized video frame is displayed on the video call interface.
In this step, according to the video frame of the first user's local scene collected by the terminal and the portrait of the second user (who is in a video call with the first user) acquired based on the data from the server end, the second user's portrait is synthesized with the video frame of the first user's local scene, so that the user makes the video call based on the synthesized video frames.
In some embodiments of the present application, when synthesizing the second user's portrait with the video frame of the first user's local scene, the following manner may be adopted: determine the display size of the second user's portrait according to the eye distance in the second user's portrait, and synthesize the second user's portrait with the video frame of the first user's local scene according to the determined display size. The eye distance of the second user's portrait may be sent by the server end to the terminal. It can be understood that, instead of adjusting the display size of the second user's portrait based on its eye distance, the portrait may also be resized directly according to a preset ratio.
Optionally, when determining the display size of the second user's portrait according to the portrait's eye distance, the following manner may be adopted: the terminal acquires the screen size of the terminal device, and determines the display size of the second user's portrait according to the relationship between the portrait's eye distance and the screen size. The terminal may determine the screen size of the terminal device according to attribute information of the terminal device, for example its model information. For example, if the terminal device is an iPhone 7, it can be determined that its screen size is 4.7 inches. In addition, it can be understood that the unit of the acquired eye distance may be the same as the default unit of the screen size, for example both in inches; the units may also be converted to agree, for example, if the eye distance is in centimeters and the screen size is in inches, centimeters may be converted to inches, or inches to centimeters.
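The unit conversion and sizing step above can be sketched as a small helper. This is a minimal illustrative sketch, not the patent's actual implementation: the function names and the target ratio of on-screen eye distance to screen size are assumed values.

```python
# Hypothetical sketch: convert the received eye distance into the screen's
# unit, then derive a scale factor so the portrait's on-screen eye distance
# becomes a fixed fraction (target_ratio, an assumed constant) of the screen size.
CM_PER_INCH = 2.54

def cm_to_inches(length_cm: float) -> float:
    """Convert a length in centimeters to inches."""
    return length_cm / CM_PER_INCH

def display_scale(eye_distance_inch: float, screen_inch: float,
                  target_ratio: float = 0.15) -> float:
    """Scale factor to apply to the second user's portrait so that its
    eye distance occupies target_ratio of the screen size."""
    return (target_ratio * screen_inch) / eye_distance_inch

# e.g. an eye distance reported as 2.54 cm, displayed on a 4.7-inch screen
print(round(display_scale(cm_to_inches(2.54), 4.7), 3))  # 0.705
```

A terminal would multiply the portrait bitmap's width and height by this scale before compositing.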
In some embodiments, the display size of the second user's portrait and the portion of the portrait to be displayed may be determined according to the relationship between the portrait's eye distance and the screen size. For example, if the portrait's eye distance is greater than E% (e.g. 20%) of the screen size, only the part of the second user's body within F times (e.g. 3 times) the eye-distance length below the top of the head is displayed, and the rest is not displayed; if the portrait's eye distance is less than or equal to E% (e.g. 20%) of the screen size, only the part of the body within G times (e.g. 4 times) the eye-distance length below the top of the head is displayed, and the rest is not displayed. In the embodiments of the present application, E, F, and G are all preset values, and preferably the value of F is smaller than the value of G. It can be understood that when the portrait's eye distance is too large or too small, the second user's portrait may also be scaled so that it can be displayed normally on the first user's terminal screen. For example, if the eye distance is so large that it exceeds the screen size (e.g. the eye distance is twice the screen size, so the screen cannot display the portrait), the second user's portrait is scaled down, for example to half its original size, before being displayed. If the eye distance is too small, the second user's portrait may be scaled up, for example to twice its original size, before being displayed.
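The E%/F/G rule above can be sketched as follows. This is an illustrative sketch under assumptions: E, F, and G take the example values from the text, the oversized case is handled by repeated halving as in the example, and all names are hypothetical.

```python
# Illustrative sketch of the E%/F/G cropping-and-scaling rule described above.
def fit_portrait(eye_distance: float, screen_size: float,
                 e: float = 20.0, f: float = 3.0, g: float = 4.0):
    """Return (scale, visible_height): the zoom applied to the portrait and
    how much of the body below the top of the head to show, measured in the
    scaled eye-distance length."""
    scale = 1.0
    # An oversized portrait is halved until its eye distance fits the screen.
    while eye_distance * scale > screen_size:
        scale *= 0.5
    eye = eye_distance * scale
    # Large face relative to the screen: show F eye-lengths of body; else G.
    multiplier = f if eye > screen_size * e / 100.0 else g
    return scale, multiplier * eye

print(fit_portrait(2.0, 4.7))   # (1.0, 6.0)  - fits as-is, large face -> F = 3
print(fit_portrait(10.0, 4.7))  # (0.25, 7.5) - halved twice, then F = 3
```

The small-face scale-up case (e.g. doubling) could be added symmetrically with a lower bound on `eye`.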
In some embodiments of the present application, when synthesizing the second user's portrait with the video frame of the first user's local scene, the following manner may also be adopted: first, select N pixels from the second user's portrait and M pixels from the video frame of the first user's local scene, where N and M are positive integers greater than 0; the pixels may be selected randomly or at preset positions. Then compute the intermediate color between the selected N pixels and M pixels, and stroke (outline) the second user's portrait based on the computed intermediate color, so that the second user's portrait blends better into the video frame of the first user's local scene when superimposed; finally, synthesize the stroked portrait of the second user with the video frame of the first user's local scene. It can be understood that the second user's portrait may also be superimposed directly on the video frame of the first user's local scene without stroking. In other words, the intermediate color corresponding to the second user's portrait and the video frame of the first user's local scene is obtained, the second user's portrait is stroked with the obtained intermediate color, and the stroked portrait is synthesized with the video frame of the first user's local scene.
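The intermediate-color step above can be sketched as follows. This is a hypothetical sketch: the patent does not specify the averaging formula, so a plain per-channel mean over the sampled pixels is assumed, and all names are illustrative.

```python
# Hypothetical sketch of the stroke-color step: sample N portrait pixels and
# M scene pixels, and average them into one intermediate RGB color that will
# be used to outline the portrait.
import random

def intermediate_color(portrait_pixels, scene_pixels, n=16, m=16, seed=0):
    """Average n random portrait pixels and m random scene pixels
    (RGB tuples) into a single intermediate color."""
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    samples = (rng.sample(portrait_pixels, min(n, len(portrait_pixels)))
               + rng.sample(scene_pixels, min(m, len(scene_pixels))))
    return tuple(sum(px[i] for px in samples) // len(samples)
                 for i in range(3))

# A dark portrait against a bright scene yields a mid-gray stroke color.
print(intermediate_color([(0, 0, 0)], [(255, 255, 255)]))  # (127, 127, 127)
```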
In addition, when superimposing the second user's portrait, the portrait may be synthesized with the video frame of the first user's local scene at a preset position. That is to say, in this embodiment the second user's portrait is placed at a suitable position in the video frame of the first user's local scene, rather than placed arbitrarily. For example, if the preset position is the middle of the bottom edge of the video frame of the first user's local scene, the second user's portrait is superimposed, centered, on the bottom edge of that video frame; the preset position may also be the left end, the right end, etc. of the video frame of the first user's local scene. The present application does not limit the superimposing position of the second user's portrait.
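The preset-position rule above reduces to computing an overlay coordinate. A minimal sketch, with assumed anchor names and a top-left pixel convention:

```python
# Illustrative sketch: top-left pixel at which to overlay the portrait on the
# scene frame, for a few preset anchor positions like those described above.
def paste_position(frame_w: int, frame_h: int,
                   portrait_w: int, portrait_h: int,
                   anchor: str = "bottom-center") -> tuple:
    """Return (x, y) of the portrait's top-left corner within the frame."""
    if anchor == "bottom-center":
        return ((frame_w - portrait_w) // 2, frame_h - portrait_h)
    if anchor == "bottom-left":
        return (0, frame_h - portrait_h)
    if anchor == "bottom-right":
        return (frame_w - portrait_w, frame_h - portrait_h)
    raise ValueError(f"unknown anchor: {anchor}")

# A 400x600 portrait centered on the bottom edge of a 1920x1080 frame:
print(paste_position(1920, 1080, 400, 600))  # (760, 480)
```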
After the second user's portrait is synthesized with the video frame of the first user's local scene, the synthesized video frame is displayed on the video call interface, and both parties of the video call converse based on their respectively synthesized video frames, thereby improving the call experience of the video call.
FIG. 3 is a flowchart of a method for a video call performed at the server end according to an embodiment of the present application. As shown in FIG. 3, the method includes:
In 301, the server end receives the second user's video frame sent by the second user's terminal.
In this step, the server end receives the second user's video frame sent by the second user's terminal. For example, user A and user B make a video call. For user A, the second user's video frame received by the server end is user B's video frame; likewise, for user B, the second user's video frame received by the server end is user A's video frame.
It can be understood that what the server end receives from the second user's terminal may be the second user's portrait already extracted by that terminal, or the second user's raw video frame sent by that terminal.
In 302, according to the second user's video frame, the second user's portrait is sent to the terminal of the first user who is in a video call with the second user, so that the first user's terminal synthesizes the second user's portrait with the video frame of the first user's local scene and displays the synthesized video frame on the video call interface.
In this step, if the server end has received the second user's video frame, it also needs to extract the second user's portrait from that video frame, and then send the extracted portrait to the terminal of the first user who is in a video call with the second user. In addition, after obtaining the second user's portrait, the server end may further detect the eye distance in the second user's portrait, and then provide the detected eye-distance information to the first user's terminal, so that the first user's terminal can adjust the display size of the second user's portrait according to the eye distance.
In this step, if the server end cannot detect the eye distance in the second user's portrait, it indicates that the second user's terminal has failed to accurately capture the second user's image, for example because the second user is not facing the camera or the camera is blocked. In this case, the server end returns prompt information to the second user's terminal, prompting the user to re-capture the portrait.
In this step, if multiple eye distances are detected in the second user's portrait, then after determining the eye distance that meets a preset requirement, the determined eye distance and the portrait of the second user corresponding to that eye distance may be sent to the first user's terminal. For example, the portrait of the second user with the largest of the multiple eye distances, together with that largest eye distance, is sent to the first user's terminal.
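The selection rule above can be sketched as follows, assuming the "preset requirement" is "largest eye distance" as in the example; the data shapes and names are illustrative.

```python
# Minimal sketch of the server-side selection step: among the faces detected
# in the second user's frame, keep the one with the largest eye distance.
def pick_portrait(detections):
    """detections: list of (eye_distance, portrait) pairs, one per detected
    face. Returns the pair with the largest eye distance, or None if nothing
    was detected (the server would then prompt the terminal to re-capture)."""
    if not detections:
        return None
    return max(detections, key=lambda d: d[0])

print(pick_portrait([(1.2, "face-1"), (2.5, "face-2"), (0.8, "face-3")]))
# (2.5, 'face-2')
```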
After the server end sends the second user's portrait to the first user's terminal, the first user's terminal synthesizes the received portrait of the second user with the video frame of the first user's local scene.
The above process is illustrated with an example below. User A and user B make a video call; for user A, user A is the first user and user B is the second user, and likewise, for user B, user B is the first user and user A is the second user. Suppose user A's terminal is terminal UA and user B's terminal is terminal UB. Terminal UA sends user A's user image IA to the server end after acquiring it, and terminal UB sends user B's user image IB to the server end after acquiring it. After obtaining user image IA and user image IB, the server end extracts user portrait Ia from user image IA and user portrait Ib from user image IB, then sends the extracted user portrait Ia to terminal UB and the extracted user portrait Ib to terminal UA. Terminal UA then performs synthesis based on the received user portrait Ib, and terminal UB performs synthesis based on the received user portrait Ia.
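The exchange in this example can be sketched as a routing function on the server end. This is a hypothetical sketch: the dict-based interface and the extractor callback are assumptions for illustration, not the patent's actual server API.

```python
# Hypothetical sketch: the server end extracts a portrait from each party's
# uploaded frame and routes it to the opposite party's terminal.
def route_portraits(frames, extract):
    """frames: {terminal_id: raw frame} for the two call parties.
    extract: callable that pulls the user's portrait out of a frame.
    Returns {terminal_id: peer's extracted portrait} to push to each terminal."""
    users = list(frames)
    if len(users) != 2:
        raise ValueError("this sketch handles a two-person call only")
    a, b = users
    return {a: extract(frames[b]), b: extract(frames[a])}

# Terminal UA uploads image IA, terminal UB uploads image IB:
out = route_portraits({"UA": "IA", "UB": "IB"},
                      lambda img: "portrait(" + img + ")")
print(out)  # {'UA': 'portrait(IB)', 'UB': 'portrait(IA)'}
```

A multi-person call would instead send each terminal the portraits of all of its peers.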
FIG. 4 is an interaction flowchart of a video call according to an embodiment of the present application.
As shown in FIG. 4, user A makes a video call with user B; the terminal corresponding to user A is terminal UA, and the terminal corresponding to user B is terminal UB. First, terminal UA may use the front camera of user A's terminal device to collect user A's video frame, and use the rear camera of the terminal device to collect the video frame of user A's local scene; terminal UB may use the front camera of user B's terminal device to collect user B's video frame, and use the rear camera of the terminal device to collect the video frame of user B's local scene. Terminal UA and terminal UB then send the collected video frames of user A and user B, respectively, to the server end. The server end processes the received video frames of user A and user B, and extracts user A's portrait Ia and user B's portrait Ib. The server end sends the extracted portrait Ia of user A to terminal UB, and the extracted portrait Ib of user B to terminal UA. Terminal UA synthesizes user B's portrait Ib sent by the server end with the video frame of user A's local scene, and terminal UB synthesizes user A's portrait Ia sent by the server end with the video frame of user B's local scene. Therefore, user A and user B each make the video call based on the synthesized images, making the video call more realistic.
FIG. 5 shows a block diagram of an exemplary computer system/server 012 suitable for implementing some embodiments of the present application. The computer system/server 012 shown in FIG. 5 is merely an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
如图5所示,计算机系统/服务器012以通用计算设备的形式表现。计算机系统/服务器012的组件可以包括但不限于:一个或者多个处理器或者处理单元016,系统存储器028,连接不同系统组件(包括系统存储器028和处理单元016)的总线018。As shown in Figure 5, computer system/server 012 is represented in the form of a general purpose computing device. Components of computer system/server 012 may include, but are not limited to, one or more processors or processing units 016, system memory 028, and bus 018 that connects different system components, including system memory 028 and processing unit 016.
The bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. Such media may be any available media accessible to the computer system/server 012, including volatile and non-volatile media, and removable and non-removable media.
The system memory 028 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 034 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such cases, each drive may be connected to the bus 018 via one or more data media interfaces. The memory 028 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 040 having a set (at least one) of program modules 042 may be stored, for example, in the memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 042 generally carry out the functions and/or methods of the embodiments described in the present application.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, a pointing device, a display 024, etc.); in some embodiments of the present application, the computer system/server 012 communicates with an external radar device. It may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may take place via input/output (I/O) interfaces 022. Furthermore, the computer system/server 012 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 020. As shown, the network adapter 020 communicates with the other modules of the computer system/server 012 via the bus 018. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 016 executes various functional applications and performs data processing by running programs stored in the system memory 028, for example implementing a video call method that may include:
a terminal capturing video frames of a first user's local scene through a first camera, and acquiring, based on data from a server side, a portrait of a second user who is in a video call with the first user;
compositing the portrait of the second user with the video frames of the first user's local scene, and displaying the resulting video frames on the video call interface.
A video call method may also be implemented that includes:
a server side receiving video frames of a second user sent by the second user's terminal;
based on the video frames of the second user, sending the portrait of the second user to the terminal of a first user who is in a video call with the second user, so that the first user's terminal composites the portrait of the second user with video frames of the first user's local scene and displays the resulting video frames on the video call interface.
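The server-side portrait extraction above can be sketched with a simple background-subtraction matte. This is an illustrative stand-in only; the application does not specify the matting algorithm, and the clean background plate and the threshold of 60 are assumptions of the sketch.

```python
import numpy as np

def extract_portrait(frame, background):
    """Matte out the portrait by differencing against a known clean
    background plate of the same size.

    frame, background: HxWx3 uint8 arrays of the same shape
    Returns (portrait, mask): portrait pixels and a 0/1 float mask.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    # A pixel belongs to the portrait if its summed channel difference
    # exceeds an assumed threshold of 60.
    mask = (diff.sum(axis=2) > 60).astype(np.float32)
    portrait = frame * mask[..., None].astype(np.uint8)
    return portrait, mask

# Demo: flat gray background; the "person" occupies the center pixel.
bg = np.full((3, 3, 3), 100, dtype=np.uint8)
frame = bg.copy()
frame[1, 1] = [200, 50, 50]
portrait, mask = extract_portrait(frame, bg)
```

The resulting mask is exactly what the compositing step on the first user's terminal needs to blend the portrait into the local scene.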
The above computer program may be provided in a computer storage medium; that is, the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the present application. For example, the method flow executed by the one or more processors may include:
a terminal capturing video frames of a first user's local scene through a first camera, and acquiring, based on data from a server side, a portrait of a second user who is in a video call with the first user;
compositing the portrait of the second user with the video frames of the first user's local scene, and displaying the resulting video frames on the video call interface.
It may also include:
a server side receiving video frames of a second user sent by the second user's terminal;
based on the video frames of the second user, sending the portrait of the second user to the terminal of a first user who is in a video call with the second user, so that the first user's terminal composites the portrait of the second user with video frames of the first user's local scene and displays the resulting video frames on the video call interface.
As time and technology progress, the meaning of "medium" has become increasingly broad; the distribution of a computer program is no longer limited to tangible media, and a program may also be downloaded directly from a network, among other channels. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code carried therein. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium, other than a computer-readable storage medium, that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out the operations of the present application may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet via an Internet service provider).
With the technical solution provided by the present application, superimposing the portrait of one party of a video call onto the video frames of the other party's local scene creates a more immersive call environment for both parties, thereby achieving the purpose of improving the effect of the video call.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and other divisions are possible in actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (15)

  1. A video call method, wherein the method comprises:
    a terminal capturing video frames of a first user's local scene through a first camera, and acquiring, based on data from a server side, a portrait of a second user who is in a video call with the first user;
    compositing the portrait of the second user with the video frames of the first user's local scene, and displaying the resulting video frames on a video call interface.
  2. The method according to claim 1, wherein the method further comprises:
    the terminal capturing a portrait of the first user through a second camera and sending it to the server side.
  3. The method according to claim 2, wherein the first camera is a rear camera and the second camera is a front camera.
  4. The method according to claim 2, wherein the terminal capturing a portrait of the first user through a second camera and sending it to the server side comprises:
    the terminal capturing video frames of the first user through the second camera, extracting the portrait from the video frames, and sending the extracted portrait to the server side; or,
    the terminal capturing video frames of the first user through the second camera and sending the video frames to the server side, for the server side to extract the portrait from the video frames.
  5. The method according to claim 1, wherein compositing the portrait of the second user with the video frames of the first user's local scene comprises:
    determining a display size for the portrait of the second user according to the eye distance of the portrait of the second user;
    compositing the portrait of the second user with the video frames of the first user's local scene according to the determined display size.
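As an illustration only (not part of the claims), the eye-distance-based sizing of claim 5 could be sketched as below. The 60 px reference eye distance and the linear scaling rule are assumptions of the sketch; the claim does not fix a particular mapping from eye distance to display size.

```python
def portrait_display_size(portrait_w, portrait_h, eye_distance_px,
                          reference_eye_distance_px=60.0):
    """Scale a portrait so its apparent eye distance matches a reference.

    A larger measured eye distance means the face fills more of the
    captured frame, so the portrait is scaled down proportionally
    (and vice versa). The 60 px reference is an assumed constant.
    """
    scale = reference_eye_distance_px / eye_distance_px
    return round(portrait_w * scale), round(portrait_h * scale)

# A 400x600 portrait captured with a 120 px eye distance is shown at half size.
w, h = portrait_display_size(400, 600, eye_distance_px=120.0)
```

Normalizing by eye distance keeps the remote user's face at a stable, natural size in the composited scene regardless of how close they sit to their camera.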
  6. The method according to claim 1, wherein compositing the portrait of the second user with the video frames of the first user's local scene comprises:
    selecting N pixels from the portrait of the second user, and selecting M pixels from the video frames of the first user's local scene, where N and M are positive integers greater than 0;
    computing an intermediate color between the selected N pixels and the M pixels, and stroking the portrait of the second user based on the computed intermediate color;
    compositing the stroked portrait of the second user with the video frames of the first user's local scene.
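As an illustration only (not part of the claims), one way to realize the intermediate color of claim 6 is to average the sampled pixels from both images. The plain mean is an assumed choice for the sketch, since the claim does not specify how the intermediate color is computed or how the stroke is drawn.

```python
def intermediate_color(portrait_pixels, scene_pixels):
    """Average N portrait pixels and M local-scene pixels into one RGB
    stroke color. Each pixel is an (r, g, b) tuple; the mean of the
    combined samples lies between the colors of the two images, so an
    outline drawn with it softens the portrait/background boundary.
    """
    samples = list(portrait_pixels) + list(scene_pixels)
    n = len(samples)
    return tuple(round(sum(p[ch] for p in samples) / n) for ch in range(3))

# N = 2 samples from a red portrait, M = 2 samples from a blue scene.
color = intermediate_color([(255, 0, 0), (255, 0, 0)],
                           [(0, 0, 255), (0, 0, 255)])
```

The resulting purple falls between the red portrait and the blue scene, which is what makes the stroke blend rather than ring the portrait with a visibly foreign color.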
  7. The method according to claim 1, wherein compositing the portrait of the second user with the video frames of the first user's local scene comprises:
    superimposing the portrait of the second user onto the video frames of the first user's local scene at a preset position.
  8. A video call method, wherein the method comprises:
    a server side receiving video frames of a second user sent by the second user's terminal;
    based on the video frames of the second user, sending the portrait of the second user to the terminal of a first user who is in a video call with the second user, so that the first user's terminal composites the portrait of the second user with video frames of the first user's local scene and displays the resulting video frames on a video call interface.
  9. The method according to claim 8, wherein the video frames of the second user are the portrait of the second user as extracted by the second user's terminal.
  10. The method according to claim 8, wherein, based on the video frames of the second user, sending the portrait of the second user to the terminal of the first user who is in a video call with the second user comprises:
    extracting the portrait of the second user from the video frames of the second user;
    sending the extracted portrait of the second user to the terminal of the first user who is in a video call with the second user.
  11. The method according to claim 10, wherein the method further comprises:
    detecting the eye distance of the portrait of the second user, and providing the detected eye-distance information to the first user's terminal.
  12. The method according to claim 11, wherein the method further comprises:
    if the server side cannot detect the eye distance of the portrait of the second user, returning to the second user's terminal a prompt indicating that the eye distance of the second user's portrait cannot be obtained.
  13. The method according to claim 11, wherein the method further comprises:
    if multiple eye distances are detected for the second user, determining an eye distance that satisfies a preset requirement, and then sending the determined eye distance and the portrait of the second user corresponding to that eye distance to the first user's terminal.
  14. A device, wherein the device comprises:
    one or more processors; and
    a storage apparatus for storing one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-13.
  15. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method according to any one of claims 1-13.
PCT/CN2018/124933 2018-03-29 2019-01-31 Video call method and device, and computer storage medium WO2019184499A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810272244.3 2018-03-29
CN201810272244.3A CN108259810A (en) 2018-03-29 2018-03-29 A kind of method of video calling, equipment and computer storage media

Publications (1)

Publication Number Publication Date
WO2019184499A1 true WO2019184499A1 (en) 2019-10-03

Family

ID=62746463

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124933 WO2019184499A1 (en) 2018-03-29 2019-01-31 Video call method and device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN108259810A (en)
WO (1) WO2019184499A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111669534A (en) * 2020-04-28 2020-09-15 视联动力信息技术股份有限公司 Communication method, first client and instant communication system
CN113473239A (en) * 2020-07-15 2021-10-01 青岛海信电子产业控股股份有限公司 Intelligent terminal, server and image processing method

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN108259810A (en) * 2018-03-29 2018-07-06 上海掌门科技有限公司 A kind of method of video calling, equipment and computer storage media
CN110536075B (en) * 2019-09-20 2023-02-21 上海掌门科技有限公司 Video generation method and device
WO2022001635A1 (en) * 2020-07-03 2022-01-06 海信视像科技股份有限公司 Display device and display method
US20240007590A1 (en) * 2020-09-30 2024-01-04 Beijing Zitiao Network Technology Co., Ltd. Image processing method and apparatus, and electronic device, and computer readable medium
CN112363658B (en) * 2020-10-27 2022-08-12 维沃移动通信有限公司 Interaction method and device for video call
CN114915722B (en) * 2021-02-09 2023-08-22 华为技术有限公司 Method and device for processing video
US20240185530A1 (en) * 2021-03-30 2024-06-06 Beijing Boe Technology Development Co., Ltd. Information interaction method, computer-readable storage medium and communication terminal
CN116941235A (en) * 2021-04-26 2023-10-24 深圳市大疆创新科技有限公司 Shooting method, control device and shooting equipment storage medium
CN115250340A (en) * 2021-04-26 2022-10-28 海信集团控股股份有限公司 MV recording method and display device
CN115484466A (en) * 2021-05-31 2022-12-16 海信集团控股股份有限公司 Display method and server for on-line singing video
CN113973178A (en) * 2021-10-24 2022-01-25 云景文旅科技有限公司 Interactive photographing processing method and device in travel process

Citations (7)

Publication number Priority date Publication date Assignee Title
CN101039200A (en) * 2006-03-13 2007-09-19 阿尔卡特朗讯公司 Context enriched communication system and method
CN101562682A (en) * 2008-04-14 2009-10-21 鸿富锦精密工业(深圳)有限公司 Video image processing system, server, user side and video image processing method thereof
CN103716537A (en) * 2013-12-18 2014-04-09 宇龙计算机通信科技(深圳)有限公司 Photograph synthesizing method and terminal
US20150091891A1 (en) * 2013-09-30 2015-04-02 Dumedia, Inc. System and method for non-holographic teleportation
CN106331569A (en) * 2016-08-23 2017-01-11 广州华多网络科技有限公司 Method and system for transforming figure face in instant video picture
CN106534716A (en) * 2016-11-17 2017-03-22 三星电子(中国)研发中心 Methods for transmitting and displaying panoramic videos
CN108259810A (en) * 2018-03-29 2018-07-06 上海掌门科技有限公司 A kind of method of video calling, equipment and computer storage media

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN101350845B (en) * 2007-07-20 2012-05-09 中兴通讯股份有限公司 Method for simulating talking scene of mobile phone visible telephone
CN101610421B (en) * 2008-06-17 2011-12-21 华为终端有限公司 Video communication method, video communication device and video communication system
CN101677386A (en) * 2008-08-01 2010-03-24 中兴通讯股份有限公司 System capable of selecting real-time virtual call background and video call method
CN101547333A (en) * 2009-04-20 2009-09-30 中兴通讯股份有限公司 Method and terminal for switching front and back scene during viewable call
CN101931621A (en) * 2010-06-07 2010-12-29 上海那里网络科技有限公司 Device and method for carrying out emotional communication in virtue of fictional character
CN102307292A (en) * 2011-09-01 2012-01-04 宇龙计算机通信科技(深圳)有限公司 Visual communication method and visual terminal
CN103686050A (en) * 2012-09-18 2014-03-26 联想(北京)有限公司 Method and electronic equipment for simulating call scenes

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN101039200A (en) * 2006-03-13 2007-09-19 阿尔卡特朗讯公司 Context enriched communication system and method
CN101562682A (en) * 2008-04-14 2009-10-21 鸿富锦精密工业(深圳)有限公司 Video image processing system, server, user side and video image processing method thereof
US20150091891A1 (en) * 2013-09-30 2015-04-02 Dumedia, Inc. System and method for non-holographic teleportation
CN103716537A (en) * 2013-12-18 2014-04-09 宇龙计算机通信科技(深圳)有限公司 Photograph synthesizing method and terminal
CN106331569A (en) * 2016-08-23 2017-01-11 广州华多网络科技有限公司 Method and system for transforming figure face in instant video picture
CN106534716A (en) * 2016-11-17 2017-03-22 三星电子(中国)研发中心 Methods for transmitting and displaying panoramic videos
CN108259810A (en) * 2018-03-29 2018-07-06 上海掌门科技有限公司 A kind of method of video calling, equipment and computer storage media

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN111669534A (en) * 2020-04-28 2020-09-15 视联动力信息技术股份有限公司 Communication method, first client and instant communication system
CN113473239A (en) * 2020-07-15 2021-10-01 青岛海信电子产业控股股份有限公司 Intelligent terminal, server and image processing method
CN113473239B (en) * 2020-07-15 2023-10-13 青岛海信电子产业控股股份有限公司 Intelligent terminal, server and image processing method

Also Published As

Publication number Publication date
CN108259810A (en) 2018-07-06

Similar Documents

Publication Publication Date Title
WO2019184499A1 (en) Video call method and device, and computer storage medium
CN109635621B (en) System and method for recognizing gestures based on deep learning in first-person perspective
CN109313812B (en) Shared experience with contextual enhancements
CN108933915B (en) Video conference device and video conference management method
CN111541845B (en) Image processing method and device and electronic equipment
CN106575361B (en) Method for providing visual sound image and electronic equipment for implementing the method
US20190222806A1 (en) Communication system and method
WO2021057267A1 (en) Image processing method and terminal device
US11527242B2 (en) Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
JP6986187B2 (en) Person identification methods, devices, electronic devices, storage media, and programs
KR101768532B1 (en) System and method for video call using augmented reality
CN108307106B (en) Image processing method and device and mobile terminal
CN111629253A (en) Video processing method and device, computer readable storage medium and electronic equipment
US11556605B2 (en) Search method, device and storage medium
US20200162792A1 (en) Generating an interactive digital video content item
WO2019184498A1 (en) Video interactive method, computer device and storage medium
JP2016181018A (en) Information processing system and information processing method
CN109766006B (en) Virtual reality scene display method, device and equipment
CN110673811A (en) Panoramic picture display method and device based on sound information positioning and storage medium
CN112887654B (en) Conference equipment, conference system and data processing method
CN111985252A (en) Dialogue translation method and device, storage medium and electronic equipment
US11756302B1 (en) Managing presentation of subject-based segmented video feed on a receiving device
WO2022151687A1 (en) Group photo image generation method and apparatus, device, storage medium, computer program, and product
JP2020112895A (en) Control program of information processing apparatus, control method of information processing apparatus, and information processing apparatus
CN108932142A (en) A kind of picture catching method and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18911653

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18911653

Country of ref document: EP

Kind code of ref document: A1