CN112312063A

CN112312063A - Multimedia communication method and device

Info

Publication number: CN112312063A
Application number: CN202011194154.0A
Authority: CN
Inventors: 施国庆
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-02-02

Abstract

The application discloses a multimedia communication method and device, and belongs to the field of mobile communication. The method comprises the following steps: determining a first start time of a first speech of a first user in a target talk group, wherein the target talk group comprises at least the first user and a second user; and if the interval between the first start time and a first end time of a second speech is smaller than a first preset threshold, delaying the playing time of the first speech, wherein the second speech is a speech of the second user, and the first start time is later than the first end time. The embodiment of the application solves the problem in the prior art that speech conflicts easily occur during a teleconference.

Description

Multimedia communication method and device

Technical Field

The application belongs to the field of mobile communication, and particularly relates to a multimedia communication method and device.

Background

With the rapid development of mobile communication technology, mobile and non-mobile electronic devices have become indispensable tools in many aspects of people's lives. The functions of the various application programs (APPs) on electronic devices are gradually improving; they no longer serve communication alone but also provide users with various intelligent services, bringing great convenience to users' work and life.

At present, the teleconference has become a frequently used form of conference. Users hold multi-person teleconferences by logging in on their respective electronic devices; various teleconference APPs have gradually appeared, and the group function of some communication APPs can also be used for teleconferences. However, during a teleconference, several people often speak at the same time, which causes speaking conflicts and affects the conference.

Disclosure of Invention

An object of the embodiments of the present application is to provide a multimedia communication method and apparatus, which can solve the problem in the prior art that a speech conflict is likely to occur in a teleconference process.

In order to solve the technical problem, the present application is implemented as follows:

In a first aspect, an embodiment of the present application provides a multimedia call method, where the method includes:

determining a first start time of a first speech of a first user in a target talk group; wherein the target talk group comprises at least the first user and a second user;

if the interval between the first starting time and the first ending time of the second speech is smaller than a first preset threshold value, delaying the playing time of the first speech; wherein the second speech is a speech of the second user, and the first start time is later than the first end time.

Optionally, the method further comprises:

and if the number of the users in the speaking state in the target conversation group is greater than or equal to a second preset threshold value, controlling the audio acquisition functions of the microphones of the other users not in the speaking state to be in a closed state.

Optionally, the method further comprises:

and if the number of the users in the speaking state in the target conversation group is less than the second preset threshold value, controlling the audio acquisition functions of the microphones of all the users to be in an open state.

Optionally, the delaying the playing time of the first speech specifically includes:

acquiring the speaking duration of the second speech;

determining the playing time of the first speech according to a preset starting interval and the speaking duration, wherein the playing time is the difference between the preset starting interval and the speaking duration.

Optionally, the method further comprises:

acquiring speaking audios collected by microphones of all users, and determining time information of the speaking audios;

adjusting the speaking sequence of each speaking audio according to the time information of each speaking audio;

and playing each speaking audio according to the speaking sequence.

In a second aspect, an embodiment of the present application further provides a multimedia communication device, where the multimedia communication device includes:

the determining module is used for determining a first starting time of a first speech of a first user in the target conversation group; wherein the target talk group comprises at least the first user and a second user;

a delay module, configured to delay a play time of the first utterance if an interval between the first start time and a first end time of the second utterance is smaller than a first preset threshold; wherein the second speech is a speech of the second user, and the first start time is later than the first end time.

Optionally, the apparatus comprises:

and the first control module is used for controlling the audio acquisition functions of the microphones of other users who are not in the speaking state to be in a closed state if the number of the users who are in the speaking state in the target conversation group is greater than or equal to a second preset threshold value.

Optionally, the apparatus comprises:

and the second control module is used for controlling the audio acquisition functions of the microphones of all the users to be in an open state if the number of the users in the speaking state in the target conversation group is less than the second preset threshold value.

Optionally, the delay module comprises:

the obtaining submodule is used for obtaining the speaking duration of the second speech;

the determining submodule is used for determining the playing time of the first speech according to a preset starting interval and the speaking duration, wherein the playing time is the difference between the preset starting interval and the speaking duration.

Optionally, the apparatus comprises:

the information acquisition submodule is used for acquiring speaking audios collected by microphones of all users and determining time information of the speaking audios;

the adjusting module is used for adjusting the speaking sequence of each speaking audio according to the time information of each speaking audio;

and the playing sub-module is used for playing each speaking audio according to the speaking sequence.

In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a program or an instruction stored on the memory and executable on the processor, where the processor implements the steps in the multimedia call method as described above when executing the program or the instruction.

In a fourth aspect, the present application further provides a readable storage medium, on which a program or instructions are stored, and when the program or instructions are executed by a processor, the program or instructions implement the steps in the multimedia call method as described above.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In the embodiment of the application, a first start time of a first speech of a first user in a target talk group is determined; if the interval between the first start time and the first end time of the second speech is smaller than a first preset threshold, the playing time of the first speech is delayed. This avoids a speech conflict between the first user and the second user, lets the participants speak in an orderly manner, avoids the confusion that simultaneous speech by multiple users causes in a teleconference, group voice call, or group video call, and improves the participation effect.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart illustrating a multimedia call method according to an embodiment of the present application;

Fig. 2 is a flowchart illustrating a first example provided by an embodiment of the present application;

Fig. 3 is a block diagram of a multimedia communication device according to an embodiment of the present application;

Fig. 4 is a first block diagram of an electronic device provided by an embodiment of the present application;

Fig. 5 is a second block diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In various embodiments of the present application, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances, such that the embodiments of the application are capable of operation in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the preceding and succeeding related objects are in an "or" relationship.

The multimedia communication method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.

Referring to Fig. 1, an embodiment of the present application provides a multimedia call method, which is optionally applicable to electronic devices including various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, as well as various forms of mobile stations (MSs), terminal devices, and the like.

The electronic device is a server (or administrator) of the target talk group.

The method comprises the following steps:

step 101, determining a first starting time of a first speech of a first user in a target talk group; wherein the target talk group comprises at least the first user and a second user.

The multimedia call can be a video call or a voice call, a conference that includes a voice call, such as a teleconference, or a group video or group voice call held in a communication group of a communication APP.

Taking a teleconference as an example, a conference APP is usually an APP that multiple parties log in to, and a user can join a conference after entering a conference number in the conference APP. Before all the participants have arrived, the users can discuss freely; after the target state is entered, the server administrator can switch the free speaking mode to the conference mode. In the conference mode, multiple users can be prevented from speaking at the same time.

Upon detecting the first user speaking, such as detecting the first user turning on a microphone or detecting the user's audio, the server determines a first start time for the first utterance.

The target talk group is, for example, the group of participants of a teleconference, or a group voice or group video group. The second user is a user in the same group as the first user, for example a user participating in the same conference, or a user in the same group video or group voice call.

Step 102, if the interval between the first start time and the first end time of the second speech is smaller than a first preset threshold, delaying the play time of the first speech; wherein the second speech is a speech of the second user, and the first start time is later than the first end time.

If the interval between the first start time and the first end time of the second speech is smaller than the first preset threshold, the playing time of the first speech is delayed, which avoids a speech conflict between the first user and the second user. For example, the first preset threshold is T1 and the first start time is X. After the previous user (the second user) finishes speaking, the first user should start speaking only after T1 has elapsed. If the first user speaks within T1, the first user is speaking too soon, which gives the other participants the impression of hurried speech and degrades their listening experience. In this case, the speech of the first user is delayed so that the speech interval between the first user and the second user is not less than T1, which avoids the confusion that simultaneous speech by multiple users causes in a teleconference, group voice call, or group video call.

In the embodiment of the application, a first starting time of a first speech of a first user in a target conversation group is determined; if the interval between the first starting time and the first ending time of the second speech is smaller than a first preset threshold value, the playing time of the first speech is delayed, so that participants can speak orderly, the problem that a teleconference or a group voice or a group video process is disordered due to simultaneous speech of multiple persons is avoided, and the participation effect is improved. The embodiment of the application solves the problem that in the prior art, a speech conflict is easy to occur in the teleconference process.
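Steps 101 and 102 can be illustrated with a minimal Python sketch. The names `Speech` and `playback_time`, the use of seconds on a shared server clock, and the choice to postpone playback until exactly the threshold has elapsed after the second speech ends are all assumptions made for illustration; the embodiment only requires that the resulting interval be not less than the first preset threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Speech:
    """One utterance in the target talk group (hypothetical helper type)."""
    user: str
    start: float                 # start time, seconds on the server clock (assumed unit)
    end: Optional[float] = None  # end time, None while the speech is still in progress


def playback_time(first: Speech, second: Speech, threshold: float) -> float:
    """Return the time at which the first speech should start playing."""
    if second.end is None or first.start <= second.end:
        # Step 102 only covers a first speech that starts after the second one ended.
        raise ValueError("the first start time must be later than the first end time")
    interval = first.start - second.end
    if interval < threshold:
        # Too close to the previous speech: postpone playback so that the
        # interval equals the first preset threshold (assumed policy).
        return second.end + threshold
    return first.start
```

For example, with a threshold of 2 seconds, a first speech that starts 0.5 seconds after the second speech ends is played 2 seconds after that end, while a first speech that starts 3 seconds after the end is played without delay.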

In an optional embodiment, if the number of users in the speaking state in the target talk group is greater than or equal to a second preset threshold, controlling the audio acquisition functions of the microphones of the other users not in the speaking state to be in an off state.

For example, the second preset threshold is N, where N is the maximum number of users allowed to speak simultaneously. If at least N users are in the speaking state, closing the microphones of the users who are not speaking avoids a speaking conflict between those users and the users who are speaking.

In an optional embodiment, the method further comprises:

If the number of users in the speaking state is less than the second preset threshold, the audio acquisition functions of the microphones of all users are controlled to be in an open state, so as to ensure that the multimedia call proceeds normally. Other users are subsequently added to the speaking queue according to the speaking sequence until the number of users in the speaking state equals the second preset threshold, at which point the audio acquisition functions of the microphones of the users not in the speaking state are controlled to be in a closed state.
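The microphone control described in the two optional embodiments above can be sketched as follows. This is an illustrative sketch only: the function name `microphone_states` and the representation of users as strings are hypothetical, and `max_speakers` stands for the second preset threshold N.

```python
def microphone_states(speaking, members, max_speakers):
    """Map each member to True (audio acquisition on) or False (off).

    speaking:     set of users currently in the speaking state
    members:      set of all users in the target talk group
    max_speakers: the second preset threshold N
    """
    if len(speaking) >= max_speakers:
        # The limit is reached: only users who are already speaking keep
        # their microphones open; everyone else is closed.
        return {user: user in speaking for user in members}
    # Below the limit: every microphone is open.
    return {user: True for user in members}
```

With N = 2 and users a and b speaking, the microphone of a third user c is closed; once only a is speaking, all three microphones are open again.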

In an optional embodiment, the delaying the playing time of the first utterance specifically includes:

acquiring the speaking duration of the second speech;

The speaking order of the second utterance is adjacent to and before the first utterance;

for example, the preset starting interval is T1, and the speaking duration of the second speech is X. After the second speech ends, the first user should start speaking only after T1 has elapsed. If the first user speaks within T1, the first user is speaking too soon, which gives the other participants the impression of hurried speech and degrades their listening experience. In this case, the speech of the first user is delayed by T1-X so that the speaking interval is kept at T1, and the first speech is played after the delay ends.

In an optional embodiment, the method further comprises:

acquiring speaking audios collected by microphones of all users and acquiring time information of the speaking audios; the time information comprises at least a start time of the speech audio; the end time of the speech audio may also be included.

Adjusting the speaking sequence of each speaking audio according to the time information of each speaking audio: the speaking audios are sorted by their start times to obtain the speaking sequence of the users corresponding to the speaking audios.

Playing each speaking audio in sequence according to the speaking sequence.
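The ordering and sequential playback just described can be sketched in Python. The type `SpeechAudio`, the function `playback_schedule`, and the rule that a clip never starts playing before the previous clip has finished are assumptions for illustration; the embodiment itself only states that the audios are sorted by start time and played in that sequence.

```python
from typing import NamedTuple


class SpeechAudio(NamedTuple):
    """A captured speaking audio with its time information (hypothetical type)."""
    user: str
    start: float     # start time of the speaking audio, in seconds
    duration: float  # length of the audio, in seconds


def playback_schedule(audios):
    """Sort audios by start time and return (user, play_at) pairs without overlap."""
    schedule = []
    cursor = 0.0  # earliest time at which the next clip may start playing
    for clip in sorted(audios, key=lambda a: a.start):
        play_at = max(cursor, clip.start)
        schedule.append((clip.user, play_at))
        cursor = play_at + clip.duration
    return schedule
```

In this sketch, three overlapping audios from users A, B and C that start at 0 s, 1 s and 1.5 s are played one after another in that order instead of being mixed together.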

As a first example, referring to fig. 2, fig. 2 takes a teleconference as an example, and a multimedia call method provided in an embodiment of the present application mainly includes the following steps:

step 201, a first user and a second user of a target talk group join a teleconference.

The first user and the second user open the teleconference APP and join the teleconference after their voice devices are connected successfully; for example, after entering a conference number, a user joins the conference and checks his or her own voice device.

Before all the participants have arrived, the users can discuss freely; the administrator can then switch the free speaking mode to the conference mode. In the conference mode, multiple users can be prevented from speaking at the same time.

Step 202, entering a conference mode and starting monitoring.

Optionally, when the administrator of the conference turns on the conference mode, all users enter the monitoring mode, and detection of whether anyone is speaking in the conference starts. If another user is speaking, the start time and end time of that speech are recorded as t1 and t2, respectively. After t1 and t2 are recorded, if another user goes on to speak, t1 and t2 are recorded again. If the first user wants to speak and a user is already speaking in the conference, step 203 is executed; if no one is speaking, step 204 is executed.

Step 203, a speaking user exists in the conference;

when the second user is speaking and the first user wants to speak, the system detects that a user is already speaking, so the microphone of the first user cannot be turned on, which prevents other users from interfering with the speaking user. When no one is speaking in the conference, step 204 is executed.

Step 204, no speaking user exists in the conference;

the first user may be allowed to speak when no one is speaking in the conference.

Step 205, determine whether the first user is the first speaker.

The users in the conference record the start and end times of the speeches of the other users, i.e. t1 and t2. It can therefore be checked whether the time between the recorded t1 and t2 is 0. If it is 0, the first user is the first to speak and no user spoke before, so no delay processing is needed, and step 206 is executed. If the time between t1 and t2 is not 0, step 207 is executed.

Step 206, playing the first speech of the first user in real time;

the first speech of the first user is played in real time; after the user finishes speaking, the microphone is automatically turned off, the monitoring mode is entered again, and the start and end times of the speeches of other users, i.e. t1 and t2, continue to be recorded.

Step 207, playing the first speech of the first user in a delayed manner;

because the user is always monitoring the speaking times of other users, if the time between t1 and t2 is not 0, another user has already spoken before the current user. In this case a delay judgment is needed; that is, a delay adjustment is made according to the speaking time of the previous user and the delay time (the first preset threshold) set by the administrator, so that the speaking intervals of the users (the preset starting interval) are as uniform as possible. The specific strategy is as follows:

the current user has recorded the start and end times of the speech of the previous user, i.e. t1 and t2, and the administrator has set the first preset threshold of the conference mode to T.

After the previous user finishes speaking, the current user is expected to start speaking within the time interval T. If the user does not speak within this interval, the user is speaking slowly, so the delay policy is not used, and step 206 is executed.

If the user speaks within the preset starting interval, assume that the first start time is X. When X < T, the first speech comes sooner than expected and gives listeners the impression of hurried speech, so the speech of the current user is delayed by T-X so that the speaking interval is kept at T. After the delay ends, step 206 is executed.
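The decision made in steps 205 to 207 can be summarized in a short Python sketch. The function name `playback_delay` is hypothetical, and the sketch assumes that X is measured from the end time t2 of the previous speech to the moment the current user starts speaking, which the description does not state explicitly.

```python
def playback_delay(t1, t2, now, interval):
    """Return how long, in seconds, to hold back the current user's speech.

    t1, t2:   recorded start and end times of the previous speech
              (both 0.0 when nobody has spoken yet)
    now:      the moment the current user starts speaking
    interval: the first preset threshold T set by the administrator
    """
    if t2 - t1 == 0:
        # Step 205: nothing was recorded, so this is the first speaker (step 206).
        return 0.0
    elapsed = now - t2  # X, assumed to be measured from the end of the previous speech
    if elapsed >= interval:
        # The user waited at least T: play in real time (step 206).
        return 0.0
    # Step 207: X < T, so delay by T - X to keep the speaking interval at T.
    return interval - elapsed
```

With T = 2 seconds and a previous speech that ended at 10 s, a speech that begins at 10.5 s is held back by 1.5 seconds, whereas one that begins at 13 s is played immediately.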

At step 208, the participant makes multiple rounds of speech.

After several rounds of speaking, the speaking interval between users is relatively fixed, which improves the listeners' experience. If a user has not spoken for a long time, for example other users have entered the second round of speaking but the user has not yet started to speak, it can be considered that the user temporarily has no need to speak, and the user's microphone is turned off to avoid noise interference.
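Detecting users who have not spoken can be sketched as follows. The function `users_to_mute` and the log format are hypothetical, and the sketch assumes that "the second round" begins as soon as any user has spoken twice; the description leaves the exact criterion open.

```python
from collections import Counter


def users_to_mute(speech_log, members):
    """Return the members whose microphones can be turned off.

    speech_log: list of user names, one entry per finished speech, in order
    members:    set of all users in the conference
    """
    counts = Counter(speech_log)
    # Assumed criterion: a second round has started once someone has spoken twice.
    second_round_started = any(n >= 2 for n in counts.values())
    if not second_round_started:
        return set()
    # Users with no speech so far are treated as having no need to speak for now.
    return {user for user in members if counts[user] == 0}
```

For a conference with users a, b and c, the log a, b, a marks c as silent during the second round, so the microphone of c would be turned off.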

When a user wants to speed up the conference or needs to interrupt the speaking user, the user can send a reminder to the administrator or click the raise-hand icon; the administrator then exits the conference mode, and the conference returns to the free discussion state. After the conference is finished, the administrator can exit the conference mode, and the conference ends.

In this example, a plurality of users cannot interrupt a speaker in a conference, and the situation that a plurality of users speak at the same time does not occur; and all speak in turn in a relatively fixed time interval, thereby improving the hearing experience of the participants. In addition, when the conference starts, the user can speak for multiple turns, the server can automatically detect the user who does not speak for a long time, and the microphone is turned off, so that the influence of noise is avoided. The conference mode may be exited directly by the conference administrator when the conference time expires or special circumstances arise.

In an embodiment of the present application, a first start time of a first utterance of a first user in a target talk group is determined; if the interval between the first starting time and the first ending time of the second speech is smaller than a first preset threshold value, the playing time of the first speech is delayed, so that participants can speak orderly, the problem that a teleconference or a group voice or a group video process is disordered due to simultaneous speech of multiple persons is avoided, and the participation effect is improved.

In the foregoing, the multimedia communication method provided by the embodiment of the present application is introduced, and the multimedia communication device provided by the embodiment of the present application is described below with reference to the accompanying drawings.

It should be noted that, in the multimedia call method provided in the embodiment of the present application, the execution main body may be a multimedia call device, or a control module in the multimedia call device for executing the multimedia call method. In the embodiment of the present application, a multimedia communication device is taken as an example to execute a multimedia communication method, and the multimedia communication method provided in the embodiment of the present application is described.

Referring to fig. 3, an embodiment of the present application further provides a multimedia communication device 300, where the multimedia communication device 300 includes:

a determining module 301, configured to determine a first start time of a first speech of a first user in a target talk group; wherein the target talk group comprises at least the first user and a second user;

a delay module 302, configured to delay a playing time of the first utterance if an interval between the first start time and a first end time of the second utterance is smaller than a first preset threshold; wherein the second speech is a speech of the second user, and the first start time is later than the first end time.

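Optionally, in an embodiment of the present application, the apparatus 300 includes:

a first control module, configured to control the audio acquisition functions of the microphones of other users who are not in the speaking state to be in a closed state if the number of users in the speaking state in the target talk group is greater than or equal to a second preset threshold.

Optionally, in an embodiment of the present application, the apparatus 300 includes:

a second control module, configured to control the audio acquisition functions of the microphones of all users to be in an open state if the number of users in the speaking state in the target talk group is less than the second preset threshold.

Optionally, in this embodiment of the present application, the delay module 302 includes:

an obtaining submodule, configured to obtain the speaking duration of the second speech;

a determining submodule, configured to determine the playing time of the first speech according to a preset starting interval and the speaking duration, wherein the playing time is the difference between the preset starting interval and the speaking duration.

Optionally, in an embodiment of the present application, the apparatus 300 includes:

an information acquisition module, configured to acquire speaking audios collected by microphones of all users and determine time information of the speaking audios;

an adjusting module, configured to adjust the speaking sequence of each speaking audio according to the time information of each speaking audio;

a playing module, configured to play each speaking audio according to the speaking sequence.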

In an embodiment of the present application, the determining module 301 determines a first start time of a first speech of a first user in a target talk group; if the interval between the first start time and the first end time of the second speech is smaller than a first preset threshold, the delay module 302 delays the play time of the first speech, so that participants speak in order, a situation that a teleconference or a group voice or a group video process is disordered due to simultaneous speech of multiple persons is avoided, and a participation effect is improved.

The multimedia communication device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The multimedia communication device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited in this respect.

The multimedia communication device provided in the embodiment of the present application can implement each process implemented by the multimedia communication device in the method embodiments of fig. 1 to fig. 2, and is not described herein again to avoid repetition.

Optionally, as shown in fig. 4, an electronic device 400 is further provided in this embodiment of the present application, and includes a processor 401, a memory 402, and a program or an instruction stored in the memory 402 and executable on the processor 401, where the program or the instruction is executed by the processor 401 to implement each process of the multimedia call method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.

Fig. 5 is a schematic hardware configuration diagram of an electronic device 500 implementing various embodiments of the present application;

the electronic device 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and a power supply 511. Those skilled in the art will appreciate that the electronic device 500 may further include a power supply (e.g., a battery) for supplying power to various components, and the power supply may be logically connected to the processor 510 via a power management system, so as to implement functions of managing charging, discharging, and power consumption via the power management system. The electronic device structure shown in fig. 5 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.

Wherein, the processor 510 is configured to determine a first start time of a first speech of a first user in the target talk group; wherein the target talk group comprises at least the first user and a second user;

Optionally, the processor 510 is configured to control the audio capturing function of the microphone of the other users who are not in the speaking state to be in the off state if the number of users in the speaking state in the target talk group is greater than or equal to a second preset threshold.

Optionally, the processor 510 is configured to control the audio acquisition functions of the microphones of all the users to be in an open state if the number of the users in the speaking state in the target conversation group is smaller than the second preset threshold.

Optionally, the processor 510 is configured to obtain a speaking duration of the second speaking;

Optionally, the processor 510 is configured to acquire speech audios collected by microphones of all users, and determine time information of the speech audios;

and playing each speaking audio according to the speaking sequence.
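
Ordering the collected speech audios before playback can be sketched as follows. The sketch assumes that the time information of each audio is its start time in seconds and that the speaking order is simply the ascending order of start times; both the `(user, start_time)` tuple layout and the function name `speaking_order` are hypothetical.

```python
from typing import List, Tuple


def speaking_order(audios: List[Tuple[str, float]]) -> List[str]:
    """Return user ids in the order their speech audios should be played.

    Each entry is (user_id, start_time_in_seconds). Python's sorted() is
    stable, so audios with identical start times keep their input order.
    """
    ordered = sorted(audios, key=lambda audio: audio[1])
    return [user for user, _start in ordered]
```

A playback loop would then iterate over the returned list and play each user's audio in turn.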

In the embodiment of the application, a first start time of a first speech of a first user in a target talk group is determined. If the interval between the first start time and the first end time of the second speech is smaller than a first preset threshold, the playing time of the first speech is delayed. Participants can therefore speak in an orderly manner, the disorder caused by several people speaking simultaneously during a teleconference, group voice call, or group video call is avoided, and the participation effect is improved.
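
The delay decision summarized above can be sketched as follows. This is a minimal sketch under stated assumptions: times are in seconds, and the amount of delay is passed in as a hypothetical `delay` parameter, because this passage does not specify how long playback is postponed.

```python
def playback_start(first_start: float, second_end: float,
                   first_threshold: float, delay: float) -> float:
    """Return the time at which the first speech should start playing.

    first_start     -- first start time (start of the first user's speech)
    second_end      -- first end time (end of the second user's speech)
    first_threshold -- first preset threshold for the gap between speeches
    delay           -- assumed postponement applied when the gap is too short
    """
    gap = first_start - second_end
    if gap < 0:
        # The method assumes the first speech starts after the second ends.
        raise ValueError("first_start must not be earlier than second_end")

    if gap < first_threshold:
        # The first speech follows the second too closely: postpone playback.
        return first_start + delay

    # The gap is large enough: play the first speech without delay.
    return first_start
```

With a threshold of 0.5 s, a speech that starts 0.25 s after the previous one ends would be postponed, while one that starts 2 s later would be played immediately.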

It should be understood that in the embodiment of the present application, the input unit 504 may include a Graphics Processing Unit (GPU) 5041 and a microphone 5042. The graphics processor 5041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 506 may include a display panel 5061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 507 includes a touch panel 5071 and other input devices 5072. The touch panel 5071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. Other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in further detail here. The memory 509 may be used to store software programs as well as various data including, but not limited to, application programs and operating systems. The processor 510 may integrate an application processor, which primarily handles the operating system, user interfaces, and applications, and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 510.

The embodiment of the present application further provides a readable storage medium on which a program or an instruction is stored. When the program or the instruction is executed by a processor, each process of the above multimedia communication method embodiment is implemented and the same technical effect can be achieved; to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the above-mentioned multimedia communication method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition.

It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, etc.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; e.g., the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A multimedia communication method, comprising:

2. The method of claim 1, further comprising:

3. The method of claim 2, further comprising:

4. The method of claim 1, wherein the delaying the playing time of the first speech comprises:

acquiring the speaking duration of the second speech;

5. The method of claim 1, further comprising:

and playing each speaking audio according to the speaking sequence.

6. A multimedia telephony device, the device comprising:

7. The multimedia telephony device of claim 6, wherein the device comprises:

8. The multimedia telephony device of claim 7, wherein the device comprises:

9. The multimedia telephony device of claim 6, wherein the delay module comprises:

10. The multimedia telephony device of claim 6, wherein the device comprises: