CN111447325A - Call auxiliary method, device, terminal and storage medium


Info

Publication number
CN111447325A
CN111447325A (application CN202010258710.XA)
Authority
CN
China
Prior art keywords
information
lip language
voice
call
voice information
Prior art date
Legal status
Pending
Application number
CN202010258710.XA
Other languages
Chinese (zh)
Inventor
杨鑫 (Yang Xin)
Current Assignee
Shanghai Wingtech Electronic Technology Co Ltd
Original Assignee
Shanghai Wingtech Electronic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Wingtech Electronic Technology Co Ltd filed Critical Shanghai Wingtech Electronic Technology Co Ltd
Priority to CN202010258710.XA
Publication of CN111447325A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Social Psychology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The embodiment of the invention discloses a call assistance method, apparatus, terminal and storage medium. The method comprises the following steps: when communicating with a target terminal device, acquiring lip language information and voice information of a user; confirming the voice information based on the lip language information; if the confirmation result shows that the voice information is abnormal, generating auxiliary information based on the lip language information and the voice information; and sending the auxiliary information to the target terminal device to assist it in correcting the call information. When the voice information is confirmed to be abnormal, the lip language information is used to complete it and to generate auxiliary information containing the user's full speech, so that the user of the target terminal device can recover the local user's current call content from the auxiliary information. Both parties can therefore hold a normal call even under environmental interference, which solves the prior-art problem of calls impeded by such interference and improves the user's call quality.

Description

Call auxiliary method, device, terminal and storage medium
Technical Field
Embodiments of the present invention relate to communications technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for assisting a call.
Background
Mobile communication terminals, AI and 5G technologies are now widespread, yet poor communication in certain special scenarios remains unavoidable, for example noise interference (airports, waterfalls, underwater, the silent environment of outer space), visual interference (night, strong light, smoke) and signal interference (basements, base-station edges). Such interference causes loss of voice or video data during a call, leaves the call broken for the users, and degrades their communication efficiency and call quality.
Disclosure of Invention
The embodiment of the invention provides a call assistance method, apparatus, terminal and storage medium, which are used to improve the call quality of a user.
In one embodiment, the present invention provides a call assistance method, including:
when communicating with a target terminal device, lip language information and voice information of a user are acquired;
confirming the voice information based on the lip language information;
if the confirmation result shows that the voice information is abnormal, generating auxiliary information based on the lip language information and the voice information;
and sending the auxiliary information to the target terminal device to assist the target terminal device in correcting the call information.
Optionally, the obtaining lip language information of the user includes:
acquiring a mouth shape action of a user through an image acquisition unit of the terminal equipment;
determining the lip language information based on the mouth shape action.
Optionally, the confirming the voice information based on the lip language information includes:
respectively analyzing the lip language information and the voice information to obtain first text information corresponding to the lip language information and second text information corresponding to the voice information;
and comparing the first text information with the second text information to confirm the voice information.
Optionally, the abnormality in the voice information includes at least one of the following:
the dissimilarity rate of the analysis result of the voice information and the analysis result of the lip language information is greater than or equal to a preset threshold value;
the voice information is incomplete;
the voice message includes a noise signal exceeding a predetermined intensity.
Optionally, the generating auxiliary information based on the lip language information and the voice information includes:
adjusting the analysis result of the voice information based on the analysis result of the lip language information;
generating synthetic voice information based on the adjusted analysis result of the voice information;
and synthesizing the synthesized voice information and the call video acquired by the image acquisition unit of the terminal equipment to generate auxiliary information.
Optionally, if the analysis result of the lip language information includes a lip language text analysis result, the auxiliary information further includes subtitle information, and the subtitle information is generated based on the lip language text analysis result.
In one embodiment, there is also provided a call assistance apparatus, comprising:
the information acquisition module is used for acquiring lip language information and voice information of a user when communicating with the target terminal equipment;
the confirming module is used for confirming the voice information based on the lip language information;
the auxiliary information generation module is used for generating auxiliary information based on the lip language information and the voice information if the confirmation result shows that the voice information is abnormal;
and the auxiliary information transmission module is used for sending the auxiliary information to the target terminal equipment so as to assist the target terminal equipment in correcting the call information.
In one embodiment, there is also provided a terminal device, including:
one or more processors;
storage means for storing one or more programs;
a sensor for collecting lip language information and voice information of a user;
when the one or more programs are executed by the one or more processors, the one or more processors implement the call assistance method as described in any embodiment of the invention.
Optionally, the sensor is an infrared camera sensor.
In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which when executed by a processor implements the call assistance method according to any embodiment of the present invention.
The embodiment of the invention provides a call assistance method. By acquiring the user's lip language information and voice information and checking the voice information against the lip language information, the method can determine whether the user's voice information is abnormal. When the voice information is confirmed to be abnormal, it is completed through the lip language information, auxiliary information containing the user's full speech is generated, and the auxiliary information is sent to the target terminal device, so that the user of the target terminal device can recover the local user's current call content from it. A normal call thus remains possible between both parties even under environmental interference, which solves the prior-art problem of calls impeded by such interference and improves the user's call quality.
Drawings
Fig. 1 is a flowchart of a call assistance method according to an embodiment of the present invention;
fig. 2 is a block diagram of a call assistance apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Artificial Intelligence (AI) is a branch of science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. At present, AI technology is widely used in fields such as robotics, speech recognition, image recognition, natural language processing and expert systems.
The call assistance method provided by the embodiment of the invention builds on this AI technology: in certain scenarios, what the user is expressing is recognized by capturing the user's mouth movements (lip language), and a fifth-generation mobile communication network (5th-Generation, 5G) then carries the resulting high-volume data with millisecond latency, so that additional call-assistance information can be provided to the user in real time, improving the user's call quality and communication efficiency. Typically, when a user makes a call in a noisy place such as a pub, a sports venue, a concert, the seaside or near a railway track, the opposite-end user cannot accurately make out the local user's call voice because of the environmental noise, and the call is impeded.
Because user semantics can be recognized simply by capturing the user's mouth movements, which works for any user, the call assistance method provided by the embodiment of the invention is broadly applicable.
The call assistance method of the embodiment of the present invention is described in detail below with reference to the accompanying drawings. Fig. 1 is a flowchart of a call assistance method according to an embodiment of the present invention. The method may be executed by a terminal device and, referring to fig. 1, specifically includes:
s110, when the device is communicated with the target terminal device, lip language information and voice information of the user are obtained.
Lip language information is the information expressed by the user's mouth movements. Corresponding lip language information can be obtained by recognizing those movements with an AI processor or AI processing chip with a lip-language recognition function configured in the terminal device.
Voice information refers to the audio output by the audio acquisition device of the terminal device, which may be, for example, its microphone. The collected voice information is used later to judge whether the user's current call state is abnormal.
In one embodiment, acquisition of the user's lip language and voice information may be triggered by the user. For example, a floating window may be provided in the terminal; when the user judges that the current call environment is poor, one tap starts the terminal's collection of the lip language and voice information, and the terminal device then generates auxiliary information from them to assist the user's call.
In one embodiment, collection of the lip language and voice information may instead be triggered by the terminal device itself. For example, the terminal device starts collecting the user's voice signal when the call begins and analyzes it; when the recognized voice signal shows abnormalities such as excessive background noise or discontinuity, the terminal starts the image acquisition unit to collect the user's lip language information to assist the call, roughly as in the sketch below.
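The following is a minimal sketch of that device-side trigger, assuming normalized float audio frames; the function name, the zero-sample proxy for a discontinuous signal, and the calibration offset are illustrative assumptions rather than details from the patent.

    import numpy as np

    NOISE_DB_THRESHOLD = 60.0  # the preset intensity reused later in this embodiment

    def voice_frame_suspicious(samples: np.ndarray, calibration_db: float = 120.0) -> bool:
        """Check one captured voice frame: trigger lip-language collection when
        the frame is noisy or discontinuous. calibration_db is an assumed
        device-specific offset mapping full-scale dB onto sound-pressure dB;
        the overall level is used here as a crude proxy for background noise."""
        rms = np.sqrt(np.mean(np.square(samples, dtype=np.float64)))
        level_db = 20.0 * np.log10(max(rms, 1e-12)) + calibration_db
        dropped = float(np.mean(samples == 0.0)) > 0.2  # crude packet-loss proxy
        return level_db >= NOISE_DB_THRESHOLD or dropped

In a real terminal such a check would run once per audio buffer and, on its first positive result, switch on the camera.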
In one embodiment, the lip language information of the user is obtained specifically by the following method:
acquiring a mouth shape action of a user through an image acquisition unit of the terminal equipment;
lip language information is determined based on the mouth shape actions.
Specifically, the image acquisition unit may be, for example, a camera of the terminal device. In dimly lit scenes, such as a basement or at night, the user's mouth movements can be captured by an infrared-capable sensor, for example an infrared camera arranged on the terminal.
Each mouth movement corresponds to something being expressed, so the content the user wants to convey, i.e. the user's lip language information, can be determined by recognizing the mouth movements.
Optionally, after the user's mouth movements are collected, the AI processing chip may segment the video containing them into individual mouth images, extract image features from each image in turn to obtain a corresponding feature set, and then apply an image recognition model to the feature set to obtain the lip language information corresponding to the mouth movements.
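A rough sketch of that pipeline follows, with a fixed relative crop standing in for a real face/landmark detector and a hypothetical lip-reading model supplied by the caller; both are assumptions for illustration only.

    import cv2
    import numpy as np

    def video_to_mouth_features(video_path: str, mouth_box=(0.3, 0.6, 0.7, 0.95)):
        """Segment the call video into frames, crop the mouth region, and build
        a per-frame feature set. mouth_box is a fixed relative (x0, y0, x1, y1)
        crop used in place of a real face/landmark detector."""
        cap = cv2.VideoCapture(video_path)
        feats = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            h, w = frame.shape[:2]
            x0, y0, x1, y1 = mouth_box
            mouth = frame[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
            gray = cv2.cvtColor(mouth, cv2.COLOR_BGR2GRAY)
            feats.append(cv2.resize(gray, (64, 32)).astype(np.float32) / 255.0)
        cap.release()
        return np.stack(feats) if feats else np.empty((0, 32, 64), np.float32)

    def mouth_features_to_text(features: np.ndarray, model) -> str:
        """Run a recognition model over the feature set to obtain the lip
        language information as text. `model` is hypothetical: any
        video-to-text lip-reading network fits this slot."""
        return model.predict(features)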
S120: confirm the voice information based on the lip language information.
Confirming the voice information based on the lip language information means using the lip language information as the reference against which the voice information is checked for meeting the requirement.
Because lip language information and voice information are two different forms of expression, both must first be converted so that they can be compared in one and the same form.
In one embodiment, confirming the voice information based on the lip language information may be implemented as:
respectively analyzing the lip language information and the voice information to obtain first text information corresponding to the lip language information and second text information corresponding to the voice information;
and comparing the first text information with the second text information to confirm the voice information.
Specifically, the user's lip language information and voice information may be converted by a third-party device to obtain the corresponding text information. For example, the terminal device can communicate with a cloud server in real time over the 5G network, send the user's lip language and voice information to the server, have the server parse both immediately, and then compare the parsed first text information with the parsed second text information to confirm the voice information against the lip language information. Alternatively, the AI processing chip configured in the terminal device parses the lip language and voice information directly into the first and second text information and then compares the two itself.
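The comparison step can be pictured with a minimal sketch; the patent fixes no particular similarity metric, so difflib's sequence ratio is used here as one reasonable stand-in, and the threshold value is an assumption.

    from difflib import SequenceMatcher

    def dissimilarity(first_text: str, second_text: str) -> float:
        """Dissimilarity rate between the lip-language text and the voice text
        (one minus a similarity in [0, 1])."""
        return 1.0 - SequenceMatcher(None, first_text, second_text).ratio()

    def voice_confirmed(first_text: str, second_text: str, threshold: float = 0.3) -> bool:
        """True when the voice information agrees with the lip language information."""
        return dissimilarity(first_text, second_text) < threshold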
S130: if the confirmation result shows that the voice information is abnormal, generate auxiliary information based on the lip language information and the voice information.
Abnormal voice information is voice information that cannot completely or correctly reflect the user's call content; in that case the terminal device generates auxiliary information to help the opposite-end user acquire the local user's call content. For example, if the collected lip language information shows during a call that the user's voice information has lost packets, the call content the receiving user obtains will be inaccurate and communication will suffer. The terminal device can then provide the opposite-end user with auxiliary information built from the collected lip language information, helping that user recover the call speech that could not be identified.
Optionally, when the voice information is confirmed by text comparison, the confirmation result may be output according to the difference between the text information corresponding to the lip language information and that corresponding to the voice information. When the difference is small and within a set range, the terminal device outputs a result of normal; when the difference exceeds the set range, the terminal device outputs a result of abnormal.
In one embodiment, the abnormality in the voice information comprises at least one of the following:
the dissimilarity rate of the analysis result of the voice information and the analysis result of the lip language information is greater than or equal to a preset threshold value;
the voice information is incomplete;
the voice information includes a noise signal exceeding a preset intensity.
Specifically, the dissimilarity rate between the analysis result of the voice information and that of the lip language information measures how far the two diverge. When this rate is greater than or equal to a preset threshold, the similarity between the voice information and the lip language information is low: the voice information differs greatly from the semantics the user actually wants to express and cannot truly reflect the user's call content. For example, when a user makes a call in a bar, so much noise is superimposed on the collected voice information that it differs greatly from the user's original call voice, so auxiliary information must be generated to help the opposite-end user acquire the content.
Incomplete voice information indicates packet loss in the user's voice information. Clearly, when packets are lost, the receiving user cannot obtain accurate call content, resulting in a poor call. In some cases the incompleteness appears as discontinuity in the voice information, typically as the receiving terminal receiving a broken voice signal.
A noise signal exceeding the preset intensity means the noise has reached a level that interferes with the voice signal, so that the receiving user cannot obtain accurate call content. For example, when the noise is close in intensity to, or even drowns out, the user's speech, the receiving user can hardly distinguish the sender's speech content. The intensity (loudness) of a sound signal is usually expressed through its amplitude: the greater the amplitude, the greater the intensity. In one embodiment, the preset intensity is set to 60 dB; when the noise signal reaches or exceeds 60 dB, the speech signal is considered too noisy and needs to be completed. The three conditions can be checked roughly as in the sketch below.
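Putting the conditions together, an abnormality test might be sketched as follows; apart from the 60 dB preset named above, the threshold values, the zero-run proxy for packet loss, and the dB calibration offset are illustrative assumptions.

    import numpy as np
    from difflib import SequenceMatcher

    DISSIMILARITY_THRESHOLD = 0.3  # assumed preset threshold
    NOISE_DB_THRESHOLD = 60.0      # the 60 dB preset intensity

    def longest_zero_run(samples: np.ndarray) -> int:
        """Length of the longest run of zero samples (a packet-loss proxy)."""
        padded = np.concatenate(([0], (samples == 0).astype(int), [0]))
        edges = np.flatnonzero(np.diff(padded))
        return int((edges[1::2] - edges[::2]).max()) if edges.size else 0

    def voice_abnormal(lip_text: str, voice_text: str,
                       voice: np.ndarray, noise: np.ndarray,
                       calibration_db: float = 120.0) -> bool:
        """Check the three abnormality conditions; isolating `noise` from the
        speech (e.g. from silent stretches) is assumed to happen upstream."""
        # 1. dissimilarity rate >= preset threshold
        if 1.0 - SequenceMatcher(None, lip_text, voice_text).ratio() >= DISSIMILARITY_THRESHOLD:
            return True
        # 2. incomplete voice information: a long gap of missing samples
        if longest_zero_run(voice) > 0.1 * len(voice):
            return True
        # 3. noise above the preset intensity; calibration_db is an assumed
        # device-specific mapping from full-scale dB to sound-pressure dB
        rms = np.sqrt(np.mean(np.square(noise, dtype=np.float64)))
        return 20.0 * np.log10(max(rms, 1e-12)) + calibration_db >= NOISE_DB_THRESHOLD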
In one embodiment, generating the auxiliary information based on the lip language information and the voice information may be implemented as:
adjusting the analysis result of the voice information based on the analysis result of the lip language information;
generating synthetic voice information based on the adjusted analysis result of the voice information;
and synthesizing the synthesized voice information and the call video acquired by the image acquisition unit of the terminal equipment to generate auxiliary information.
Specifically, adjusting the analysis result of the voice information may mean supplementing and/or replacing it. When parsing shows that the voice information has lost packets, the analysis result of the voice information is supplemented from the analysis result of the lip language information so as to complete the voice information; when the two analysis results are inconsistent, the analysis result of the voice information is replaced by that of the lip language information so as to correct the voice information.
When both the analysis result of the lip language information and that of the voice information are text, the adjusted analysis result of the voice information is correspondingly also text, and the synthesized voice information can then be obtained by text-to-speech conversion.
The call video may be a video, collected by the image acquisition unit, that contains the user's mouth movements. Synthesizing the synthesized voice information with the call video may, for example, mean laying out the call video into continuous picture data and combining the synthesized voice information with that picture data into new video data through an AI rendering engine, so that the synthesized voice plays in sync when the new video is played and the receiving user can accurately determine the sender's speech content from the auxiliary information.
In an embodiment, the generated auxiliary information further includes subtitle information. Specifically, the subtitle information matches the content of the lip language text parsing result, and the AI rendering engine combines the synthesized voice information, the composed picture data and the subtitle information into new video data that displays the corresponding subtitles in sync. The receiving user can thus confirm the opposite user's call content through three channels, image, voice and subtitles, which further helps accurate recognition. A sketch of the whole generation step follows.
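The following sketch covers the adjustment, speech synthesis and muxing steps, assuming a hypothetical text-to-speech callable `tts` that returns a WAV path, ffmpeg available on PATH in place of the AI rendering engine, and a single full-length subtitle cue (the patent does not specify cue timing).

    import subprocess
    from difflib import SequenceMatcher

    def adjust_transcript(voice_text: str, lip_text: str) -> str:
        """Supplement/replace the voice parse with the lip parse: keep spans
        where the two agree and take the lip-language text elsewhere. A simple
        word-level heuristic, not the patent's exact algorithm."""
        v_words, l_words = voice_text.split(), lip_text.split()
        out = []
        for op, a0, a1, b0, b1 in SequenceMatcher(None, v_words, l_words).get_opcodes():
            out.extend(v_words[a0:a1] if op == "equal" else l_words[b0:b1])
        return " ".join(out)

    def build_auxiliary_video(call_video: str, transcript: str, tts,
                              out_path: str = "auxiliary.mp4") -> str:
        """Synthesize speech from the adjusted transcript and mux it with the
        call video and subtitle information into new video data."""
        wav = tts(transcript)  # synthesized voice information (hypothetical TTS)
        with open("auxiliary.srt", "w", encoding="utf-8") as f:  # subtitle information
            f.write("1\n00:00:00,000 --> 00:00:10,000\n" + transcript + "\n")
        subprocess.run(
            ["ffmpeg", "-y", "-i", call_video, "-i", wav, "-i", "auxiliary.srt",
             "-map", "0:v", "-map", "1:a", "-map", "2:s",
             "-c:v", "copy", "-c:s", "mov_text", out_path],
            check=True)
        return out_path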
In this embodiment the user's voice signal is analyzed; when it is determined to be abnormal, the voice information is completed on the basis of the lip language information, and auxiliary information containing the user's full speech is generated to help the opposite-end user acquire the information.
S140: send the auxiliary information to the target terminal device to assist it in correcting the call information.
As described above, the auxiliary information is obtained by correcting the voice information on the basis of the lip language information and contains the user's complete call speech, so it faithfully reflects the call content, and the target terminal's user can obtain the true call content from it. The benefit is that when the call is abnormal, the user of the target terminal device can determine the other party's call content from the auxiliary information, with no need to guess the other party's intention or interrupt the call for one or more rounds of confirmation, making the call smoother and more efficient for both parties.
The principle of the call assistance method is as follows: under external interference, the user's lip movements are captured and translated to obtain the user's lip language information, and the voice information is completed according to that lip language information, so that the effect of the environmental interference is overcome and both parties can communicate without obstruction on the basis of the completed voice information.
The embodiment of the invention provides a call assistance method. By acquiring the user's lip language information and voice information and checking the voice information against the lip language information, the method can determine whether the user's voice information is abnormal. When the confirmation result shows that it is, the voice information is completed through the lip language information, auxiliary information containing the user's full speech is generated, and the auxiliary information is sent to the target terminal device, so that the user of the target terminal device can recover the local user's current call content from it. A normal call thus remains possible between both parties even under environmental interference, which solves the prior-art problem of calls impeded by such interference and improves the user's call quality.
On the basis of the above embodiments, an embodiment of the present invention further provides a call assistance apparatus, which may be implemented in software and/or hardware and may be configured in a terminal device. Fig. 2 is a block diagram of a call assistance apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes: an information acquisition module 210, a confirmation module 220, an auxiliary information generation module 230, and an auxiliary information transmission module 240, wherein:
an information acquisition module 210, configured to acquire lip language information and voice information of a user when communicating with a target terminal device;
a confirmation module 220, configured to confirm the voice information based on the lip language information;
an auxiliary information generating module 230, configured to generate auxiliary information based on the lip language information and the voice information if the confirmation result indicates that the voice information is abnormal;
an auxiliary information transmission module 240, configured to send the auxiliary information to the target terminal device to assist it in correcting the call information.
optionally, the information obtaining module 210 includes:
the mouth shape action acquisition unit is used for acquiring the mouth shape action of a user through an image acquisition unit of the terminal equipment;
a lip language information determination unit, configured to determine the lip language information based on the mouth movements.
Optionally, the confirming module 220 includes:
the analysis unit is used for respectively analyzing the lip language information and the voice information to obtain first text information corresponding to the lip language information and second text information corresponding to the voice information;
and the text comparison unit is used for comparing the first text information with the second text information so as to confirm the voice information.
Optionally, on the basis of the above technical solution, the abnormality in the voice information includes at least one of the following:
the dissimilarity rate of the analysis result of the voice information and the analysis result of the lip language information is greater than or equal to a preset threshold value;
the voice information is incomplete;
the voice information includes a noise signal exceeding a preset intensity.
Optionally, the auxiliary information generating module 230 specifically includes:
the information adjusting unit is used for adjusting the analysis result of the voice information based on the analysis result of the lip language information;
a voice synthesis unit for generating synthesized voice information based on the adjusted analysis result of the voice information;
and the auxiliary information generating unit is used for synthesizing the synthesized voice information and the call video acquired by the image acquisition unit of the terminal equipment to generate auxiliary information.
On the basis of the above technical solution, optionally, if the analysis result of the lip language information includes a lip language text analysis result, the auxiliary information further includes subtitle information, and the subtitle information is generated based on the lip language text analysis result.
The call assistance apparatus provided by the embodiment of the present invention can execute the call assistance method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects. For details not described in this embodiment, reference may be made to the description of any method embodiment of the invention.
On the basis of the foregoing embodiments, an embodiment of the present invention further provides a terminal device; fig. 3 is a schematic structural diagram of this terminal device and illustrates a block diagram of an exemplary terminal device 312 suitable for implementing embodiments of the invention. The terminal device 312 shown in fig. 3 is only an example and imposes no limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 3, terminal device 312 is in the form of a general purpose computing device. The components of the terminal device 312 may include, but are not limited to: one or more processors or processing units 316, a system memory 328, and a bus 318 that couples the various system components including the system memory 328 and the processing unit 316.
Bus 318 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Terminal device 312 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by terminal device 312 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 328 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 330 and/or cache memory 332. The terminal device 312 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 334 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 318 by one or more data media interfaces. System memory 328 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 340 having a set (at least one) of program modules 342 may be stored, for example, in memory 328, such program modules 342 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 342 generally perform the functions and/or methodologies of the described embodiments of the invention.
The terminal device 312 may also communicate with one or more external devices 314 (e.g., keyboard, pointing device, display 324, etc.), and may also communicate with one or more devices that enable a user to interact with the terminal device 312, and/or with any devices (e.g., network card, modem, etc.) that enable the terminal device 312 to communicate with one or more other computing devices.
In this embodiment, the external device 314 further includes a sensor (not shown in fig. 3) for collecting lip language information and voice information of the user. The sensor may be a camera configured in the terminal. In an alternative embodiment, the sensor is an infrared camera sensor to enable the capture of a video image including the user's lip language information in poor or no light conditions.
The processing unit 316 executes various functional applications and data processing by executing programs stored in the system memory 328, for example implementing the call assistance method provided by any embodiment of the present invention, which includes:
when communicating with a target terminal device, lip language information and voice information of a user are acquired;
confirming the voice information based on the lip language information;
if the confirmation result shows that the voice information is abnormal, generating auxiliary information based on the lip language information and the voice information;
and sending the auxiliary information to the target terminal equipment to assist the target terminal equipment in correcting the call information.
On the basis of the foregoing embodiments, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the call assistance method provided by the embodiments of the present invention, the method including: when communicating with a target terminal device, acquiring lip language information and voice information of a user; confirming the voice information based on the lip language information; if the confirmation result shows that the voice information is abnormal, generating auxiliary information based on the lip language information and the voice information; and sending the auxiliary information to the target terminal device to assist it in correcting the call information.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A call assistance method, comprising:
when communicating with a target terminal device, lip language information and voice information of a user are acquired;
confirming the voice information based on the lip language information;
if the confirmation result shows that the voice information is abnormal, generating auxiliary information based on the lip language information and the voice information;
and sending the auxiliary information to the target terminal device to assist the target terminal device in correcting the call information.
2. The call assistance method according to claim 1, wherein the acquiring lip language information of the user comprises:
acquiring a mouth shape action of a user through an image acquisition unit of the terminal equipment;
determining the lip language information based on the mouth shape action.
3. The call assistance method according to claim 1, wherein the confirming the voice information based on the lip language information comprises:
respectively analyzing the lip language information and the voice information to obtain first text information corresponding to the lip language information and second text information corresponding to the voice information;
and comparing the first text information with the second text information to confirm the voice information.
4. The call assistance method according to claim 1, wherein the abnormality in the voice information comprises at least one of:
the dissimilarity rate of the analysis result of the voice information and the analysis result of the lip language information is greater than or equal to a preset threshold value;
the voice information is incomplete;
the voice information includes a noise signal exceeding a preset intensity.
5. The call assistance method according to claim 1, wherein the generating of the auxiliary information based on the lip language information and the voice information comprises:
adjusting the analysis result of the voice information based on the analysis result of the lip language information;
generating synthetic voice information based on the adjusted analysis result of the voice information;
and synthesizing the synthesized voice information and the call video acquired by the image acquisition unit of the terminal equipment to generate auxiliary information.
6. The call assistance method according to claim 5, wherein if the parsing result of the lip language information includes a lip language text parsing result, the auxiliary information further includes subtitle information, and the subtitle information is generated based on the lip language text parsing result.
7. A call assistance apparatus, comprising:
the information acquisition module is used for acquiring lip language information and voice information of a user when communicating with the target terminal equipment;
the confirming module is used for confirming the voice information based on the lip language information;
the auxiliary information generation module is used for generating auxiliary information based on the lip language information and the voice information if the confirmation result shows that the voice information is abnormal;
and the auxiliary information transmission module is used for sending the auxiliary information to the target terminal equipment so as to assist the target terminal equipment in correcting the call information.
8. A terminal device, comprising:
one or more processors;
storage means for storing one or more programs;
a sensor for collecting lip language information and voice information of a user;
when the one or more programs are executed by the one or more processors, the one or more processors implement the call assistance method as recited in any one of claims 1-6.
9. The terminal device of claim 8, wherein the sensor is an infrared camera sensor.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the call assistance method according to any one of claims 1-6.
CN202010258710.XA 2020-04-03 2020-04-03 Call auxiliary method, device, terminal and storage medium Pending CN111447325A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010258710.XA CN111447325A (en) 2020-04-03 2020-04-03 Call auxiliary method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010258710.XA CN111447325A (en) 2020-04-03 2020-04-03 Call auxiliary method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN111447325A true CN111447325A (en) 2020-07-24

Family

ID=71654861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010258710.XA Pending CN111447325A (en) 2020-04-03 2020-04-03 Call auxiliary method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111447325A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101500127A (en) * 2008-01-28 2009-08-05 德信智能手机技术(北京)有限公司 Method for synchronously displaying subtitle in video telephone call
CN105450970A (en) * 2014-06-16 2016-03-30 联想(北京)有限公司 Information processing method and electronic equipment
CN106157956A (en) * 2015-03-24 2016-11-23 中兴通讯股份有限公司 The method and device of speech recognition
US20170098447A1 (en) * 2014-11-28 2017-04-06 Shenzhen Skyworth-Rgb Electronic Co., Ltd. Voice recognition method and system
CN107800860A (en) * 2016-09-07 2018-03-13 中兴通讯股份有限公司 Method of speech processing, device and terminal device
CN109637521A (en) * 2018-10-29 2019-04-16 深圳壹账通智能科技有限公司 A kind of lip reading recognition methods and device based on deep learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345436A (en) * 2021-08-05 2021-09-03 创维电器股份有限公司 Remote voice recognition control system and method based on multi-system integration high recognition rate
CN113345436B (en) * 2021-08-05 2021-11-12 创维电器股份有限公司 Remote voice recognition control system and method based on multi-system integration high recognition rate
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN116469405A (en) * 2023-04-23 2023-07-21 富韵声学科技(深圳)有限公司 Noise reduction conversation method, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200724)