CN116939091A - Voice call content display method and device - Google Patents

Voice call content display method and device

Info

Publication number
CN116939091A
CN116939091A (application CN202310898438.5A)
Authority
CN
China
Prior art keywords
information
voice
emotion
content
content information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310898438.5A
Other languages
Chinese (zh)
Inventor
周丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202310898438.5A
Publication of CN116939091A
Legal status: Pending

Classifications

    • H: Electricity
        • H04: Electric communication technique
            • H04M: Telephonic communication
                • H04M 1/00: Substation equipment, e.g. for use by subscribers
                    • H04M 1/72: Mobile telephones; cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
                        • H04M 1/724: User interfaces specially adapted for cordless or mobile telephones
                            • H04M 1/72403: … with means for local support of applications that increase the functionality
                                • H04M 1/7243: … with interactive means for internal management of messages
                                    • H04M 1/72433: … for voice messaging, e.g. dictaphones
    • G: Physics
        • G10: Musical instruments; acoustics
            • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
                • G10L 15/00: Speech recognition
                    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
                    • G10L 15/08: Speech classification or search
                        • G10L 15/16: … using artificial neural networks
                    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 to G10L 21/00
                    • G10L 25/03: … characterised by the type of extracted parameters
                    • G10L 25/27: … characterised by the analysis technique
                        • G10L 25/30: … using neural networks
                    • G10L 25/48: … specially adapted for particular use
                        • G10L 25/51: … for comparison or discrimination
                            • G10L 25/63: … for estimating an emotional state

Abstract

The application discloses a voice call content display method and device, belonging to the field of communication technology. The voice call content display method comprises the following steps: receiving a first input; in response to the first input, performing voice recognition processing on audio data in a voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information; and outputting the content information together with the object information and emotion information corresponding to the content information.

Description

Voice call content display method and device
Technical Field
The application belongs to the technical field of communication, and particularly relates to a voice call content display method and device.
Background
In the related art, barrier-free (accessibility) technology can provide an assisted-communication function that mainly performs voice recognition on call speech to obtain and display text information.
However, an actual voice call can be complex: besides the textual content it also carries information such as intonation, and text alone cannot completely convey everything a party to the call means, so the displayed voice call content information is inaccurate.
Disclosure of Invention
The embodiment of the application aims to provide a voice call content display method and device, which can solve the problem of inaccurate display of voice call content information.
In a first aspect, an embodiment of the present application provides a method for displaying voice call content, including:
receiving a first input;
in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and outputting the content information and the object information and emotion information corresponding to the content information.
In a second aspect, an embodiment of the present application provides a voice call content display apparatus, including:
a receiving module for receiving a first input;
the processing module is used for, in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and the output module is used for outputting the content information, and the object information and the emotion information corresponding to the content information.
In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method as in the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method as in the first aspect.
In a fifth aspect, embodiments of the present application provide a chip comprising a processor and a communication interface coupled to the processor for running a program or instructions implementing the steps of the method as in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement a method as in the first aspect.
In the embodiment of the application, during a voice call, audio data of the call is collected in real time according to the first input of the user, and voice recognition processing is performed on the collected audio data to obtain the content information of the audio data, i.e. the speech content of the talker, together with the object information of the person who spoke each utterance and the emotion information at the time it was spoken.
When the call content is displayed, the object information corresponding to the content information is displayed at the same time as the content information itself, i.e. the object to whom each utterance belongs, together with the emotion information of that object when speaking; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
Drawings
Fig. 1 is a flowchart showing a voice call content display method according to an embodiment of the present application;
FIG. 2 illustrates one of the interface schematics of an electronic device according to an embodiment of the application;
FIG. 3 shows a second interface diagram of an electronic device according to an embodiment of the application;
FIG. 4 is a schematic diagram of a voice call content display setup interface according to an embodiment of the present application;
fig. 5 is a block diagram showing the structure of a voice call content display apparatus according to an embodiment of the present application;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and claims are used to distinguish between similar objects, and not necessarily to describe a particular sequence or chronological order. It is to be understood that terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be practiced in sequences other than those illustrated or described herein; objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited. For example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The method and the device for displaying the voice call content provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
In some embodiments of the present application, a voice call content display method is provided. Fig. 1 shows a flowchart of a voice call content display method according to an embodiment of the present application; as shown in fig. 1, the voice call content display method includes:
step 102, receiving a first input;
in the embodiment of the application, the first input can be the touch input of the user to the electronic equipment, and when the user performs voice call, the user starts the barrier-free call function through the first input, and at the moment, the electronic equipment can convert the audio signal in the call process into the image signal.
Step 104, responding to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
in the embodiment of the application, in the process of voice call through electronic equipment such as a mobile phone and the like, audio data in the call process is collected in real time, and the audio data comprises voice audio of an object which performs voice call with a current user.
After audio data including voice audio of an object to be voice-communicated with a current user is acquired, voice recognition processing is performed on the audio data, and corresponding content information, specifically, text information of words spoken by the user voice-communicated object, such as "good morning, good weather today, is acquired through the voice recognition processing.
The object information represents, for each sentence in the content information of the audio data, the object who spoke that sentence. For example, the object corresponding to each sentence may be identified by voiceprint recognition.
For example, when a sentence is detected, the voiceprint corresponding to the current sentence is identified and it is determined whether that voiceprint has been recorded. If it has not, new object information is created, such as "object A", and the current sentence is marked with it. If the voiceprint has already been recorded as "object B", the current sentence is marked directly as "object B".
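For illustration only, the following Python sketch shows one possible shape of this voiceprint marking flow. The SpeakerRegistry class, the cosine-similarity threshold of 0.75 and the 128-dimensional embeddings are hypothetical choices, not specified by this application, and producing a voiceprint embedding from an audio sentence is outside the sketch.

    import numpy as np

    class SpeakerRegistry:
        """Track the voiceprints heard so far in a call and label each
        sentence with the matching object ("object A", "object B", ...)."""

        def __init__(self, threshold: float = 0.75):
            self.threshold = threshold          # cosine-similarity cutoff (assumed)
            self.voiceprints: list[np.ndarray] = []
            self.labels: list[str] = []

        def identify(self, embedding: np.ndarray) -> str:
            embedding = embedding / np.linalg.norm(embedding)
            for voiceprint, label in zip(self.voiceprints, self.labels):
                if float(voiceprint @ embedding) >= self.threshold:
                    return label                # voiceprint already recorded
            label = f"object {chr(ord('A') + len(self.labels))}"
            self.voiceprints.append(embedding)  # record the new voiceprint
            self.labels.append(label)
            return label

    registry = SpeakerRegistry()
    rng = np.random.default_rng(0)
    first = rng.normal(size=128)
    print(registry.identify(first))                                     # object A
    print(registry.identify(first + rng.normal(scale=0.01, size=128)))  # object A
    print(registry.identify(rng.normal(size=128)))                      # object B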
The emotion information is predicted by a neural network model from the intonation and other features of each sentence in the audio data, and expresses the true emotion of the call object when speaking the sentence, such as happiness, anxiety or anger.
And step 106, outputting the content information, and the object information and emotion information corresponding to the content information.
In the related art, barrier-free technology can provide an assisted-communication function whose main role is to perform voice recognition on call speech, obtain text information and output it. The output mode includes displaying the text information, or converting it into other electrical signals that are transmitted to a tactile output device and output as tactile information.
However, during a voice call, the words alone cannot fully express information such as the emotion of the call objects. Moreover, when several call objects take part in the voice call, they are not distinguished: all recognized speech is simply converted into text and displayed, so a user relying on barrier-free communication finds it hard to tell the words of the different call objects apart, and cannot perceive the emotion of each object when speaking.
In order to solve the above problems, in the embodiment of the present application, during a voice call, audio data is collected in real time and voice recognition processing is performed on it to obtain the content information of the audio data, i.e. the speech content of the talker, together with the object information of the person who spoke and the emotion information at the time of speaking.
When the call content is displayed, the embodiment of the application displays the object information corresponding to the content information, i.e. the object who spoke the corresponding words, and simultaneously displays the emotion information when the object spoke them; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
In some embodiments of the application, the audio data comprises a plurality of audio statements;
performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information, wherein the method comprises the following steps:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in a plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
In the embodiment of the application, the audio data specifically includes a plurality of audio sentences, where each audio sentence corresponds to one sentence spoken by a call object in the voice call with the current user; a passage of speech spoken by a call object can be divided into multiple audio sentences at its natural sentence breaks, i.e. pauses, as sketched below.
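As an illustrative sketch of such pause-based segmentation (the 20 ms frames, the RMS energy threshold of 0.01 and the 400 ms minimum pause are assumed values; a production system would more likely use a trained voice activity detector):

    import numpy as np

    def split_into_sentences(samples: np.ndarray, sr: int = 16000,
                             frame_ms: int = 20, silence_thresh: float = 0.01,
                             min_pause_ms: int = 400) -> list[np.ndarray]:
        """Cut call audio into audio sentences at natural pauses: a frame is
        'silent' when its RMS energy is below silence_thresh, and a run of
        silent frames longer than min_pause_ms ends the current sentence."""
        frame = sr * frame_ms // 1000
        n_frames = len(samples) // frame
        rms = np.sqrt(np.mean(
            samples[:n_frames * frame].reshape(n_frames, frame) ** 2, axis=1))
        silent = rms < silence_thresh
        min_pause = min_pause_ms // frame_ms
        sentences, start, pause = [], None, 0
        for i, is_silent in enumerate(silent):
            if not is_silent:
                if start is None:
                    start = i                   # first voiced frame of a sentence
                pause = 0
            elif start is not None:
                pause += 1
                if pause >= min_pause:          # long pause: close the sentence
                    sentences.append(samples[start * frame:(i - pause + 1) * frame])
                    start, pause = None, 0
        if start is not None:                   # audio ended mid-sentence
            sentences.append(samples[start * frame:])
        return sentences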
When performing voice recognition processing on the audio data, a sound feature recognition algorithm identifies the sound features, such as voiceprint features, corresponding to each sentence in the audio data, and from these features the object information corresponding to each audio sentence is obtained.
Specifically, the sound feature information of each audio sentence is detected separately, the identity of the object speaking the current sentence is identified from that information, and it is determined whether the object has already been recorded. If not, new object information is created, such as "object A", which is the voice object corresponding to the current audio sentence. If the object has already been recorded, e.g. as "object B", then "object B" is the voice object corresponding to the current audio sentence.
When performing emotion feature recognition on the audio data, an emotion recognition model, such as a neural network model based on a GRNN (Generalized Regression Neural Network), predicts the emotion of the call object when speaking each sentence, such as "happy", "anxious" or "angry", from features of each audio sentence such as speech speed and intonation.
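For illustration, a toy GRNN of this kind can be written directly: its prediction is a Gaussian-kernel-weighted average of one-hot training labels. The two prosodic features (normalized speech speed and mean pitch), the smoothing width sigma, and the tiny training set below are hypothetical; a deployed model would be trained on a labelled emotional-speech corpus.

    import numpy as np

    EMOTIONS = ["happy", "anxious", "angry", "neutral"]

    class GRNNEmotionClassifier:
        """Toy GRNN over prosodic features. The GRNN prediction is
            y(x) = sum_i t_i * K(x, x_i) / sum_i K(x, x_i)
        with a Gaussian kernel K; the targets t_i are one-hot emotion
        labels, so y(x) behaves like a class-probability vector."""

        def __init__(self, sigma: float = 0.5):
            self.sigma = sigma                  # kernel width (assumed)

        def fit(self, features: np.ndarray, labels: list[str]):
            self.x = features
            self.t = np.eye(len(EMOTIONS))[[EMOTIONS.index(l) for l in labels]]
            return self

        def predict(self, x: np.ndarray) -> str:
            d2 = np.sum((self.x - x) ** 2, axis=1)      # squared distances
            w = np.exp(-d2 / (2 * self.sigma ** 2))     # Gaussian kernel weights
            probs = w @ self.t / np.sum(w)
            return EMOTIONS[int(np.argmax(probs))]

    # features: [normalized speech speed, normalized mean pitch] (illustrative)
    train_x = np.array([[0.9, 0.8], [0.4, 0.3], [0.95, 0.2], [0.5, 0.5]])
    train_y = ["happy", "anxious", "angry", "neutral"]
    clf = GRNNEmotionClassifier().fit(train_x, train_y)
    print(clf.predict(np.array([0.85, 0.75])))  # -> "happy"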
After the object information and emotion information are obtained, voice-to-text processing is performed on the audio data by a voice recognition algorithm, such as one based on a hidden Markov model or a Gaussian mixture model; the resulting content information includes the sentence information corresponding to the audio sentences, such as "The weather is nice today" or "It is raining outside".
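As one concrete possibility (an assumption for illustration, not a requirement of this application), the third-party SpeechRecognition package for Python can perform the voice-to-text step on a recorded audio sentence:

    import speech_recognition as sr

    def sentence_to_text(wav_path: str, language: str = "zh-CN") -> str:
        """Transcribe one audio sentence stored as a WAV file."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)   # read the whole sentence
        try:
            return recognizer.recognize_google(audio, language=language)
        except sr.UnknownValueError:
            return ""                           # unintelligible speech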
By identifying the object information and emotion information corresponding to each sentence in the call audio, and displaying them at the same time as the text information corresponding to the call audio, the embodiment of the application can express the meaning of the call object more completely and accurately, which improves the accuracy of voice call content information display.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
when the number of the voice objects is plural, respectively outputting the plural voice objects and sentence information corresponding to the plural voice objects, and outputting emotion type corresponding to each sentence information.
In the embodiment of the application, if the number of the voice objects is multiple, namely when a multi-person speaking scene is detected, each voice object and statement information corresponding to each voice object are respectively displayed when the statement information is displayed.
For example, the number of voice objects is two, specifically an object A and an object B, and object A and object B are both in the voice call with the current user at the same time.
When the object A speaks, the object information of the object A is displayed in the conversation interface, and the sentence information of the object A is displayed near the object information of the object A.
When the object B speaks, the object information of the object B is displayed in the call interface, and the sentence information of the object B is displayed near the object information of the object B.
When object A and object B speak simultaneously, the object information of both is displayed in the call interface; the sentence information of object A is displayed near the object information of object A, and the sentence information of object B is displayed near the object information of object B.
Illustratively, fig. 2 shows one of interface diagrams of an electronic device according to an embodiment of the present application, and as shown in fig. 2, a call interface 200 displays text information 202, object information 204, and emotion information 206.
The object information includes an object a and an object B, and sentence information 2022 corresponding to the object a and sentence information 2024 corresponding to the object B are displayed, respectively.
At the same time, in the vicinity of the sentence information 2022 corresponding to the object a, the emotion information 2062 corresponding to the sentence information 2022 is displayed, and in the vicinity of the sentence information 2024 corresponding to the object B, the emotion information 2064 corresponding to the sentence information 2024 is displayed.
It can be appreciated that, in some embodiments, the gender of each voice call object may also be identified from characteristics such as pitch when the object speaks, for example identifying a male call object A and a female call object B, which further helps the user distinguish different call objects and accurately understand the call content.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
when the number of emotion types is plural, speech object and sentence information corresponding to the speech object are output, and a weight ratio of each emotion type among the plural emotion types corresponding to the sentence information is output.
In the embodiment of the application, in some scenes the emotion of the call object may change while speaking. For the case where several emotion types are involved, when the emotion information of the voice call object is identified through the emotion recognition model, the emotion types contained in a sentence or passage and the weight ratio corresponding to each emotion type can be identified respectively.
For example, fig. 3 shows a second interface schematic diagram of the electronic device according to the embodiment of the application, and as shown in fig. 3, a text message 302, an object message 304, and an emotion message 306 are displayed in a call interface 300.
The text information 302 includes the sentence information: "The weather is nice today, but I still feel a bit anxious", where the emotion information corresponding to "the weather is nice today" is happy, and the emotion information corresponding to "a bit anxious" is anxious.
When two different pieces of emotion information are determined, the weight ratio of each is determined from the number of clauses corresponding to each emotion; for example, the emotion information 3062 of "The weather is nice today" is happy with a weight ratio of 50%, and the emotion information 3064 of "but I still feel a bit anxious" is anxious with a weight ratio of 50%. A sketch of this counting rule follows.
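For illustration, the counting rule used in this example (each emotion's weight ratio is the fraction of clauses tagged with that emotion) is a few lines of Python; the per-clause tags themselves come from the emotion recognition model described earlier:

    from collections import Counter

    def emotion_weights(clause_emotions: list[str]) -> dict[str, float]:
        """Weight ratio of each emotion type, proportional to the number
        of clauses tagged with that emotion."""
        counts = Counter(clause_emotions)
        total = sum(counts.values())
        return {emotion: n / total for emotion, n in counts.items()}

    print(emotion_weights(["happy", "anxious"]))
    # {'happy': 0.5, 'anxious': 0.5}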
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
determining keyword information in the text information in the case that the amount of text in the text information is greater than a preset threshold;
determining a target text according to the keyword information;
and outputting the target text, and the object information and emotion information corresponding to the content information.
In some situations, the voice call object may speak a long continuous passage that contains important key information but also unimportant words such as filler words, greetings or polite phrases.
If the entire content of the long speech is converted into text for display, it increases the user's reading burden. For this case, the user may manually turn on the long voice reduction function in the setting interface.
Fig. 4 is a schematic diagram of a voice call content display setting interface according to an embodiment of the present application. As shown in fig. 4, the setting interface 400 includes a function indicator 402. When the user has turned on the long voice reduction function and it is detected that the amount of text in the text information corresponding to the call audio data is greater than the preset threshold, keyword information is extracted from the text information, and a reduced target text is generated from the keyword information and displayed.
For example, the call object speaks a long passage: "Hi! I have seen the file you gave me, and some of the content needs to be updated. I would like to meet you tomorrow to talk about the latest business changes and determine the next steps. We also need to discuss how to handle this report next. The meeting time is three pm tomorrow; do you think this schedule is suitable, or do you have other suggestions?"
When the long voice reduction function is turned on, the electronic device automatically extracts keyword information from the text of the long speech, including "seen", "file", "needs to be updated", "tomorrow", "meet", "business changes", "report" and "three pm".
From these keywords, the electronic device automatically connects them in logical order to obtain the reduced target text: "The speaker has seen the file, which needs to be updated; a meeting is planned for three pm tomorrow to discuss the business changes, how to handle the report, and the next steps."
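For illustration, a crude extractive version of this reduction can be sketched in Python: score words by frequency after removing stop words, then keep only the sentences that contain a top keyword, in their original order. The stop-word list and keyword count are assumptions, and the behavior described above (connecting keywords into new fluent text in logical order) would additionally require a text generation step.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "to", "of", "and", "we", "i", "you",
                 "is", "are", "this", "that", "have", "also", "how", "or",
                 "do", "my", "me", "it", "at", "on", "in", "hi", "about"}

    def reduce_long_speech(text: str, max_keywords: int = 8) -> str:
        """Keyword-based reduction: keep sentences containing a top keyword."""
        sentences = [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in STOPWORDS]
        keywords = {w for w, _ in Counter(words).most_common(max_keywords)}
        kept = [s for s in sentences
                if keywords & set(re.findall(r"[a-z']+", s.lower()))]
        return ". ".join(kept) + "."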
By performing keyword-based reduction on long speech text, the embodiment of the application removes unimportant content from the voice call content, avoids the reading burden that a large amount of text places on the user, and improves the display efficiency of voice call content information.
In the voice call content display method provided by the embodiment of the application, the execution subject may be a voice call content display device. In the embodiment of the application, the voice call content display device provided by the embodiment of the application is described by taking as an example the voice call content display device performing the voice call content display method.
In some embodiments of the present application, a voice call content display apparatus is provided, fig. 5 shows a block diagram of the voice call content display apparatus according to an embodiment of the present application, and as shown in fig. 5, a voice call content display apparatus 500 includes:
a receiving module 502 for receiving a first input;
the processing module 504 is configured to perform a voice recognition process on the audio data in the voice call in response to the first input, so as to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and an output module 506, configured to output the content information, and the object information and emotion information corresponding to the content information.
When the call content is displayed, the embodiment of the application displays the object information corresponding to the content information, i.e. the object who spoke the corresponding words, and simultaneously displays the emotion information when the object spoke them; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
In some embodiments of the application, the audio data comprises a plurality of audio statements;
the processing module is specifically used for:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in a plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
By identifying the object information and emotion information corresponding to each sentence in the call audio, and displaying them at the same time as the text information corresponding to the call audio, the embodiment of the application can express the meaning of the call object more completely and accurately, which improves the display accuracy of the voice call content.
In some embodiments of the present application, the output module is specifically configured to:
when the number of the voice objects is plural, respectively outputting the plural voice objects and sentence information corresponding to the plural voice objects, and outputting emotion type corresponding to each sentence information.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
In some embodiments of the present application, the output module is specifically configured to:
when the number of emotion types is plural, speech object and sentence information corresponding to the speech object are output, and a weight ratio of each emotion type among the plural emotion types corresponding to the sentence information is output.
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
In some embodiments of the present application, the voice call content display apparatus further includes:
the determining module is used for determining keyword information in the text information under the condition that the text quantity of the text information is larger than a preset text quantity threshold value; determining a target text according to the keyword information;
and the output module is further used for outputting the target text, and the object information and emotion information corresponding to the content information.
According to the embodiment of the application, the keyword-based simplifying treatment is carried out on the long voice text, so that unimportant content in voice call content is removed, a large number of words are prevented from increasing reading difficulty for users, and the display efficiency of content information of voice calls is improved.
The voice call content display device in the embodiment of the application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palmtop computer, vehicle-mounted electronic device, mobile internet device (MID), augmented reality (AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (PDA), and may also be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine or self-service machine, etc.; the embodiments of the present application are not specifically limited in this respect.
The voice call content display device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The voice call content display device provided by the embodiment of the application can realize each process realized by the method embodiment, and in order to avoid repetition, the description is omitted.
Optionally, an embodiment of the present application further provides an electronic device, fig. 6 shows a block diagram of an electronic device according to an embodiment of the present application, as shown in fig. 6, where the electronic device 600 includes a processor 602, a memory 604, and a program or an instruction stored in the memory 604 and capable of running on the processor 602, where the program or the instruction is executed by the processor 602 to implement each process of the foregoing method embodiment, and the same technical effects are achieved, and are not repeated herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 700 includes, but is not limited to: radio frequency unit 701, network module 702, audio output unit 703, input unit 704, sensor 705, display unit 706, user input unit 707, interface unit 708, memory 709, and processor 710.
Those skilled in the art will appreciate that the electronic device 700 may also include a power source (e.g., a battery) for powering the various components; the power source may be logically connected to the processor 710 via a power management system, so that functions such as managing charging, discharging and power consumption are performed through the power management system. The structure shown in fig. 7 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently, which is not described in detail herein.
The processor 710 is configured to acquire audio data corresponding to a voice call, and to perform voice recognition processing on the audio data to determine content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
the display unit 706 is configured to display the content information, and the object information and emotion information corresponding to the content information.
When the conversation content is displayed, the embodiment of the application displays the object information corresponding to the content information, namely the object which is used for speaking the corresponding words, and simultaneously displays the emotion information when the object speaks the words, and the meaning of the conversation object can be expressed more completely and accurately by displaying the text information, the object information and the emotion information.
Optionally, the audio data comprises a plurality of audio sentences;
the processor 710 is further configured to perform voice feature recognition on the audio data to obtain object information, where the object information includes information of one or more voice objects, and each of the one or more voice objects is associated with a plurality of audio sentences of the plurality of audio sentences; carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences; and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
According to the embodiment of the application, the meaning of the call object can be expressed more completely and accurately by respectively identifying the object information and the emotion information corresponding to each sentence in the call audio and simultaneously displaying the object information and the emotion information when the text information corresponding to the call audio is displayed, so that the display accuracy of the voice call content is improved.
Optionally, the display unit 706 is further configured to, when the number of voice objects is plural, display a plurality of voice objects and sentence information corresponding to the plurality of voice objects, respectively, and display an emotion type corresponding to each sentence information in an associated manner.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
Optionally, the display unit 706 is further configured, in the case that the number of emotion types is plural, to display the voice object and the sentence information corresponding to the voice object, and to display the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
Optionally, the processor 710 is further configured to determine keyword information in the text information if the text number of the text information is greater than a preset text number threshold; determining a target text according to the keyword information;
the display unit 706 is also used to display object information and emotion information of the target text corresponding to the content information.
By performing keyword-based reduction on long speech text, the embodiment of the application removes unimportant content from the voice call content, avoids the reading burden that a large amount of text places on the user, and improves the display efficiency of voice call content information.
It should be appreciated that, in embodiments of the present application, the input unit 704 may include a graphics processing unit (GPU) 7041 and a microphone 7042; the graphics processor 7041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 706 may include a display panel 7061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 707 includes at least one of a touch panel 7071 and other input devices 7072. The touch panel 7071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. Other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse and a joystick, which are not described in detail herein.
The memory 709 may be used to store software programs as well as various data. The memory 709 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system, and application programs or instructions required for at least one function (such as a sound playing function and an image playing function), and the like. Furthermore, the memory 709 may include volatile memory or nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), or direct rambus RAM (DRRAM). The memory 709 in embodiments of the application includes, but is not limited to, these and any other suitable types of memory.
Processor 710 may include one or more processing units; optionally, processor 710 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, and the like, and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 710.
The embodiment of the application also provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the above method embodiment, and can achieve the same technical effects, so that repetition is avoided, and no further description is provided herein.
The processor is a processor in the electronic device in the above embodiment. Readable storage media include computer readable storage media such as Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disks, and the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running programs or instructions, the processes of the embodiment of the method can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
Embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the above method embodiments, and achieve the same technical effects, and for avoiding repetition, a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in part in the form of a computer software product stored on a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (10)

1. A voice call content display method, comprising:
receiving a first input;
in response to the first input, performing voice recognition processing on the audio data in a voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and outputting the content information and the object information and the emotion information corresponding to the content information.
2. The method of claim 1, wherein the audio data comprises a plurality of audio sentences;
the voice recognition processing is performed on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information, including:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in the plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in the plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain the content information, wherein the content information comprises statement information corresponding to the plurality of audio statements.
3. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
in the case that the number of the voice objects is plural, respectively outputting the plurality of voice objects and the sentence information corresponding to the plurality of voice objects, and outputting the emotion type corresponding to each piece of sentence information.
4. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
in the case that the number of emotion types is plural, outputting the voice object and the sentence information corresponding to the voice object, and outputting the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
5. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
determining keyword information in the content information under the condition that the text quantity of the content information is larger than a preset text quantity threshold value;
determining a target text according to the keyword information;
and outputting the target text, the object information and the emotion information corresponding to the content information.
6. A voice call content display apparatus, comprising:
a receiving module for receiving a first input;
the processing module is used for, in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and the output module is used for outputting the content information, the object information and the emotion information corresponding to the content information.
7. The apparatus of claim 6, wherein the audio data comprises a plurality of audio sentences;
the processing module is specifically configured to:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in the plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in the plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain the content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
8. The apparatus of claim 7, wherein the output module is specifically configured to:
in the case that the number of the voice objects is plural, respectively outputting the plurality of voice objects and the sentence information corresponding to the plurality of voice objects, and outputting the emotion type corresponding to each piece of sentence information.
9. The apparatus of claim 7, wherein the output module is specifically configured to:
in the case that the number of emotion types is plural, outputting the voice object and the sentence information corresponding to the voice object, and outputting the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
10. The apparatus as recited in claim 7, further comprising:
the determining module is used for determining keyword information in the content information under the condition that the text quantity of the content information is larger than a preset text quantity threshold value; and
determining a target text according to the keyword information;
the output module is further used for outputting the target text, the object information and the emotion information corresponding to the content information.
CN202310898438.5A 2023-07-20 2023-07-20 Voice call content display method and device Pending CN116939091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310898438.5A CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310898438.5A CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Publications (1)

Publication Number Publication Date
CN116939091A 2023-10-24

Family

ID=88380118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310898438.5A Pending CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Country Status (1)

Country Link
CN (1) CN116939091A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination