CN116939091A - Voice call content display method and device - Google Patents

Voice call content display method and device

Info

Publication number
CN116939091A
CN116939091A (application CN202310898438.5A)
Authority
CN
China
Prior art keywords
information
voice
emotion
content
content information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310898438.5A
Other languages
Chinese (zh)
Inventor
周丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202310898438.5A
Publication of CN116939091A
Legal status: Pending

Classifications

    • H: Electricity
        • H04: Electric communication technique
            • H04M: Telephonic communication
                • H04M 1/00: Substation equipment, e.g. for use by subscribers
                    • H04M 1/72: Mobile telephones; cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
                        • H04M 1/724: User interfaces specially adapted for cordless or mobile telephones
                            • H04M 1/72403: … with means for local support of applications that increase the functionality
                                • H04M 1/7243: … with interactive means for internal management of messages
                                    • H04M 1/72433: … for voice messaging, e.g. dictaphones
    • G: Physics
        • G10: Musical instruments; acoustics
            • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
                • G10L 15/00: Speech recognition
                    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
                    • G10L 15/08: Speech classification or search
                        • G10L 15/16: … using artificial neural networks
                    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 to G10L 21/00
                    • G10L 25/03: … characterised by the type of extracted parameters
                    • G10L 25/27: … characterised by the analysis technique
                        • G10L 25/30: … using neural networks
                    • G10L 25/48: … specially adapted for particular use
                        • G10L 25/51: … for comparison or discrimination
                            • G10L 25/63: … for estimating an emotional state

Abstract

The application discloses a voice call content display method and device, belonging to the field of communication technology. The voice call content display method comprises the following steps: receiving a first input; in response to the first input, performing voice recognition processing on audio data in a voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information; and outputting the content information together with the object information and emotion information corresponding to the content information.

Description

Voice call content display method and device
Technical Field
The application belongs to the technical field of communication, and particularly relates to a voice call content display method and device.
Background
In the related art, barrier-free (accessibility) technology can provide an assisted-communication function that mainly performs voice recognition on call speech to obtain and display text information.
However, an actual voice call can be complex: besides the textual content it also carries information such as intonation, and text alone cannot completely convey everything a party to the call means, so the displayed voice call content information is inaccurate.
Disclosure of Invention
The embodiment of the application aims to provide a voice call content display method and device, which can solve the problem of inaccurate display of voice call content information.
In a first aspect, an embodiment of the present application provides a method for displaying voice call content, including:
receiving a first input;
in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and outputting the content information and the object information and emotion information corresponding to the content information.
In a second aspect, an embodiment of the present application provides a voice call content display apparatus, including:
a receiving module for receiving a first input;
the processing module is used for, in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and the output module is used for outputting the content information, and the object information and the emotion information corresponding to the content information.
In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method as in the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method as in the first aspect.
In a fifth aspect, embodiments of the present application provide a chip comprising a processor and a communication interface coupled to the processor for running a program or instructions implementing the steps of the method as in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement a method as in the first aspect.
In the embodiment of the application, during a voice call, audio data of the call is collected in real time according to the first input of the user, and voice recognition processing is performed on the collected audio data to obtain the content information of the audio data, i.e. the speech content of the talker, together with the object information of the person who spoke each utterance and the emotion information at the time it was spoken.
When the call content is displayed, the object information corresponding to the content information is displayed at the same time as the content information itself, i.e. the object to whom each utterance belongs, together with the emotion information of that object when speaking; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
Drawings
Fig. 1 is a flowchart showing a voice call content display method according to an embodiment of the present application;
FIG. 2 illustrates one of the interface schematics of an electronic device according to an embodiment of the application;
FIG. 3 shows a second interface diagram of an electronic device according to an embodiment of the application;
FIG. 4 is a schematic diagram of a voice call content display setup interface according to an embodiment of the present application;
fig. 5 is a block diagram showing the structure of a voice call content display apparatus according to an embodiment of the present application;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and claims are used to distinguish between similar objects, and not necessarily to describe a particular sequence or chronological order. It is to be understood that terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be practiced in sequences other than those illustrated or described herein; objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited. For example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The method and the device for displaying the voice call content provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
In some embodiments of the present application, a voice call content display method is provided. Fig. 1 shows a flowchart of a voice call content display method according to an embodiment of the present application; as shown in fig. 1, the voice call content display method includes:
step 102, receiving a first input;
in the embodiment of the application, the first input can be the touch input of the user to the electronic equipment, and when the user performs voice call, the user starts the barrier-free call function through the first input, and at the moment, the electronic equipment can convert the audio signal in the call process into the image signal.
Step 104, responding to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
in the embodiment of the application, in the process of voice call through electronic equipment such as a mobile phone and the like, audio data in the call process is collected in real time, and the audio data comprises voice audio of an object which performs voice call with a current user.
After audio data including voice audio of an object to be voice-communicated with a current user is acquired, voice recognition processing is performed on the audio data, and corresponding content information, specifically, text information of words spoken by the user voice-communicated object, such as "good morning, good weather today, is acquired through the voice recognition processing.
The object information represents, for each sentence in the content information of the audio data, the object who spoke that sentence. For example, the object corresponding to each sentence may be identified by voiceprint recognition.
For example, when a sentence is detected, the voiceprint corresponding to the current sentence is identified and it is determined whether that voiceprint has been recorded. If it has not, new object information is created, such as "object A", and the current sentence is marked with it. If the voiceprint has already been recorded as "object B", the current sentence is marked directly as "object B".
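For illustration only, the following Python sketch shows one possible shape of this voiceprint marking flow. The SpeakerRegistry class, the cosine-similarity threshold of 0.75 and the 128-dimensional embeddings are hypothetical choices, not specified by this application, and producing a voiceprint embedding from an audio sentence is outside the sketch.

    import numpy as np

    class SpeakerRegistry:
        """Track the voiceprints heard so far in a call and label each
        sentence with the matching object ("object A", "object B", ...)."""

        def __init__(self, threshold: float = 0.75):
            self.threshold = threshold          # cosine-similarity cutoff (assumed)
            self.voiceprints: list[np.ndarray] = []
            self.labels: list[str] = []

        def identify(self, embedding: np.ndarray) -> str:
            embedding = embedding / np.linalg.norm(embedding)
            for voiceprint, label in zip(self.voiceprints, self.labels):
                if float(voiceprint @ embedding) >= self.threshold:
                    return label                # voiceprint already recorded
            label = f"object {chr(ord('A') + len(self.labels))}"
            self.voiceprints.append(embedding)  # record the new voiceprint
            self.labels.append(label)
            return label

    registry = SpeakerRegistry()
    rng = np.random.default_rng(0)
    first = rng.normal(size=128)
    print(registry.identify(first))                                     # object A
    print(registry.identify(first + rng.normal(scale=0.01, size=128)))  # object A
    print(registry.identify(rng.normal(size=128)))                      # object B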
The emotion information is predicted by a neural network model from the intonation and other features of each sentence in the audio data, and expresses the true emotion of the call object when speaking the sentence, such as happiness, anxiety or anger.
And step 106, outputting the content information, and the object information and emotion information corresponding to the content information.
In the related art, barrier-free technology can provide an assisted-communication function whose main role is to perform voice recognition on call speech, obtain text information and output it. The output mode includes displaying the text information, or converting it into other electrical signals that are transmitted to a tactile output device and output as tactile information.
However, during a voice call, the words alone cannot fully express information such as the emotion of the call objects. Moreover, when several call objects take part in the voice call, they are not distinguished: all recognized speech is simply converted into text and displayed, so a user relying on barrier-free communication finds it hard to tell the words of the different call objects apart, and cannot perceive the emotion of each object when speaking.
In order to solve the above problems, in the embodiment of the present application, during a voice call, audio data is collected in real time and voice recognition processing is performed on it to obtain the content information of the audio data, i.e. the speech content of the talker, together with the object information of the person who spoke and the emotion information at the time of speaking.
When the call content is displayed, the embodiment of the application displays the object information corresponding to the content information, i.e. the object who spoke the corresponding words, and simultaneously displays the emotion information when the object spoke them; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
In some embodiments of the application, the audio data comprises a plurality of audio statements;
performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information, wherein the method comprises the following steps:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in a plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
In the embodiment of the application, the audio data specifically includes a plurality of audio sentences, where each audio sentence corresponds to one sentence spoken by a call object in the voice call with the current user; a passage of speech spoken by a call object can be divided into multiple audio sentences at its natural sentence breaks, i.e. pauses, as sketched below.
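As an illustrative sketch of such pause-based segmentation (the 20 ms frames, the RMS energy threshold of 0.01 and the 400 ms minimum pause are assumed values; a production system would more likely use a trained voice activity detector):

    import numpy as np

    def split_into_sentences(samples: np.ndarray, sr: int = 16000,
                             frame_ms: int = 20, silence_thresh: float = 0.01,
                             min_pause_ms: int = 400) -> list[np.ndarray]:
        """Cut call audio into audio sentences at natural pauses: a frame is
        'silent' when its RMS energy is below silence_thresh, and a run of
        silent frames longer than min_pause_ms ends the current sentence."""
        frame = sr * frame_ms // 1000
        n_frames = len(samples) // frame
        rms = np.sqrt(np.mean(
            samples[:n_frames * frame].reshape(n_frames, frame) ** 2, axis=1))
        silent = rms < silence_thresh
        min_pause = min_pause_ms // frame_ms
        sentences, start, pause = [], None, 0
        for i, is_silent in enumerate(silent):
            if not is_silent:
                if start is None:
                    start = i                   # first voiced frame of a sentence
                pause = 0
            elif start is not None:
                pause += 1
                if pause >= min_pause:          # long pause: close the sentence
                    sentences.append(samples[start * frame:(i - pause + 1) * frame])
                    start, pause = None, 0
        if start is not None:                   # audio ended mid-sentence
            sentences.append(samples[start * frame:])
        return sentences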
When performing voice recognition processing on the audio data, a sound feature recognition algorithm identifies the sound features, such as voiceprint features, corresponding to each sentence in the audio data, and from these features the object information corresponding to each audio sentence is obtained.
Specifically, the sound feature information of each audio sentence is detected separately, the identity of the object speaking the current sentence is identified from that information, and it is determined whether the object has already been recorded. If not, new object information is created, such as "object A", which is the voice object corresponding to the current audio sentence. If the object has already been recorded, e.g. as "object B", then "object B" is the voice object corresponding to the current audio sentence.
When performing emotion feature recognition on the audio data, an emotion recognition model, such as a neural network model based on a GRNN (Generalized Regression Neural Network), predicts the emotion of the call object when speaking each sentence, such as "happy", "anxious" or "angry", from features of each audio sentence such as speech speed and intonation.
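For illustration, a toy GRNN of this kind can be written directly: its prediction is a Gaussian-kernel-weighted average of one-hot training labels. The two prosodic features (normalized speech speed and mean pitch), the smoothing width sigma, and the tiny training set below are hypothetical; a deployed model would be trained on a labelled emotional-speech corpus.

    import numpy as np

    EMOTIONS = ["happy", "anxious", "angry", "neutral"]

    class GRNNEmotionClassifier:
        """Toy GRNN over prosodic features. The GRNN prediction is
            y(x) = sum_i t_i * K(x, x_i) / sum_i K(x, x_i)
        with a Gaussian kernel K; the targets t_i are one-hot emotion
        labels, so y(x) behaves like a class-probability vector."""

        def __init__(self, sigma: float = 0.5):
            self.sigma = sigma                  # kernel width (assumed)

        def fit(self, features: np.ndarray, labels: list[str]):
            self.x = features
            self.t = np.eye(len(EMOTIONS))[[EMOTIONS.index(l) for l in labels]]
            return self

        def predict(self, x: np.ndarray) -> str:
            d2 = np.sum((self.x - x) ** 2, axis=1)      # squared distances
            w = np.exp(-d2 / (2 * self.sigma ** 2))     # Gaussian kernel weights
            probs = w @ self.t / np.sum(w)
            return EMOTIONS[int(np.argmax(probs))]

    # features: [normalized speech speed, normalized mean pitch] (illustrative)
    train_x = np.array([[0.9, 0.8], [0.4, 0.3], [0.95, 0.2], [0.5, 0.5]])
    train_y = ["happy", "anxious", "angry", "neutral"]
    clf = GRNNEmotionClassifier().fit(train_x, train_y)
    print(clf.predict(np.array([0.85, 0.75])))  # -> "happy"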
After the object information and emotion information are obtained, voice-to-text processing is performed on the audio data by a voice recognition algorithm, such as one based on a hidden Markov model or a Gaussian mixture model; the resulting content information includes the sentence information corresponding to the audio sentences, such as "The weather is nice today" or "It is raining outside".
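As one concrete possibility (an assumption for illustration, not a requirement of this application), the third-party SpeechRecognition package for Python can perform the voice-to-text step on a recorded audio sentence:

    import speech_recognition as sr

    def sentence_to_text(wav_path: str, language: str = "zh-CN") -> str:
        """Transcribe one audio sentence stored as a WAV file."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)   # read the whole sentence
        try:
            return recognizer.recognize_google(audio, language=language)
        except sr.UnknownValueError:
            return ""                           # unintelligible speech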
By identifying the object information and emotion information corresponding to each sentence in the call audio, and displaying them at the same time as the text information corresponding to the call audio, the embodiment of the application can express the meaning of the call object more completely and accurately, which improves the accuracy of voice call content information display.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
when the number of the voice objects is plural, respectively outputting the plural voice objects and sentence information corresponding to the plural voice objects, and outputting emotion type corresponding to each sentence information.
In the embodiment of the application, if the number of the voice objects is multiple, namely when a multi-person speaking scene is detected, each voice object and statement information corresponding to each voice object are respectively displayed when the statement information is displayed.
For example, the number of voice objects is two, specifically an object A and an object B, and object A and object B are both in the voice call with the current user at the same time.
When the object A speaks, the object information of the object A is displayed in the conversation interface, and the sentence information of the object A is displayed near the object information of the object A.
When the object B speaks, the object information of the object B is displayed in the call interface, and the sentence information of the object B is displayed near the object information of the object B.
When object A and object B speak simultaneously, the object information of both is displayed in the call interface; the sentence information of object A is displayed near the object information of object A, and the sentence information of object B is displayed near the object information of object B.
Illustratively, fig. 2 shows one of interface diagrams of an electronic device according to an embodiment of the present application, and as shown in fig. 2, a call interface 200 displays text information 202, object information 204, and emotion information 206.
The object information includes an object a and an object B, and sentence information 2022 corresponding to the object a and sentence information 2024 corresponding to the object B are displayed, respectively.
At the same time, in the vicinity of the sentence information 2022 corresponding to the object a, the emotion information 2062 corresponding to the sentence information 2022 is displayed, and in the vicinity of the sentence information 2024 corresponding to the object B, the emotion information 2064 corresponding to the sentence information 2024 is displayed.
It can be appreciated that, in some embodiments, the gender of each voice call object may also be identified from characteristics such as pitch when the object speaks, for example identifying a male call object A and a female call object B, which further helps the user distinguish different call objects and accurately understand the call content.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
when the number of emotion types is plural, speech object and sentence information corresponding to the speech object are output, and a weight ratio of each emotion type among the plural emotion types corresponding to the sentence information is output.
In the embodiment of the application, in some scenes the emotion of the call object may change while speaking. For the case where several emotion types are involved, when the emotion information of the voice call object is identified through the emotion recognition model, the emotion types contained in a sentence or passage and the weight ratio corresponding to each emotion type can be identified respectively.
For example, fig. 3 shows a second interface schematic diagram of the electronic device according to the embodiment of the application, and as shown in fig. 3, a text message 302, an object message 304, and an emotion message 306 are displayed in a call interface 300.
The text information 302 includes the sentence information: "The weather is nice today, but I still feel a bit anxious", where the emotion information corresponding to "the weather is nice today" is happy, and the emotion information corresponding to "a bit anxious" is anxious.
When two different pieces of emotion information are determined, the weight ratio of each is determined from the number of clauses corresponding to each emotion; for example, the emotion information 3062 of "The weather is nice today" is happy with a weight ratio of 50%, and the emotion information 3064 of "but I still feel a bit anxious" is anxious with a weight ratio of 50%. A sketch of this counting rule follows.
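For illustration, the counting rule used in this example (each emotion's weight ratio is the fraction of clauses tagged with that emotion) is a few lines of Python; the per-clause tags themselves come from the emotion recognition model described earlier:

    from collections import Counter

    def emotion_weights(clause_emotions: list[str]) -> dict[str, float]:
        """Weight ratio of each emotion type, proportional to the number
        of clauses tagged with that emotion."""
        counts = Counter(clause_emotions)
        total = sum(counts.values())
        return {emotion: n / total for emotion, n in counts.items()}

    print(emotion_weights(["happy", "anxious"]))
    # {'happy': 0.5, 'anxious': 0.5}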
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
In some embodiments of the present application, outputting content information and object information and emotion information corresponding to the content information includes:
determining keyword information in the text information in the case that the amount of text in the text information is greater than a preset threshold;
determining a target text according to the keyword information;
and outputting the target text, and the object information and emotion information corresponding to the content information.
In some situations, the voice call object may speak a long continuous passage that contains important key information but also unimportant words such as filler words, greetings or polite phrases.
If the entire content of the long speech is converted into text for display, it increases the user's reading burden. For this case, the user may manually turn on the long voice reduction function in the setting interface.
Fig. 4 is a schematic diagram of a voice call content display setting interface according to an embodiment of the present application. As shown in fig. 4, the setting interface 400 includes a function indicator 402. When the user has turned on the long voice reduction function and it is detected that the amount of text in the text information corresponding to the call audio data is greater than the preset threshold, keyword information is extracted from the text information, and a reduced target text is generated from the keyword information and displayed.
For example, the call object speaks a long passage: "Hi! I have seen the file you gave me, and some of the content needs to be updated. I would like to meet you tomorrow to talk about the latest business changes and determine the next steps. We also need to discuss how to handle this report next. The meeting time is three pm tomorrow; do you think this schedule is suitable, or do you have other suggestions?"
When the long voice reduction function is turned on, the electronic device automatically extracts keyword information from the text of the long speech, including "seen", "file", "needs to be updated", "tomorrow", "meet", "business changes", "report" and "three pm".
From these keywords, the electronic device automatically connects them in logical order to obtain the reduced target text: "The speaker has seen the file, which needs to be updated; a meeting is planned for three pm tomorrow to discuss the business changes, how to handle the report, and the next steps."
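For illustration, a crude extractive version of this reduction can be sketched in Python: score words by frequency after removing stop words, then keep only the sentences that contain a top keyword, in their original order. The stop-word list and keyword count are assumptions, and the behavior described above (connecting keywords into new fluent text in logical order) would additionally require a text generation step.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "to", "of", "and", "we", "i", "you",
                 "is", "are", "this", "that", "have", "also", "how", "or",
                 "do", "my", "me", "it", "at", "on", "in", "hi", "about"}

    def reduce_long_speech(text: str, max_keywords: int = 8) -> str:
        """Keyword-based reduction: keep sentences containing a top keyword."""
        sentences = [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in STOPWORDS]
        keywords = {w for w, _ in Counter(words).most_common(max_keywords)}
        kept = [s for s in sentences
                if keywords & set(re.findall(r"[a-z']+", s.lower()))]
        return ". ".join(kept) + "."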
By performing keyword-based reduction on long speech text, the embodiment of the application removes unimportant content from the voice call content, avoids the reading burden that a large amount of text places on the user, and improves the display efficiency of voice call content information.
In the voice call content display method provided by the embodiment of the application, the execution subject may be a voice call content display device. In the embodiment of the application, the voice call content display device provided by the embodiment of the application is described by taking as an example the voice call content display device performing the voice call content display method.
In some embodiments of the present application, a voice call content display apparatus is provided, fig. 5 shows a block diagram of the voice call content display apparatus according to an embodiment of the present application, and as shown in fig. 5, a voice call content display apparatus 500 includes:
a receiving module 502 for receiving a first input;
the processing module 504 is configured to perform a voice recognition process on the audio data in the voice call in response to the first input, so as to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and an output module 506, configured to output the content information, and the object information and emotion information corresponding to the content information.
When the call content is displayed, the embodiment of the application displays the object information corresponding to the content information, i.e. the object who spoke the corresponding words, and simultaneously displays the emotion information when the object spoke them; by displaying the text information, the object information and the emotion information together, the meaning of the call object can be expressed more completely and accurately.
In some embodiments of the application, the audio data comprises a plurality of audio statements;
the processing module is specifically used for:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in a plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
By identifying the object information and emotion information corresponding to each sentence in the call audio, and displaying them at the same time as the text information corresponding to the call audio, the embodiment of the application can express the meaning of the call object more completely and accurately, which improves the display accuracy of the voice call content.
In some embodiments of the present application, the output module is specifically configured to:
when the number of the voice objects is plural, respectively outputting the plural voice objects and sentence information corresponding to the plural voice objects, and outputting emotion type corresponding to each sentence information.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
In some embodiments of the present application, the output module is specifically configured to:
when the number of emotion types is plural, speech object and sentence information corresponding to the speech object are output, and a weight ratio of each emotion type among the plural emotion types corresponding to the sentence information is output.
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
In some embodiments of the present application, the voice call content display apparatus further includes:
the determining module is used for determining keyword information in the text information under the condition that the text quantity of the text information is larger than a preset text quantity threshold value; determining a target text according to the keyword information;
and the output module is further used for outputting the target text, and the object information and emotion information corresponding to the content information.
According to the embodiment of the application, the keyword-based simplifying treatment is carried out on the long voice text, so that unimportant content in voice call content is removed, a large number of words are prevented from increasing reading difficulty for users, and the display efficiency of content information of voice calls is improved.
The voice call content display device in the embodiment of the application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palmtop computer, vehicle-mounted electronic device, mobile internet device (MID), augmented reality (AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (PDA), and may also be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine or self-service machine, etc.; the embodiments of the present application are not specifically limited in this respect.
The voice call content display device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The voice call content display device provided by the embodiment of the application can realize each process realized by the method embodiment, and in order to avoid repetition, the description is omitted.
Optionally, an embodiment of the present application further provides an electronic device, fig. 6 shows a block diagram of an electronic device according to an embodiment of the present application, as shown in fig. 6, where the electronic device 600 includes a processor 602, a memory 604, and a program or an instruction stored in the memory 604 and capable of running on the processor 602, where the program or the instruction is executed by the processor 602 to implement each process of the foregoing method embodiment, and the same technical effects are achieved, and are not repeated herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 700 includes, but is not limited to: radio frequency unit 701, network module 702, audio output unit 703, input unit 704, sensor 705, display unit 706, user input unit 707, interface unit 708, memory 709, and processor 710.
Those skilled in the art will appreciate that the electronic device 700 may also include a power source (e.g., a battery) for powering the various components; the power source may be logically connected to the processor 710 via a power management system, so that functions such as managing charging, discharging and power consumption are performed through the power management system. The structure shown in fig. 7 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently, which is not described in detail herein.
The processor 710 is configured to acquire audio data corresponding to a voice call, and to perform voice recognition processing on the audio data to determine content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
the display unit 706 is configured to display the content information, and the object information and emotion information corresponding to the content information.
When the conversation content is displayed, the embodiment of the application displays the object information corresponding to the content information, namely the object which is used for speaking the corresponding words, and simultaneously displays the emotion information when the object speaks the words, and the meaning of the conversation object can be expressed more completely and accurately by displaying the text information, the object information and the emotion information.
Optionally, the audio data comprises a plurality of audio sentences;
the processor 710 is further configured to perform voice feature recognition on the audio data to obtain object information, where the object information includes information of one or more voice objects, and each of the one or more voice objects is associated with a plurality of audio sentences of the plurality of audio sentences; carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in a plurality of audio sentences; and performing voice-to-text processing on the audio data to obtain content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
According to the embodiment of the application, the meaning of the call object can be expressed more completely and accurately by respectively identifying the object information and the emotion information corresponding to each sentence in the call audio and simultaneously displaying the object information and the emotion information when the text information corresponding to the call audio is displayed, so that the display accuracy of the voice call content is improved.
Optionally, the display unit 706 is further configured to, when the number of voice objects is plural, display a plurality of voice objects and sentence information corresponding to the plurality of voice objects, respectively, and display an emotion type corresponding to each sentence information in an associated manner.
According to the embodiment of the application, when the text content of the voice call is displayed, sentence information of different call objects can be distinguished and displayed, so that a user watching the content information of the voice call can more accurately grasp the content of the call object, and the display efficiency of the content information of the voice call is improved.
Optionally, the display unit 706 is further configured, in the case that the number of emotion types is plural, to display the voice object and the sentence information corresponding to the voice object, and to display the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
According to the embodiment of the application, when the text content of the voice call is displayed, when a certain sentence information contains a plurality of emotion information, each emotion information corresponding to the sentence information and the weight ratio corresponding to each emotion information are respectively displayed, so that a user watching the content information of the voice call can be helped to grasp the emotion of a call object more accurately, and the display effect of the content information of the voice call is improved.
Optionally, the processor 710 is further configured to determine keyword information in the text information if the text number of the text information is greater than a preset text number threshold; determining a target text according to the keyword information;
the display unit 706 is also used to display object information and emotion information of the target text corresponding to the content information.
By performing keyword-based reduction on long speech text, the embodiment of the application removes unimportant content from the voice call content, avoids the reading burden that a large amount of text places on the user, and improves the display efficiency of voice call content information.
It should be appreciated that, in embodiments of the present application, the input unit 704 may include a graphics processing unit (GPU) 7041 and a microphone 7042; the graphics processor 7041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 706 may include a display panel 7061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 707 includes at least one of a touch panel 7071 and other input devices 7072. The touch panel 7071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. Other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse and a joystick, which are not described in detail herein.
The memory 709 may be used to store software programs as well as various data. The memory 709 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system, and application programs or instructions required for at least one function (such as a sound playing function and an image playing function), and the like. Furthermore, the memory 709 may include volatile memory or nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), or direct rambus RAM (DRRAM). The memory 709 in embodiments of the application includes, but is not limited to, these and any other suitable types of memory.
Processor 710 may include one or more processing units; optionally, processor 710 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, and the like, and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 710.
The embodiment of the application also provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the above method embodiment, and can achieve the same technical effects, so that repetition is avoided, and no further description is provided herein.
The processor is a processor in the electronic device in the above embodiment. Readable storage media include computer readable storage media such as Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disks, and the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running programs or instructions, the processes of the embodiment of the method can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
Embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the above method embodiments, and achieve the same technical effects, and for avoiding repetition, a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in part in the form of a computer software product stored on a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (10)

1. A voice call content display method, comprising:
receiving a first input;
in response to the first input, performing voice recognition processing on the audio data in a voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and outputting the content information and the object information and the emotion information corresponding to the content information.
2. The method of claim 1, wherein the audio data comprises a plurality of audio sentences;
the voice recognition processing is performed on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information, including:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in the plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in the plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain the content information, wherein the content information comprises statement information corresponding to the plurality of audio statements.
3. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
in the case that the number of the voice objects is plural, respectively outputting the plurality of voice objects and the sentence information corresponding to the plurality of voice objects, and outputting the emotion type corresponding to each piece of sentence information.
4. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
in the case that the number of emotion types is plural, outputting the voice object and the sentence information corresponding to the voice object, and outputting the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
5. The method according to claim 2, wherein the outputting the content information and the object information and the emotion information corresponding to the content information includes:
determining keyword information in the content information under the condition that the text quantity of the content information is larger than a preset text quantity threshold value;
determining a target text according to the keyword information;
and outputting the target text, the object information and the emotion information corresponding to the content information.
6. A voice call content display apparatus, comprising:
a receiving module for receiving a first input;
the processing module is used for, in response to the first input, performing voice recognition processing on the audio data in the voice call to obtain content information corresponding to the audio data, and object information and emotion information corresponding to the content information;
and the output module is used for outputting the content information, the object information and the emotion information corresponding to the content information.
7. The apparatus of claim 6, wherein the audio data comprises a plurality of audio sentences;
the processing module is specifically configured to:
performing voice feature recognition on the audio data to obtain object information, wherein the object information comprises information of one or more voice objects, and each voice object in the one or more voice objects is associated with at least one audio sentence in the plurality of audio sentences;
carrying out emotion feature recognition on the audio data to obtain emotion information, wherein the emotion information comprises emotion types corresponding to each sentence in the plurality of audio sentences;
and performing voice-to-text processing on the audio data to obtain the content information, wherein the content information comprises statement information corresponding to a plurality of audio statements.
8. The apparatus of claim 7, wherein the output module is specifically configured to:
in the case that the number of the voice objects is plural, respectively outputting the plurality of voice objects and the sentence information corresponding to the plurality of voice objects, and outputting the emotion type corresponding to each piece of sentence information.
9. The apparatus of claim 7, wherein the output module is specifically configured to:
in the case that the number of emotion types is plural, outputting the voice object and the sentence information corresponding to the voice object, and outputting the weight ratio of each emotion type among the plurality of emotion types corresponding to the sentence information.
10. The apparatus as recited in claim 7, further comprising:
the determining module is used for determining keyword information in the content information under the condition that the text quantity of the content information is larger than a preset text quantity threshold value; and
determining a target text according to the keyword information;
the output module is further used for outputting the target text, the object information and the emotion information corresponding to the content information.
CN202310898438.5A 2023-07-20 2023-07-20 Voice call content display method and device Pending CN116939091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310898438.5A CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310898438.5A CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Publications (1)

Publication Number Publication Date
CN116939091A 2023-10-24

Family

ID=88380118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310898438.5A Pending CN116939091A (en) 2023-07-20 2023-07-20 Voice call content display method and device

Country Status (1)

Country Link
CN (1) CN116939091A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination