CN110880330A - Audio conversion method and terminal equipment - Google Patents

Audio conversion method and terminal equipment

Info

Publication number
CN110880330A
Authority
CN
China
Prior art keywords
call
voice
target
event
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911033600.7A
Other languages
Chinese (zh)
Inventor
刘秋菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN201911033600.7A
Publication of CN110880330A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces with interactive means for internal management of messages
    • H04M1/72433 User interfaces with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H04M1/72469 User interfaces for operating the device by selecting functions from two or more displayed items, e.g. menus or icons
    • H04M1/72484 User interfaces wherein functions are triggered by incoming communication events

Abstract

An embodiment of the invention provides an audio conversion method and a terminal device, applied to the field of communication technology and intended to solve the problem in the related art that important call content is missed because the user does not operate in time. The method includes the following steps: acquiring a target speech emotion feature of a first call voice during a call; and, when the target speech emotion feature matches a predetermined speech emotion feature, storing a second call voice of the call and performing semantic analysis on the second call voice to obtain an audio text of the second call voice; wherein the second call voice is the call audio after a target call time, and the target call time is a predetermined time before the first call voice.

Description

Audio conversion method and terminal equipment
Technical Field
Embodiments of the present invention relate to the field of communication technology, and in particular to an audio conversion method and a terminal device.
Background
With the development of terminal device technology, users rely on terminal devices more and more often, and a user making a call may need to record important information in real time.
In the related art, when a user wants to record important information during a call, the user has to manually start a recording function in the middle of the call; only by saving the call recording and repeatedly playing it back after the call ends can the user retrieve the important information from it.
However, when the recording function is started manually during a call, important call content is likely to go unrecorded simply because the user does not operate in time, and the content is therefore missed.
Disclosure of Invention
Embodiments of the present invention provide an audio conversion method and a terminal device, aiming to solve the problem in the related art that important call content is missed because the user does not operate in time.
In order to solve the technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present invention provides an audio conversion method, the method including: acquiring a target speech emotion feature of a first call voice during a call; and, when the target speech emotion feature matches a predetermined speech emotion feature, storing a second call voice of the call and performing semantic analysis on the second call voice to obtain an audio text of the second call voice; wherein the second call voice is the call audio after a target call time, and the target call time is a predetermined time before the first call voice.
In a second aspect, an embodiment of the present invention further provides a terminal device, the terminal device including: an obtaining module, configured to obtain a target speech emotion feature of a first call voice during a call; a storage module, configured to store a second call voice of the call when the target speech emotion feature obtained by the obtaining module matches a predetermined speech emotion feature; and an analysis module, configured to perform semantic analysis on the second call voice stored by the storage module to obtain an audio text of the second call voice; wherein the second call voice is the call audio after a target call time, and the target call time is a predetermined time before the first call voice.
In a third aspect, an embodiment of the present invention provides a terminal device, including a processor, a memory, and a computer program stored on the memory and operable on the processor, where the computer program, when executed by the processor, implements the steps of the audio conversion method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the audio conversion method according to the first aspect.
In the embodiment of the present invention, because the speech emotion feature of a voice can represent an emotion change of the user who utters it, and an important event is usually stated within a certain time before the moment the user's emotion changes, the terminal device determines whether the second call voice is important call voice by detecting whether the target speech emotion feature of the first call voice matches the predetermined speech emotion feature during the user's call, the second call voice being the call audio after a predetermined time before the first call voice. In this way, when the terminal device detects that the user's emotion has changed, it can store the second call voice automatically, without manual operation, and at the same time perform semantic analysis on the second call voice to obtain its audio text, so that the user can learn the important call content stated in the second call voice directly from the audio text or the voice itself and is prevented from missing that content.
Drawings
Fig. 1 is a schematic diagram of an architecture of a possible android operating system according to an embodiment of the present invention;
Fig. 2 is a first schematic flowchart of an audio conversion method according to an embodiment of the present invention;
Fig. 3 is a second schematic flowchart of an audio conversion method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention;
Fig. 5 is a hardware schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that "/" in this document means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
It should be noted that "a plurality" herein means two or more than two.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present invention should not be construed as preferred or advantageous over other embodiments or designs; rather, such words are intended to present a related concept in a concrete fashion.
It should be noted that, for the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, words such as "first" and "second" are used to distinguish the same items or similar items with substantially the same functions or actions, and those skilled in the art can understand that the words such as "first" and "second" do not limit the quantity and execution order. For example, the first call voice and the second call voice are for distinguishing different call voices, not for describing a specific order of the call voices.
The audio conversion method provided in the embodiments of the present invention may be executed by a terminal device (including a mobile terminal device and a non-mobile terminal device), or by a functional module and/or functional entity in the terminal device capable of implementing the method; this may be determined according to actual use requirements and is not limited by the embodiments of the present invention. The following takes a terminal device as the example to describe the audio conversion method provided by the embodiments of the present invention.
The terminal device in the embodiment of the invention can be a mobile terminal device and can also be a non-mobile terminal device. The mobile terminal device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), etc.; the non-mobile terminal device may be a Personal Computer (PC), a Television (TV), a teller machine, a self-service machine, or the like; the embodiments of the present invention are not particularly limited.
The terminal device in the embodiments of the present invention may be a terminal device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present invention are not specifically limited in this regard.
The following describes a software environment applied to the audio conversion method provided by the embodiment of the present invention, taking an android operating system as an example.
Fig. 1 is a schematic diagram of an architecture of a possible android operating system according to an embodiment of the present invention. In fig. 1, the architecture of the android operating system includes 4 layers, which are respectively: an application layer, an application framework layer, a system runtime layer, and a kernel layer (specifically, a Linux kernel layer).
The application program layer comprises various application programs (including system application programs and third-party application programs) in an android operating system.
The application framework layer is the framework of applications; developers can develop applications based on the application framework layer, provided that they comply with its development principles.
The system runtime layer includes libraries (also called system libraries) and android operating system runtime environments. The library mainly provides various resources required by the android operating system. The android operating system running environment is used for providing a software environment for the android operating system.
The kernel layer is an operating system layer of an android operating system and belongs to the bottommost layer of an android operating system software layer. The kernel layer provides kernel system services and hardware-related drivers for the android operating system based on the Linux kernel.
Taking an android operating system as an example, in the embodiment of the present invention, a developer may develop a software program for implementing the audio conversion method provided in the embodiment of the present invention based on the system architecture of the android operating system shown in fig. 1, so that the audio conversion method may operate based on the android operating system shown in fig. 1. Namely, the processor or the terminal device can implement the audio conversion method provided by the embodiment of the invention by running the software program in the android operating system.
The following describes an audio conversion method according to an embodiment of the present invention with reference to the flowchart shown in fig. 2. Fig. 2 is a schematic flowchart of an audio conversion method provided by an embodiment of the present invention, and the method includes steps 201 and 202:
step 201: the terminal equipment acquires the target voice emotional characteristics of the first call voice in the call process.
In this embodiment of the present invention, the first call voice may be all call voices in the call process, or may be a call voice in a certain time period in the call process, which is not limited in this embodiment of the present invention.
In the embodiment of the present invention, during a voice call the terminal device may monitor the user's call voice in real time or at a first predetermined time interval; this is not limited in the embodiment of the present invention. While monitoring the call voice, the terminal device may detect the speech emotion feature of the current call voice in real time or at a second predetermined time interval; this is likewise not limited in the embodiment of the present invention.
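Illustratively, such interval-based monitoring might be sketched as follows; the function name, feature definitions, and frame parameters below are assumptions for illustration only, not the implementation of this embodiment:
```python
import numpy as np

SAMPLE_RATE = 16000       # assumed audio sampling rate
FRAME_SECONDS = 2.0       # assumed "second predetermined time interval"

def extract_emotion_features(frame: np.ndarray) -> dict:
    """Crude stand-ins for the volume / pitch / rhythm features of one frame."""
    volume = float(np.sqrt(np.mean(frame ** 2)))                       # RMS loudness
    pitch_proxy = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
    # Spread of energy across 100 ms sub-frames as a rough rhythm indicator
    sub_len = SAMPLE_RATE // 10
    sub = frame[: len(frame) - len(frame) % sub_len].reshape(-1, sub_len)
    rhythm = float(np.std(np.sqrt(np.mean(sub ** 2, axis=1))))
    return {"volume": volume, "pitch_proxy": pitch_proxy, "rhythm": rhythm}

# One 2-second frame of (here random) call audio, monitored per interval
frame = np.random.randn(int(SAMPLE_RATE * FRAME_SECONDS)).astype(np.float32)
print(extract_emotion_features(frame))
```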
Step 202: when the target speech emotion feature matches a predetermined speech emotion feature, the terminal device stores the second call voice of the call and performs semantic analysis on the second call voice to obtain the audio text of the second call voice.
In the embodiment of the present invention, the second call voice is the call audio after a target call time, where the target call time is a predetermined time before the first call voice.
Optionally, in the embodiment of the present invention, the target speech emotion feature is used to characterize the user emotion of the call user who utters the first call voice.
Illustratively, the target speech emotion feature includes at least one of the following audio features of the first call voice: an audio feature representing intonation, an audio feature representing speech speed, an audio feature representing rhythm, and an audio feature representing volume. Generally, characteristics such as the pitch, volume, rhythm, and speed of a voice reflect the personal emotion of the user who utters it.
Illustratively, when a user is angry, the speaking voice becomes louder, the pitch rises, and the speech speed stays the same or quickens; when a user is sad, speech slows down, the pitch drops, and the voice becomes quieter; when a user is happy, the voice rises and falls expressively, the speech speed is brisk, and the volume stays steady; when the user's emotion is neutral, the voice is steady and the speech speed and pitch remain unchanged; when a user is nervous or agitated, speech is fast, the voice is low, and the rhythm becomes irregular. Therefore, in the common case where a user starts a call in a neutral or happy mood, the terminal device can determine the trend of the user's emotion by detecting the intonation of the user's speech, and can record the key event in the voice according to the detected moment at which the user's emotion changes.
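Illustratively, these heuristics can be pictured as a small rule table. The toy sketch below maps changes in volume, pitch, and speech speed relative to the call's opening baseline to a coarse emotion label; the thresholds and labels are illustrative assumptions, not the classifier of this embodiment:
```python
def classify_emotion(d_volume: float, d_pitch: float, d_speed: float) -> str:
    """Deltas are the current feature value minus the call's baseline value."""
    if d_volume > 0.2 and d_pitch > 0.2:
        return "angry"       # louder voice, higher pitch
    if d_speed < -0.2 and d_pitch < -0.1 and d_volume < -0.1:
        return "sad"         # slower, lower, quieter
    if d_speed > 0.2 and d_pitch < -0.1:
        return "nervous"     # fast speech, low voice
    return "neutral"         # steady voice, unchanged speed and pitch

print(classify_emotion(0.3, 0.25, 0.0))   # -> "angry"
```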
Optionally, in the embodiment of the present invention, the predetermined speech emotion feature is used to characterize the speech emotion feature of a specific user emotion. For example, the predetermined speech emotion feature may be at least one speech emotion feature preset in the terminal device or in a predetermined database, with each predetermined speech emotion feature corresponding to one user emotion.
Optionally, in the embodiment of the present invention, the target speech emotion feature matches the predetermined speech emotion feature when the degree of similarity between the two is greater than or equal to a predetermined threshold (e.g., 80%).
For example, after acquiring the target speech emotion feature of the first call voice, the terminal device may use the target speech emotion feature as an index to search a speech emotion feature library (e.g., the FAU AIBO children's emotion database) for a predetermined speech emotion feature that matches it. The speech emotion feature library includes at least one predetermined speech emotion feature.
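Illustratively, a minimal sketch of this matching rule, assuming cosine similarity over hypothetical feature vectors and the 80% threshold of the example above, might be:
```python
import numpy as np

MATCH_THRESHOLD = 0.8   # the "80%" similarity threshold from the example

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_matching_emotion(target: np.ndarray,
                          library: dict[str, np.ndarray]) -> str | None:
    """Return the emotion label of the best-matching predetermined feature,
    or None if no similarity reaches the threshold."""
    best_label, best_score = None, MATCH_THRESHOLD
    for label, feature in library.items():
        score = cosine_similarity(target, feature)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label

library = {"angry": np.array([0.9, 0.8, 0.3]), "sad": np.array([0.1, 0.2, 0.1])}
print(find_matching_emotion(np.array([0.85, 0.75, 0.35]), library))  # -> "angry"
```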
Optionally, in the embodiment of the present invention, the first call voice includes at least one call time and at least one first speech emotion feature, each of the at least one first speech emotion feature corresponds to one call time, and the at least one first speech emotion feature includes the target speech emotion feature; the target call time is a predetermined time before the call time corresponding to the target speech emotion feature.
For example, assume the first call voice spans 3 call moments (e.g., T1, T2, T3) and the speech emotion feature corresponding to each moment is extracted: feature 1 at T1, feature 2 at T2, and feature 3 at T3. After acquiring features 1, 2, and 3, the terminal device matches each against the predetermined speech emotion feature; if feature 2 matches, the moment T2 corresponding to feature 2 determines the target call time. That is, the terminal device stores the call voice from the predetermined time before T2 onward.
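Illustratively, one plausible way to keep "the call voice from the predetermined time before T2 onward" without recording the whole call is a rolling buffer, as in the sketch below; the class, chunking scheme, and constants are assumptions for illustration:
```python
from collections import deque

PREDETERMINED_SECONDS = 10   # assumed "predetermined time" before the matched moment
CHUNKS_PER_SECOND = 1        # assume one buffered audio chunk per second

class CallRecorder:
    def __init__(self):
        # Rolling window always holding the last PREDETERMINED_SECONDS of audio
        self.rolling = deque(maxlen=PREDETERMINED_SECONDS * CHUNKS_PER_SECOND)
        self.saved = []          # becomes the stored "second call voice"
        self.triggered = False

    def on_audio_chunk(self, chunk, emotion_matched: bool):
        if self.triggered:
            self.saved.append(chunk)           # keep recording after the trigger
        elif emotion_matched:                  # feature matched at moment T2
            self.triggered = True
            self.saved = list(self.rolling)    # audio since (T2 - predetermined time)
            self.saved.append(chunk)
        else:
            self.rolling.append(chunk)
```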
Optionally, in the embodiment of the present invention, the audio text of the second call voice includes at least one of the following: event information of a target event, and identity information of the opposite-end call user.
Further optionally, in the embodiment of the present invention, the event information of the target event includes at least one of the following: keywords of the target event, the occurrence time of the target event, and the statement content of the opposite-end call user concerning the target event (e.g., the opposite-end call user's opinion, idea, suggestion, or response to the target event).
Optionally, in the embodiment of the present invention, after acquiring the audio text of the second call voice, the terminal device performs a related search using a keyword in the audio text as an index and displays the search result in the first interface. For example, if the audio text of a call voice records that there is a power failure at the home of the opposite-end call user, then, given the need to restore power quickly, the terminal device may display in the first interface the stored address-book contacts containing "electrician", for the user's reference.
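Illustratively, a small sketch of this keyword-driven follow-up search, with hypothetical contact fields, might look as follows:
```python
def find_related_contacts(keyword: str, address_book: list[dict]) -> list[dict]:
    """Return contacts whose name or remark mentions the keyword."""
    keyword = keyword.lower()
    return [c for c in address_book
            if keyword in c.get("name", "").lower()
            or keyword in c.get("remark", "").lower()]

address_book = [{"name": "Zhang Wei", "remark": "electrician", "phone": "555-0101"},
                {"name": "Li Na", "remark": "dentist", "phone": "555-0102"}]
# Keyword taken from the audio text; results would be shown in the first interface
print(find_related_contacts("electrician", address_book))
```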
According to the audio conversion method provided by the embodiment of the present invention, because the speech emotion feature of a voice can represent an emotion change of the user who utters it, and an important event is usually stated within a certain time before the moment the user's emotion changes, the terminal device determines whether the second call voice is important call voice by detecting, during the user's call, whether the target speech emotion feature of the first call voice matches the predetermined speech emotion feature, the second call voice being the call voice after a predetermined time before the first call voice. In this way, when the terminal device detects that the user's emotion has changed, it can store the second call voice automatically, without manual operation, and at the same time perform semantic analysis on it to obtain its audio text, so that the user can learn the important call content stated in the second call voice directly from the audio text or the voice itself and is prevented from missing that content.
Optionally, in this embodiment of the present invention, in a case that the audio text of the second call voice includes event information of a target event, as shown in fig. 3, after the step 202, the audio conversion method further includes the following steps:
step A1: and the terminal equipment generates an event description text of the target event according to the event information.
Step A2: the terminal device displays an event description text of the target event on the first interface.
For example, after obtaining the audio text of the second call voice, the terminal device may generate a corresponding event description text for the target event based on the audio text of the second call voice, and display the event description text on the first interface.
Illustratively, after obtaining the audio text of the second call voice, when the terminal device receives a first input of the user, the terminal device generates a corresponding event description text for the target event based on the audio text of the second call voice in response to the first input, and displays the event description text of the target event on the first interface.
In one example, the first input may include a user input on a specific interface, which may be set according to actual requirements; this is not limited in the embodiment of the present invention. For example, if the specific interface is the text interface of the audio text, the user input on the specific interface may include an input on a first control in that text interface, where the first control is used to trigger the terminal device to display the event description text of the target event on the first interface.
Further optionally, in the embodiment of the present invention, in a case that the audio text of the second call voice includes event information of a target event, as shown in fig. 3, step A1 may include the following steps:
step B1: and the terminal equipment acquires the identity information of an opposite-end communication user communicating with the terminal equipment user in the second communication voice.
Step B2: and the terminal equipment generates an event description text of the target event according to the event information of the target event and the identity information of the opposite-end communication user.
For example, the terminal device may use the contact phone number and/or contact name of the opposite-end call user as an index and automatically obtain that user's identity information from the address book stored in the terminal device or from contact information stored in communication software. The contact information includes at least one of the following: contact name, contact remark, contact position, contact title, the relationship between the contact and the user, etc.
Illustratively, when the terminal device cannot acquire the identity information of the opposite-end call user from the address book stored in the terminal device or the contact information stored in the communication software, the terminal device acquires the identity information of the opposite-end call user from the audio text of the second call voice.
Illustratively, the event description text of the target event includes at least one of the following: keywords of the target event, the occurrence time of the target event, and the statement content of the opposite-end call user concerning the target event (e.g., the opposite-end call user's opinion, idea, suggestion, or response to the target event).
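Illustratively, as a sketch of steps B1-B2, the event description text could be composed from the event information and the acquired identity information roughly as follows; all field names here are hypothetical:
```python
def build_event_description(event: dict, caller: dict) -> str:
    """Compose a short event description text from event info and caller identity."""
    lines = [
        f"Caller: {caller.get('name', 'unknown')} ({caller.get('relationship', 'unknown')})",
        f"Event: {', '.join(event.get('keywords', []))}",
        f"When: {event.get('occurrence_time', 'unspecified')}",
        f"Statement: {event.get('statement', '')}",   # opinion/suggestion/response
    ]
    return "\n".join(lines)

caller = {"name": "Li Ming", "relationship": "colleague"}
event = {"keywords": ["project deadline"], "occurrence_time": "Friday 5 pm",
         "statement": "suggests submitting the report one day early"}
print(build_event_description(event, caller))   # shown on the first interface
```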
In this way, the terminal device can compose the event description text of the target event from the acquired identity information of the opposite-end call user and the event information, so that the user directly obtains an abridged version of what was stated in the second call voice. This makes it convenient for the user to grasp the key information in the second call voice quickly, greatly saving the user's time and improving the efficiency of the user's work and life.
Optionally, in the embodiment of the present invention, after the terminal device obtains the audio texts of the call voices of M (where M is a positive integer greater than 1) different opposite-end call users, the terminal device may rank the M opposite-end call users by the urgency and importance of their identities, and may also determine the importance of each event in the X event description texts corresponding to the M opposite-end call users.
For example, the terminal device may determine the priority of each of the X event description texts corresponding to the M opposite-end call users according to this ranking of the users' identities.
Illustratively, when ranking the M opposite-end call users by urgency and importance, the terminal device may rank them according to their identity information, which includes but is not limited to: name, remarks, position, title, relationship with the user, etc.
Example 1: the terminal device may sort the M opposite-end call users by their positions. For example, the position priority may be: superior leader > direct leader > business contacts.
Example 2: the terminal device may rank the M opposite-end call users by their relationship with the user. For example, the priority of the relationship between an opposite-end call user and the terminal device user may be: relatives > friends > colleagues.
For example, the terminal device may set a corresponding reminding policy for each event according to the text priority of its event description text among the X event description texts, so as to remind the user of the corresponding event.
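Illustratively, a minimal sketch of such ranking, using the relationship ordering of example 2 above (labels and data structure are assumptions):
```python
RELATIONSHIP_PRIORITY = {"relative": 0, "friend": 1, "colleague": 2}  # lower = more urgent

def rank_events(event_texts: list[dict]) -> list[dict]:
    """Each entry: {'text': ..., 'relationship': ...}; returns most urgent first."""
    return sorted(event_texts,
                  key=lambda e: RELATIONSHIP_PRIORITY.get(e["relationship"], 99))

events = [{"text": "meeting moved to 3 pm", "relationship": "colleague"},
          {"text": "power failure at home", "relationship": "relative"}]
for rank, e in enumerate(rank_events(events), start=1):
    print(rank, e["text"])    # remind the user of events in this order
```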
In this way, by ranking opposite-end call users by the urgency and importance of their identities, the terminal device can set a reminding policy for the corresponding events, reminding the user to handle important events in time and preventing the user from overlooking them.
Fig. 4 is a schematic structural diagram of a terminal device for audio conversion according to an embodiment of the present invention. As shown in fig. 4, the terminal device 600 includes an obtaining module 601, a storage module 602, and an analysis module 603, wherein:
the obtaining module 601 is configured to obtain a target speech emotion feature of a first call speech in a call process.
The storage module 602 is configured to store the second call voice of the call when the target speech emotion feature obtained by the obtaining module 601 matches a predetermined speech emotion feature.
The analysis module 603 is configured to perform semantic analysis on the second call voice stored by the storage module 602 to obtain the audio text of the second call voice.
Wherein the second call voice is the call audio after a target call time, and the target call time is a predetermined time before the first call voice.
Optionally, in the embodiment of the present invention, the target speech emotion feature is used to characterize the user emotion of the call user who utters the first call voice.
Optionally, in the embodiment of the present invention, the first call voice includes at least one first speech emotion feature, each of the at least one first speech emotion feature corresponds to one call time, and the at least one first speech emotion feature includes the target speech emotion feature; the target call time is a predetermined time before the call time corresponding to the target speech emotion feature.
Optionally, in the embodiment of the present invention, as shown in fig. 4, the terminal device 600 further includes a display module 604 and a generating module 605. The generating module 605 is configured to generate an event description text of the target event according to the event information, and the display module 604 is configured to display the event description text of the target event on the first interface, where the audio text includes the event information of the target event.
Optionally, in the embodiment of the present invention, as shown in fig. 4, the generating module 605 is specifically configured to: acquire the identity information of the opposite-end call user who communicates with the terminal device user in the second call voice, and generate the event description text of the target event according to the event information of the target event and the identity information of the opposite-end call user.
The terminal device provided by the embodiment of the present invention can implement each process implemented by the terminal device in the above method embodiments, and is not described herein again to avoid repetition.
In the terminal device provided by the embodiment of the present invention, because the speech emotion feature of a voice can represent an emotion change of the user who utters it, and an important event is usually stated within a certain time before the moment the user's emotion changes, the terminal device determines whether the second call voice is important call voice by detecting whether the target speech emotion feature of the first call voice matches the predetermined speech emotion feature during the user's call, the second call voice being the call voice after a predetermined time before the first call voice, which is usually where the important content lies. In this way, when the terminal device detects that the user's emotion has changed, it can store the second call voice automatically, without manual operation, and at the same time perform semantic analysis on it to obtain its audio text, so that the user can learn the important call content stated in the second call voice directly from the audio text or the voice itself and is prevented from missing that content.
Fig. 5 is a schematic diagram of a hardware structure of a terminal device implementing various embodiments of the present invention. The terminal device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will appreciate that the configuration of the terminal device 100 shown in fig. 5 does not constitute a limitation of the terminal device; the terminal device 100 may include more or fewer components than shown, combine some components, or arrange components differently. In the embodiment of the present invention, the terminal device 100 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal device, a wearable device, a pedometer, and the like.
The processor 110 is configured to acquire the target speech emotion feature of the first call voice during a call; the memory 109 is configured to store the second call voice of the call when the target speech emotion feature acquired by the processor 110 matches a predetermined speech emotion feature; and the processor 110 is configured to perform semantic analysis on the second call voice stored in the memory 109 to obtain the audio text of the second call voice; wherein the second call voice is the call audio after a target call time, and the target call time is a predetermined time before the first call voice.
In the terminal device provided by the embodiment of the present invention, because the speech emotion feature of a voice can represent an emotion change of the user who utters it, and an important event is usually stated within a certain time before the moment the user's emotion changes, the terminal device determines whether the second call voice is important call voice by detecting whether the target speech emotion feature of the first call voice matches the predetermined speech emotion feature during the user's call, the second call voice being the call voice after a predetermined time before the first call voice, which is usually where the important content lies. In this way, when the terminal device detects that the user's emotion has changed, it can store the second call voice automatically, without manual operation, and at the same time perform semantic analysis on it to obtain its audio text, so that the user can learn the important call content stated in the second call voice directly from the audio text or the voice itself and is prevented from missing that content.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 101 may be used to receive and send signals during message transmission or a call. Specifically, it receives downlink data from a base station and passes it to the processor 110 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 can also communicate with a network and other devices through a wireless communication system.
The terminal device 100 provides the user with wireless broadband internet access via the network module 102, such as helping the user send and receive e-mails, browse web pages, and access streaming media.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the network module 102 or stored in the memory 109 into an audio signal and output as sound. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the terminal device 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 includes a speaker, a buzzer, a receiver, and the like.
The input unit 104 is used to receive audio or video signals. The input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106, stored in the memory 109 (or another storage medium), or transmitted via the radio frequency unit 101 or the network module 102. The microphone 1042 may receive sound and process it into audio data. In a phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101 and output.
The terminal device 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 1061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 1061 and/or the backlight when the terminal device 100 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the terminal device posture (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration identification related functions (such as pedometer, tapping), and the like; the sensors 105 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 106 is used to display information input by a user or information provided to the user. The Display unit 106 may include a Display panel 1061, and the Display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal device 100. Specifically, the user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations with a finger, stylus, or any suitable object or attachment). The touch panel 1071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends them to the processor 110, and receives and executes commands sent by the processor 110. The touch panel 1071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1071, the user input unit 107 may include other input devices 1072, which may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys), a trackball, a mouse, and a joystick; these are not described in detail here.
Further, the touch panel 1071 may be overlaid on the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch panel 1071 transmits the touch operation to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of the touch event. Although in fig. 5, the touch panel 1071 and the display panel 1061 are two independent components to implement the input and output functions of the terminal device 100, in some embodiments, the touch panel 1071 and the display panel 1061 may be integrated to implement the input and output functions of the terminal device 100, and is not limited herein.
The interface unit 108 is an interface for connecting an external device to the terminal apparatus 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the terminal apparatus 100 or may be used to transmit data between the terminal apparatus 100 and the external device.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 110 is a control center of the terminal device 100, connects various parts of the entire terminal device 100 by various interfaces and lines, and performs various functions of the terminal device 100 and processes data by running or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the terminal device 100. Processor 110 may include one or more processing units; alternatively, the processor 110 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The terminal device 100 may further include a power supply 111 (such as a battery) for supplying power to each component, and optionally, the power supply 111 may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
In addition, the terminal device 100 includes some functional modules that are not shown, and are not described in detail here.
Optionally, an embodiment of the present invention further provides a terminal device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When executed by the processor, the computer program implements each process of the above audio conversion method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned audio conversion method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods described in the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. An audio conversion method applied to a terminal device is characterized by comprising the following steps:
acquiring target voice emotional characteristics of first call voice in a call process;
under the condition that the target voice emotional characteristics are matched with preset voice emotional characteristics, second call voice in the call process is stored, semantic analysis is carried out on the second call voice, and an audio text of the second call voice is obtained;
wherein the second call voice is: a call audio after a target call time, and the target call time is a predetermined time before the first call voice.
2. The method of claim 1, wherein the target voice emotional characteristics are used to characterize a user emotion of the call user who utters the first call voice.
3. The method of claim 1, wherein the audio text comprises event information of a target event;
after the semantic analysis is performed on the second call voice to obtain the audio text of the second call voice, the method further includes:
generating an event description text of the target event according to the event information;
and displaying the event description text of the target event on the first interface.
4. The method according to claim 3, wherein the generating an event description text of the target event according to the event information comprises:
acquiring identity information of an opposite-end call user who communicates with a terminal equipment user in the second call voice;
and generating an event description text of the target event according to the event information and the identity information of the opposite-end call user.
5. The method of claim 1, wherein the first call voice comprises at least one first voice emotional characteristic, each of the at least one first voice emotional characteristic corresponds to a call time, and the at least one first voice emotional characteristic comprises the target voice emotional characteristics;
and the target call time is a predetermined time before the call time corresponding to the target voice emotional characteristics.
6. A terminal device, characterized in that the terminal device comprises:
the obtaining module is used for obtaining target voice emotional characteristics of the first call voice in the call process;
the storage module is used for storing a second call voice in the call process under the condition that the target voice emotion characteristics acquired by the acquisition module are matched with preset voice emotion characteristics;
the analysis module is used for performing semantic analysis on the second call voice stored in the storage module to obtain an audio text of the second call voice;
wherein the second call voice is: a call audio after a target call time, and the target call time is a predetermined time before the first call voice.
7. The terminal device of claim 6, wherein the target voice emotional characteristics are used to characterize a user emotion of the call user who utters the first call voice.
8. The terminal device according to claim 6, wherein the audio text comprises event information of a target event;
the terminal device further includes:
the generating module is used for generating an event description text of the target event according to the event information;
and the display module is used for displaying the event description text of the target event on the first interface.
9. The terminal device of claim 8, wherein the generating module is specifically configured to:
acquiring identity information of an opposite-end call user who communicates with a terminal equipment user in the second call voice;
and generating an event description text of the target event according to the event information and the identity information of the opposite-end call user.
10. The terminal device of claim 6, wherein the first call voice comprises at least one first voice emotional characteristic, each of the at least one first voice emotional characteristic corresponds to a call time, and the at least one first voice emotional characteristic comprises the target voice emotional characteristics;
and the target call time is a predetermined time before the call time corresponding to the target voice emotional characteristics.
11. A terminal device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the audio conversion method according to any one of claims 1 to 5.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the audio conversion method according to any one of claims 1 to 5.
CN201911033600.7A 2019-10-28 2019-10-28 Audio conversion method and terminal equipment Pending CN110880330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911033600.7A CN110880330A (en) 2019-10-28 2019-10-28 Audio conversion method and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911033600.7A CN110880330A (en) 2019-10-28 2019-10-28 Audio conversion method and terminal equipment

Publications (1)

Publication Number Publication Date
CN110880330A (en) 2020-03-13

Family

ID=69728490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911033600.7A Pending CN110880330A (en) 2019-10-28 2019-10-28 Audio conversion method and terminal equipment

Country Status (1)

Country Link
CN (1) CN110880330A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070288447A1 (en) * 2003-12-09 2007-12-13 Swiss Reinsurance Company System and Method for the Aggregation and Monitoring of Multimedia Data That are Stored in a Decentralized Manner
CN104463139A (en) * 2014-12-23 2015-03-25 福州大学 Sports video wonderful event detection method based on audio emotion driving
CN107293309A (en) * 2017-05-19 2017-10-24 四川新网银行股份有限公司 A kind of method that lifting public sentiment monitoring efficiency is analyzed based on customer anger
CN109842712A (en) * 2019-03-12 2019-06-04 贵州财富之舟科技有限公司 Method, apparatus, computer equipment and the storage medium that message registration generates
CN110335596A (en) * 2019-06-19 2019-10-15 深圳壹账通智能科技有限公司 Products Show method, apparatus, equipment and storage medium based on speech recognition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111800543A (en) * 2020-06-30 2020-10-20 深圳传音控股股份有限公司 Audio file processing method, terminal device and storage medium
CN111899719A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio

Similar Documents

Publication Publication Date Title
CN110768805B (en) Group message display method and electronic equipment
CN108391008B (en) Message reminding method and mobile terminal
US20210352040A1 (en) Message sending method and terminal device
CN108616448B (en) Information sharing path recommendation method and mobile terminal
CN109412932B (en) Screen capturing method and terminal
CN108388403B (en) Method and terminal for processing message
CN110971510A (en) Message processing method and electronic equipment
CN109495638B (en) Information display method and terminal
CN111124345A (en) Audio source processing method and mobile terminal
CN108376096B (en) Message display method and mobile terminal
CN108600079B (en) Chat record display method and mobile terminal
CN109982273B (en) Information reply method and mobile terminal
CN110012151B (en) Information display method and terminal equipment
CN108520760B (en) Voice signal processing method and terminal
CN108270928B (en) Voice recognition method and mobile terminal
CN110989847A (en) Information recommendation method and device, terminal equipment and storage medium
CN110795188A (en) Message interaction method and electronic equipment
CN111061446A (en) Display method and electronic equipment
CN108307048B (en) Message output method and device and mobile terminal
CN110880330A (en) Audio conversion method and terminal equipment
CN108093119B (en) Strange incoming call number marking method and mobile terminal
CN112217713B (en) Method and device for displaying message
CN111369994B (en) Voice processing method and electronic equipment
CN110032320B (en) Page rolling control method and device and terminal
CN109286726B (en) Content display method and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313