CN111696537B - Voice processing method, device and medium - Google Patents

Voice processing method, device and medium

Info

Publication number
CN111696537B
CN111696537B (application CN202010507539.1A)
Authority
CN
China
Prior art keywords
information
user
emotion
voice data
prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010507539.1A
Other languages
Chinese (zh)
Other versions
CN111696537A
Inventor
王颖
李健涛
张丹
刘宝
张硕
杨天府
梁宵
荣河江
李鹏翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010507539.1A
Publication of CN111696537A
Application granted
Publication of CN111696537B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G08: SIGNALLING
    • G08B: SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00: Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/18: Status alarms
    • G08B21/24: Reminder alarms, e.g. anti-loss alarms
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a voice processing method, a voice processing apparatus, and a device for voice processing. The method is applied to an earphone storage device and specifically comprises the following steps: receiving, from an earphone device, voice data of a conversation whose participants include at least two call users; determining prompt information corresponding to the voice data, the prompt information being obtained according to semantic information and/or emotion information corresponding to the voice data; and sending the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device outputs the prompt information. Embodiments of the invention can improve the quality of the current conversation or of subsequent conversations.

Description

Voice processing method, device and medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a speech processing method, a speech processing apparatus, and a machine-readable medium.
Background
As one of the most natural modes of communication, speech is widely used in speech processing scenarios such as voice conversation, voice-based social interaction, karaoke (KTV), live streaming, gaming, and video recording.
At present, collected speech is typically used directly in these scenarios, for example by sending it to the other end of a call, or by carrying the recorded audio in a video.
In practical applications, a user may be dissatisfied with the collected speech, in which case the user will want to beautify it. For example, some users wish to impress listeners and enhance their confidence by beautifying their speech.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a speech processing method, a speech processing apparatus, and a device for speech processing that overcome, or at least partially solve, those problems; the embodiments of the present invention can improve the conversation quality of the current conversation or of subsequent conversations.
In order to solve the above problems, the present invention discloses a voice processing method, comprising:
receiving voice data of a conversation from an earphone device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
In another aspect, an embodiment of the present invention discloses a voice processing apparatus, applied to an earphone storage device, the voice processing apparatus comprising:
a receiving module for receiving voice data of a conversation from the earphone device; the participants of the conversation include: at least two call users;
the determining module is used for determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and the sending module is used for sending the prompt information to the earphone device in the conversation process and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice data of a conversation from an earphone device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished, so that the earphone device outputs the prompt information.
Embodiments of the invention also disclose one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the aforementioned method.
The embodiment of the invention has the following advantages:
the earphone storage device provided by the embodiment of the invention enables the prompt information to be output during the conversation and/or after the conversation ends. The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so it can alert the user to problems in the conversation; the user can then correct those problems in time and improve the quality of the current conversation, or correct them after the conversation ends to improve the quality of subsequent conversations.
In addition, the prompt information of the embodiment of the invention can present information related to the conversation, such as the second user's evaluation information for the first user, the second user's trust information, and conversation quality information, so that the user can understand how the conversation went and decide on matters related to it.
For example, in a job interview scenario, if the first user is a job seeker and the second user is an interviewer, the interviewer's evaluation information for the job seeker lets the job seeker learn more about how the interview went, and helps the job seeker judge the probability of success. If the first user is an interviewer and the second user is a job seeker, the job seeker's trust information lets the interviewer gauge the job seeker's credibility, and thus helps the interviewer evaluate the job seeker more accurately.
As another example, in an interview scenario, the conversation quality information can help the interview participants understand the quality of the interview and accumulate interviewing experience, so as to improve the quality of subsequent interviews.
For another example, in a speech practice scenario, the conversation-related information can help the user identify shortcomings in delivery, such as speaking too fast in period 1 or stumbling in period 2, so as to improve the quality of subsequent practice.
Drawings
FIG. 1 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention;
FIG. 2 is a flow chart of steps of an embodiment of a speech processing method of the present invention;
FIG. 3 is a block diagram of a speech processing apparatus of the present invention;
FIG. 4 is a block diagram of an apparatus 1300 for speech processing according to the present invention; and
Fig. 5 is a schematic structural diagram of a server according to the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention may become more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The embodiments of the present invention can be applied to dialogue scenarios. A dialogue scenario may include: a communication-based dialogue scenario, such as an operator-based call scenario or a network-based call scenario. Alternatively, a dialogue scenario may include: an in-person dialogue scenario, such as a face-to-face interview scenario.
Depending on the domain to which the dialogue relates, dialogue scenarios may include: interview scenarios, business communication scenarios, speech practice scenarios, and the like.
Depending on the type of dialogue, dialogue scenarios may include: voice conversation scenarios, video conversation scenarios, and the like. It will be appreciated that embodiments of the present invention are not limited to a particular dialogue scenario.
An embodiment of the present invention provides a voice processing scheme that can be executed by an earphone storage device and specifically includes: receiving voice data of a conversation from an earphone device, the participants of the conversation specifically including at least two call users; determining prompt information corresponding to the voice data, where the prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data; and outputting the prompt information during the conversation and/or after the conversation ends.
The earphone storage device provided by the embodiment of the invention can receive the voice data of the conversation from the earphone device. The voice data may include: voice data of at least one participant.
In one embodiment of the present invention, the participants may include: a first user and a second user. The first user may refer to the local-end user wearing the earphone device. The second user may refer to the user at the opposite end; the second user may or may not wear an earphone device.
Of course, in addition to the first user and the second user, the participants may include: third and fourth users, etc.
The earphone storage device provided by the embodiment of the invention enables the prompt information to be output during the conversation and/or after the conversation ends. The prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data, and can alert the user to problems in the conversation, so that the user can correct those problems in time and improve the quality of the current conversation; alternatively, the user can correct the problems after the conversation ends, improving the quality of subsequent conversations.
The prompt information of the embodiment of the invention can also present information related to the conversation, such as the second user's evaluation information for the first user, the second user's trust information, and conversation quality information, so that the user can understand how the conversation went and decide on matters related to it.
For example, in a job interview scenario, if the first user is a job seeker and the second user is an interviewer, the interviewer's evaluation information for the job seeker lets the job seeker learn more about how the interview went, and helps the job seeker judge the probability of success. If the first user is an interviewer and the second user is a job seeker, the job seeker's trust information lets the interviewer gauge the job seeker's credibility, and thus helps the interviewer evaluate the job seeker more accurately.
As another example, in an interview scenario, the conversation quality information can help the interview participants understand the quality of the interview and accumulate interviewing experience, so as to improve the quality of subsequent interviews.
For another example, in a speech practice scenario, the conversation-related information can help the user identify shortcomings in delivery, such as speaking too fast in period 1 or stumbling in period 2, so as to improve the quality of subsequent practice.
The prompt information in the embodiments of the present application can be directed at any party in the call. For example, the earphone device of the first user may obtain prompt information for the first user and provide it to the first user. Alternatively, the earphone device of the first user may obtain prompt information for the second user and present it to the second user.
The earphone device in the embodiments of the present invention may be a headset, such as a Bluetooth earphone, a sports earphone, or a true wireless stereo (TWS) earphone, and may also be referred to as an artificial intelligence (AI) earphone.
Optionally, the earphone device may comprise a plurality of microphone array elements, a processor, and a speaker.
The plurality of microphone array elements can pick up voice data within a preset angle range. The processor is used for determining prompt information corresponding to the voice data.
According to one embodiment, the processor of the headset device may process the voice data to obtain the alert information.
According to another embodiment, constrained by the physical size of the earphone device, the task of processing the voice data may be handed over to an external device so that the earphone device can be kept small. Correspondingly, the processor of the earphone device can exchange data with the external device to obtain the prompt information produced by the external device. The speaker is used for playing sound, such as playing the prompt information.
The external device may include: a terminal and/or an earphone storage device. Of course, the external device may also include: a server.
Optionally, the terminal may include: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a car computer, a desktop computer, a set-top box, a smart television, a wearable device, a smart speaker, and the like. It will be appreciated that embodiments of the present invention are not limited to a particular terminal.
The earphone storage device can be used for storing the earphone device. Optionally, the earphone storage device is further used for supplying power to the earphone device. The earphone storage device in the embodiments of the present invention can also be used for receiving voice data from the earphone device and processing the voice data to obtain the prompt information.
In practical application, the earphone storage device may be an earphone storage box. The earphone device and the earphone receiving device may be sold separately or in sets.
In the embodiments of the present invention, the connection between the earphone device and an external device can be wired or wireless. The connection may include: a physical connection, a Bluetooth connection, an infrared connection, or a WIFI (Wireless Fidelity) connection, etc. It will be appreciated that embodiments of the present invention are not limited to a specific connection between the earphone device and the external device.
In an alternative embodiment of the present invention, as an external device of the earphone device, the earphone storage device may be provided with a processing chip, and the processing chip may process the voice data using the voice processing method of the embodiments of the present invention.
In another alternative embodiment of the present invention, as an external device of the earphone device, the earphone receiving device may perform data interaction with the server, and may transmit a task of processing the voice data to the server. For example, the earphone storage device may send the voice data collected by the earphone device to the server, so that the server processes the voice data. The earphone storage device may also transmit the processed prompt information to the earphone device or the terminal.
Referring to FIG. 1, a schematic structural diagram of a speech processing system according to an embodiment of the present invention is shown, which specifically includes: an earphone device 101, an earphone storage device 102, a server 103, and a mobile terminal 104.
The earphone device 101 is connected to the earphone housing device 102 via bluetooth, and the earphone device 101 is connected to the mobile terminal 104 via bluetooth.
While the first user uses the mobile terminal 104, the first user wears the earphone device 101 and can listen and speak through the earphone device 101.
The earphone storage device 102 has mobile or wireless networking capability and can exchange data with the server 103. For example, the earphone storage device 102 may receive voice data collected by the earphone device and send the voice data to the server 103, and the earphone storage device 102 may send the prompt information processed by the server 103 to the earphone device.
In this embodiment of the present invention, optionally, a first processor and a second processor are respectively disposed on the two sides of the earphone device 101, where the first processor is used for data interaction with the earphone storage device 102, and the second processor is used for data interaction with the mobile terminal 104.
For example, during a conversation using the mobile terminal 104, the earphone device 101 may collect voice data of a participant, and the earphone device 101 may determine prompt information corresponding to the voice data in real time and output the prompt information to the first user.
In the embodiment of the present invention, optionally, the earphone device 101 may play the prompt information, so that the user can improve the problem of the user in the dialogue according to the prompt information.
In this embodiment of the present invention, optionally, the earphone device 101 may include a first side and a second side, where the first side is used for playing voice data, and the second side is used for playing prompt information. Of course, the embodiment of the invention does not limit the specific playing side of the voice data and the prompt information.
Method embodiment one
Referring to FIG. 2, a flowchart of the steps of a first embodiment of a voice processing method of the present invention is shown. The method is applied to an earphone storage device and may specifically include the following steps:
step 201, receiving voice data of a conversation from an earphone device; the participants of the conversation specifically include: at least two call users;
step 202, determining prompt information corresponding to the voice data; the prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data;
step 203, the prompt message is sent to the earphone device during the session and/or after the session is ended, so that the earphone device outputs the prompt message.
In step 201, the earphone device may collect voice data generated by the participant by using the microphone array element, and send the voice data to the earphone storage device. The connection mode between the earphone device and the earphone storage device can be a wireless connection mode, such as a Bluetooth connection mode.
In step 202, the earphone storage device may process the voice data to obtain the prompt information; or the earphone storage device may send the voice data to a server so that the server processes the voice data to obtain the prompt information.
For example, in an optional embodiment of the present invention, the determining the prompt information corresponding to the voice data specifically includes: sending the voice data to a server; and receiving the returned prompt information from the server.
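As an illustration of this interaction only (the embodiment does not specify a transport protocol; the HTTP endpoint and the response field below are assumptions), the round trip between the earphone storage device and the server might be sketched as:

```python
import requests  # assumed HTTP transport; the embodiment does not mandate any particular protocol

SERVER_URL = "https://example.com/voice/prompt"  # hypothetical server endpoint

def forward_voice_and_get_prompt(voice_bytes: bytes, session_id: str) -> str:
    """Send voice data received from the earphone device to the server and
    return the prompt information the server produces for it."""
    response = requests.post(
        SERVER_URL,
        files={"voice": ("segment.wav", voice_bytes, "audio/wav")},
        data={"session_id": session_id},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["prompt"]  # hypothetical response field
```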
The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so that the prompt information can prompt a user of a problem in the conversation process.
Semantic information is one of the forms in which information is expressed, and refers to information with a specific meaning that can eliminate uncertainty about things. In the embodiments of the present invention, semantic analysis can be performed on the voice data to obtain the corresponding semantic information. Available semantic analysis methods include: keyword extraction, sentence component analysis, machine learning, and the like; it will be appreciated that embodiments of the present invention are not limited to a specific semantic analysis method.
In the embodiment of the invention, optionally, a voice recognition method may be used to convert voice data into dialogue text, and perform semantic analysis on the dialogue text to obtain corresponding semantic information.
In the embodiment of the invention, optionally, a first dialogue text and a second dialogue text in the dialogue text can be identified according to the dialogue identity information, and the first semantic information and the second semantic information corresponding to the first dialogue text and the second dialogue text respectively are determined. The first dialog text and the second dialog text may correspond to different dialog identity information, for example, the first dialog text corresponds to a first user, the second dialog text corresponds to a second user, etc.
In the embodiments of the present invention, optionally, a voiceprint recognition method can be used to determine the dialogue identity information. Voiceprint recognition identifies the speaker of a speech sample from voice parameters in the speech waveform that reflect the physiological and behavioral characteristics of the speaking user. Since different users have different voiceprints, different dialogue identity information can be determined using voiceprint recognition.
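A minimal sketch of how dialogue identity information could be resolved from voiceprints, assuming embeddings have already been extracted by some voiceprint model (the enrolled profiles and the similarity threshold are illustrative assumptions, not part of the disclosure):

```python
import numpy as np

def identify_speaker(segment_embedding: np.ndarray,
                     enrolled: dict,
                     threshold: float = 0.75) -> str:
    """Compare the voiceprint embedding of a speech segment against enrolled
    voiceprints (identity -> embedding) using cosine similarity and return the
    best-matching identity, or 'unknown' if nothing is similar enough."""
    best_identity, best_score = "unknown", threshold
    for identity, reference in enrolled.items():
        score = float(np.dot(segment_embedding, reference)
                      / (np.linalg.norm(segment_embedding) * np.linalg.norm(reference)))
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity
```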
In an optional embodiment of the present invention, if the semantic information indicates that the voice data of any participant does not match the topic corresponding to the voice data, corresponding prompt information may be output to that participant.
The topic refers to the central idea that the voice data is meant to express, and generally refers to its main content. A topic analysis method can be adopted to determine the topics corresponding to the voice data. It is understood that the voice data may involve at least one topic.
For example, speech data is analyzed by time, and different times may correspond to different topics. As another example, the voice data is analyzed according to dialogue identity information, and different dialogue identity information may correspond to different topics.
For example, in a press interview scenario, an interviewer sets several topics and plans to guide the dialogue according to them; if, during the actual dialogue, the interviewee's voice data does not match the topics set by the interviewer, corresponding prompt information can be output to the interviewer so that the interviewer can switch topics as needed.
For another example, in a job interview scenario, an interviewer sets several topics and plans to guide the conversation according to them; if, during the actual dialogue, the job seeker's voice data does not match the topics set by the interviewer, corresponding prompt information can be output to the job seeker so that the job seeker can adjust the content of his or her speech to the actual situation. Alternatively, corresponding prompt information can be output to the interviewer so that the interviewer can switch topics as needed.
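A simple keyword-overlap check is one way such a topic match could be implemented; this is only a sketch, and the overlap ratio used as the match criterion is an assumption:

```python
def matches_topic(utterance_keywords: set, topic_keywords: set,
                  min_overlap: float = 0.2) -> bool:
    """Return True if enough of the topic's keywords appear among the keywords
    extracted from a participant's utterance."""
    if not topic_keywords:
        return True
    overlap = len(utterance_keywords & topic_keywords) / len(topic_keywords)
    return overlap >= min_overlap

# e.g. if not matches_topic(extracted_keywords, topic_keywords), output prompt
# information suggesting that the participant adjust, or the interviewer switch, the topic.
```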
In the embodiments of the present invention, emotion refers to psychological experiences such as happiness, anger, sorrow, joy and fear, and such experiences reflect a person's attitude toward objective things. Emotions have positive and negative properties. Things that meet a person's needs give rise to positive experiences, such as happiness and satisfaction; things that do not meet a person's needs give rise to negative experiences, such as anger, resentment and complaint.
In an alternative embodiment of the present invention, the emotion information may include: positive emotion or negative emotion, where positive emotion is constructive and proactive and negative emotion is destructive and demotivating. Negative emotions may include, but are not limited to: anxiety, tension, anger, depression, sadness, pain, boredom, and the like. Positive emotions may include, but are not limited to: happiness, optimism, confidence, pleasure, relaxation, and the like. Optionally, the emotion information may further include: neutral emotions, which may include, but are not limited to: calmness, and the like.
In the embodiment of the present invention, optionally, the emotion information may be obtained according to a voice feature corresponding to the voice data; and/or
The emotion information is obtained by analyzing somatosensory data of the user.
The speech features may characterize various aspects of the speech. The speech features include at least one of the following features: a tone feature, a rhythm feature, and an intensity feature.
For example, in a state of tension the normal tremor of the vocal organs is suppressed, and such tremor of the voice cannot be controlled at will, so the user's emotion information can be obtained by monitoring the user's speech features.
The embodiment of the application can determine the emotion information of the user by utilizing the mapping relation between the voice characteristics and the emotion information.
It should be noted that, in the embodiments of the present application, the mapping relationship may be represented by a data table, that is, the fields corresponding to the mapping relationship may be stored in the data table. Alternatively, the mapping relationship between input data and output data may be characterized by a data analyzer. Correspondingly, the method may further include: training on training data to obtain a data analyzer, where the data analyzer may be used to characterize the mapping relationship between input data and output data.
In an alternative embodiment of the application, the data model may be trained based on training data to obtain the data analyzer.
The training data in the embodiments of the present invention may include dialogue data, which may be obtained in a voice dialogue scenario or a video dialogue scenario, so as to improve the match between the training data and the voice or video dialogue scenario.
In the embodiments of the present invention, optionally, the dialogue data can be distinguished according to domain. A domain refers to a specific scope; different domains can be obtained according to the different application scenarios of the dialogue. For example, the domains may include: the interview domain, the business communication domain, the social domain, the speech practice domain, and the like.
In the embodiments of the present invention, optionally, the dialogue data may include dialogue data corresponding to the first user, and, of course, may also include dialogue data corresponding to users other than the first user.
A mathematical model is a scientific or engineering model constructed with mathematical logic and mathematical language: a structure that expresses, in a generalized or approximate way, the characteristics or quantitative dependencies of a given object system by means of mathematical symbols. A mathematical model may be one or a set of algebraic, differential, integral or statistical equations, or a combination of them, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. Besides models described by equations, there are models described by other mathematical tools, such as algebra, geometry, topology and mathematical logic. A mathematical model describes the behavior and characteristics of a system rather than its actual structure. The mathematical model may be trained by machine learning methods, deep learning methods, and the like; the machine learning methods may include: linear regression, decision trees, random forests, etc., and the deep learning methods may include: convolutional neural networks (CNN, Convolutional Neural Networks), long short-term memory networks (LSTM, Long Short-Term Memory), gated recurrent units (GRU, Gated Recurrent Unit), and the like.
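As a sketch of the data analyzer idea, using one of the machine learning methods named above (a decision tree); the feature choice and the tiny training set are purely illustrative assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: [mean intensity (dB), speech rate (words/min),
# pitch variance] -> emotion label. Real training data would come from
# dialogue data collected in voice or video dialogue scenarios.
X_train = [
    [35.0, 90.0, 12.0],   # quiet, slow, flat   -> "depressed"
    [55.0, 140.0, 30.0],  # moderate            -> "neutral"
    [72.0, 180.0, 55.0],  # loud, fast, varied  -> "excited"
]
y_train = ["depressed", "neutral", "excited"]

analyzer = DecisionTreeClassifier().fit(X_train, y_train)

def analyze_emotion(features):
    """Map a speech-feature vector to emotion information via the trained analyzer."""
    return analyzer.predict([features])[0]
```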
Assuming the unit of the intensity feature is the decibel, preset threshold ranges corresponding to emotion information can be defined for decibel values. For example, since the sound level of normal speech is around 50 dB, the preset threshold ranges may be set to 0-40 dB, 41-60 dB and 61-80 dB, where the emotion information corresponding to the 0-40 dB range is "frustrated", "tense" or "depressed", the emotion information corresponding to the 41-60 dB range is "neutral", and the emotion information corresponding to the 61-80 dB range is "excited", "angry" or "restless".
Somatosensation is a collective term for touch, pressure, temperature, pain and proprioception (sensations related to muscle and joint position and movement, body posture and movement, and facial expression). The somatosensory data of the user may include: at least one of body temperature data, pulse data, image data, and limb data.
In the embodiments of the present invention, the somatosensory data may be acquired by the earphone device. Optionally, the earphone device may have sensors installed externally or internally to collect the user's somatosensory data.
For example, a motion sensor is provided inside the earphone device to collect motion information of the user's head. Examples of the motion information may include: "shaking the head", "nodding", and the like.
As another example, an image sensor is provided inside the earphone device to obtain the user's expression information from images. Examples of expression information may include: smiling, frowning, lip-curling, and the like.
The embodiment of the invention can determine the emotion information of the user by using the mapping relation between the somatosensory data and the emotion information.
Taking body temperature data as an example of somatosensory data: because emotional changes drive changes in body temperature as part of the body's regulation, the user's emotion information can be obtained by monitoring the user's body temperature data. For example, a person's normal body temperature ranges from 36°C to 37°C; when the user is excited or angry the body temperature rises accordingly, and when the user is dejected or in low spirits the body temperature falls accordingly. Therefore, the preset threshold ranges can be set to 35.5°C-35.9°C, 36°C-37°C and 37.1°C-37.5°C, where the emotion information corresponding to the 35.5°C-35.9°C range is "frustrated", "tense" or "depressed", the emotion information corresponding to the 36°C-37°C range is "neutral", and the emotion information corresponding to the 37.1°C-37.5°C range is "excited", "angry" or "restless".
Taking pulse data as an example of somatosensory data: because the pulse is closely related to a person's emotion (when a person is excited or angry, changes in the heart speed up the pulse; when a person is asleep or emotionally steady, the pulse beats in a slow, regular rhythm), the user's emotion information can be obtained by monitoring the user's pulse data. For example, preset threshold ranges may be set to 50-60 beats/min, 61-100 beats/min and 101-130 beats/min, where the emotion information corresponding to the 50-60 beats/min range is "low mood" or "depressed", the emotion information corresponding to the 61-100 beats/min range is "neutral", and the emotion information corresponding to the 101-130 beats/min range is "excited" or "tense".
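The three threshold examples above (intensity in decibels, body temperature, pulse) all follow the same range-lookup pattern; a generic sketch of that lookup, with the boundaries copied from the description purely for illustration, could be:

```python
# Preset threshold ranges per monitored signal -> emotion information.
PRESET_RANGES = {
    "intensity_db": [((0, 40), "frustrated/tense/depressed"),
                     ((41, 60), "neutral"),
                     ((61, 80), "excited/angry/restless")],
    "body_temp_c":  [((35.5, 35.9), "frustrated/tense/depressed"),
                     ((36.0, 37.0), "neutral"),
                     ((37.1, 37.5), "excited/angry/restless")],
    "pulse_bpm":    [((50, 60), "low mood/depressed"),
                     ((61, 100), "neutral"),
                     ((101, 130), "excited/tense")],
}

def emotion_from_signal(signal: str, value: float):
    """Return the emotion information whose preset threshold range contains
    the measured value, or None if the value falls outside every range."""
    for (low, high), emotion in PRESET_RANGES.get(signal, []):
        if low <= value <= high:
            return emotion
    return None
```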
According to the embodiment of the invention, under the condition that the emotion information of the first user is the preset emotion information, the corresponding emotion prompt information can be obtained so as to prompt the user to adjust the emotion to obtain a better emotion state.
The preset emotion information may be obtained according to historical voice data of the first user, where the historical voice data may be voice data before the current call.
In an alternative embodiment of the present invention, the user may designate historical voice data of a certain type, for example historical voice data whose emotion information the user is dissatisfied with; the embodiment of the invention can then determine, from the historical voice data designated by the user, the emotion information the user is dissatisfied with and use it as the preset emotion information.
For example, historical voice data corresponding to emotion information that is not satisfied by the user may be received, and the historical voice data may be analyzed to obtain corresponding preset emotion information.
In another optional embodiment of the present invention, after a session is ended, evaluation information of a user for the session may be received, and target historical speech data may be obtained from historical speech data corresponding to the session according to the evaluation information, so as to obtain preset emotion information according to emotion information corresponding to the target historical speech data. For example, the evaluation information may include: "satisfactory" or "unsatisfactory", or the like, the history voice data corresponding to the evaluation information of "unsatisfactory" may be regarded as the target history voice data. It can be appreciated that the embodiment of the present invention is not limited to specific preset emotion information and corresponding determination manners thereof.
The prompt information of the embodiment of the invention can comprise at least one of the following information:
prompt information 1, evaluation information of the second user for the first user;
prompt information 2, trust information of the second user;
prompt 3, emotion prompt;
prompt 4, rhythm prompt;
prompt information 5, dialogue quality information;
prompt 6, dialogue atmosphere information.
For prompt information 1, the second user's evaluation information for the first user may characterize the second user's satisfaction with the first user. The evaluation information can meet users' information needs in dialogue scenarios such as job interview scenarios, press interview scenarios and business communication scenarios, and helps users plan matters related to those scenarios.
In an alternative embodiment of the present invention, the method may further include: determining evaluation information of a second user for the first user according to semantic information corresponding to the second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
Optionally, in the embodiment of the present invention, the evaluation information may include: neutral evaluation information, positive evaluation information, negative evaluation information, and the like.
According to one embodiment, the mapping relation between the semantic information and the evaluation information may be used to determine the evaluation information corresponding to the semantic information.
Optionally, the semantic information may include: keywords. For example, if the semantic information includes positive keywords such as "good" and "satisfactory", positive evaluation information can be obtained. As another example, if the semantic information includes negative keywords such as "sorry" and "regret", negative evaluation information can be obtained.
Optionally, the evaluation information may be determined from the degree of match between the second semantic information corresponding to the second voice data and the first semantic information corresponding to the first voice data. For example, if the degree of match is below a match threshold, the evaluation information may be negative evaluation information; if the degree of match is above the match threshold, the evaluation information may be positive evaluation information, and so on.
According to another embodiment, the evaluation information corresponding to the emotion information may be determined using a mapping relationship between the emotion information and the evaluation information. Negative emotion information generally corresponds to negative evaluation information, positive emotion information corresponds to positive evaluation information, and neutral emotion information corresponds to neutral evaluation information.
It should be noted that, the semantic information corresponding to the second user may be at least one type, and the emotion information of the second user may be at least one type. Or, the embodiment of the invention can comprehensively utilize various semantic information or various emotion information to obtain corresponding evaluation information; for example, a plurality of semantic information or a plurality of emotion information may be fused, and corresponding evaluation information may be obtained according to the obtained fusion result.
The embodiment of the invention can comprehensively utilize the second semantic information and the emotion information of the second user to determine the evaluation information of the second user for the first user. Specifically, the semantic evaluation information may be obtained by using the second semantic information, and the emotion evaluation information may be obtained by using the emotion information of the second user, and the evaluation information may be obtained according to the semantic evaluation information and the emotion evaluation information.
According to one embodiment, if the semantic rating information and the emotion rating information are matched, the semantic rating information and the emotion rating information are fused to obtain rating information.
According to another embodiment, if the semantic rating information and the emotion rating information do not match, the rating information is obtained from the emotion rating information. Because the authenticity of emotion is generally higher than that of language, the embodiment of the invention can consider that the priority of emotion evaluation information is higher than that of semantic evaluation information, and discard the semantic evaluation information and reserve the emotion evaluation information under the condition that the emotion evaluation information and the semantic evaluation information are not matched.
For example, in a job interview scenario, the interviewer asks the job seeker questions and the job seeker answers them; if, after hearing the job seeker's answer to a question, the interviewer says "good" but the accompanying head movement is a head shake, the interviewer's evaluation information can be regarded as negative evaluation information in this case.
In another example, in a business communication scenario or a social scenario, the second user says "let's keep in touch" or "I'll call you in a few days", but the accompanying head movement is a head shake and the facial expression is a frown or a lip curl, so the second user's evaluation information can be regarded as negative evaluation information in this case.
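A sketch of how semantic evaluation information and emotion evaluation information might be fused under the rule above (emotion takes priority on a mismatch); the keyword lists and the three-level labels are assumptions for illustration:

```python
POSITIVE_KEYWORDS = {"good", "satisfactory"}   # illustrative positive keywords
NEGATIVE_KEYWORDS = {"sorry", "regret"}        # illustrative negative keywords

def semantic_evaluation(second_user_text: str) -> str:
    """Derive semantic evaluation information from the second user's words."""
    words = set(second_user_text.lower().split())
    if words & NEGATIVE_KEYWORDS:
        return "negative"
    if words & POSITIVE_KEYWORDS:
        return "positive"
    return "neutral"

def fuse_evaluation(semantic_eval: str, emotion_eval: str) -> str:
    """When the two evaluations disagree, keep the emotion-based one, since
    emotion is treated as more authentic than language."""
    return emotion_eval if semantic_eval != emotion_eval else semantic_eval

# e.g. fuse_evaluation(semantic_evaluation("good"), "negative") -> "negative",
# matching the "says 'good' but shakes the head" example above.
```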
Note that, in addition to the second user's evaluation information for the first user, the prompt information may further include reason information corresponding to the evaluation information; the reason information may include the second user's voice information and its corresponding emotion information. For example, in the job interview scenario, the evaluation information is negative evaluation information, and the reason information includes: the interviewer shook his or her head while saying "good". In another example, in the business communication scenario, the evaluation information is positive evaluation information, and the reason information includes: the counterpart smiled throughout the conversation.
For prompt information 2, the trust information of the second user may characterize how trustworthy the second user is. The trust information may include: trusted, untrusted, neutral, and the like.
In the embodiment of the present invention, optionally, the trust information of the second user may be determined according to the emotion information of the second user and the mapping relationship between the emotion information and the trust information.
The embodiment of the invention can pre-establish the mapping relation between the emotion information and the trust information. For example, positive emotions correspond to trusted, negative emotions correspond to untrusted, neutral emotions correspond to neutral, etc. The mapping relationship between the emotion information and the trust information may be set by the user, or the dialogue corpus may be analyzed to obtain the mapping relationship between the emotion information and the trust information.
In an application example of the present invention, in a job interview scenario, the job seeker's emotion changes while answering a question posed by the interviewer, for example the emotion information changes from first emotion information to second emotion information, where the second emotion information is tension or unease; the probability that the job seeker is lying is then relatively high, so the job seeker's trust information can be regarded as untrusted. The somatosensory data corresponding to tension may include: a strained smile, hands clasped tightly together, touching parts of the body, and the like.
Note that, in addition to the trust information, the prompt information may include reason information corresponding to the trust information; the reason information may include the second user's voice information and its corresponding emotion information, for example that the second user suddenly became tense while answering the question "xxx".
For the prompt information 3, corresponding emotion prompt information can be obtained under the condition that the emotion information of the first user is preset emotion information so as to prompt the first user to adjust emotion.
For prompt information 4, the rhythm prompt information may include: completion information of the topics during the conversation.
Optionally, if the voice data of any participant does not match the topic corresponding to the voice data, and/or the duration of a topic corresponding to the voice data exceeds a first preset duration, corresponding rhythm prompt information is obtained.
If the voice data of any participant does not match the corresponding topic, the completion progress of that topic is affected, and corresponding rhythm prompt information can be output. For example, in a press interview scenario, if the interviewee's voice data for a topic does not match the topic, the interviewer may be prompted to change the topic. For another example, in a job interview scenario, if the voice data of the person answering a certain topic does not match that topic, that person may be prompted to answer the question from a different angle.
If the duration of the topic corresponding to the voice data exceeds the first preset duration, the user can be prompted to pick up the pace of the conversation. For example, in a press interview scenario, when the interviewer sets a first preset duration for a topic, the interviewer may be prompted to speed up if the duration of that topic exceeds the first preset duration.
Optionally, the rhythm prompt information may include: speech rate information of the first user, and the like. For example, the first user may set a speech rate threshold, and if the first user's speech rate information does not match the speech rate threshold, corresponding rhythm prompt information may be provided.
Optionally, the rhythm prompt information may include: speech continuity information of the first user, and the like. For example, the first user may pause between words within a sentence, or leave a long pause between one sentence and the next; the corresponding problem can be pointed out and encouraging information, such as "keep it up", can be given.
In the embodiment of the present invention, optionally, language continuity information of the first user may be determined by using a data analyzer, and input data of the data analyzer may be: the voice data of the first user, and the output data of the data analyzer may be language continuity information of the first user.
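A sketch of how rhythm prompt information could be assembled from a topic-duration check and a speech-rate check; the thresholds and the prompt wording are illustrative assumptions:

```python
def rhythm_prompts(topic_elapsed_s: float, topic_limit_s: float,
                   words_spoken: int, speaking_time_s: float,
                   rate_range=(100.0, 180.0)) -> list:
    """Return rhythm prompt messages based on topic duration and speech rate."""
    prompts = []
    if topic_elapsed_s > topic_limit_s:
        prompts.append("The current topic has exceeded its planned duration; consider moving on.")
    rate = words_spoken / (speaking_time_s / 60.0) if speaking_time_s > 0 else 0.0
    low, high = rate_range
    if rate > high:
        prompts.append("You are speaking too fast; slow down a little.")
    elif 0 < rate < low:
        prompts.append("You are speaking rather slowly; you may pick up the pace.")
    return prompts
```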
For prompt information 5, the dialogue quality information may be used to characterize the quality of the dialogue. The dialogue quality information can help the user accumulate conversation experience and overcome conversation problems, thereby improving the quality of subsequent conversations.
Optionally, the dialogue quality information may include at least one of the following information:
completion-ratio information of the topics in the voice data, for example the ratio between the number of completed topics and the total number of topics;
completion-time information of the topics in the voice data, for example the completion time of one or more topics;
voice quality information; and
logic information of the voice data.
The voice quality information may include: speech rate information, speech continuity information, and the like. For example, the voice quality information may include: "you stumbled at xx minutes xx seconds, and your speech was fluent the rest of the time"; or "you spoke too fast in period 1 and too slowly in period 3, and your speech rate was appropriate in the other periods", and so on.
The logic information of the voice data may characterize the coherence or tightness of the voice data during the conversation. The logic information may include: the coherence between sentences, the coherence between the first voice data and the second voice data, the coherence between topics, and so on. For example, an interviewer may be considered less logical if he or she jumps directly to the second topic without any transition after finishing the first topic. For another example, a job seeker may be considered to have poor logic if there is no connection between successive sentences while answering a question, and so on.
In practical applications, the logical information may be determined using a data analyzer. It will be appreciated that the embodiment of the present invention is not limited to a specific determination manner corresponding to the logic information.
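A sketch of how the first two dialogue quality items (topic completion ratio and completion times) could be computed; voice quality and logic information would come from separate analyzers, and the record layout below is an assumption:

```python
def dialogue_quality(completed_topics: list, all_topics: list,
                     topic_durations_s: dict) -> dict:
    """Summarise dialogue quality information for a finished conversation."""
    ratio = len(completed_topics) / len(all_topics) if all_topics else 0.0
    return {
        "completion_ratio": ratio,
        "completion_times_s": {t: topic_durations_s.get(t) for t in completed_topics},
    }

# e.g. dialogue_quality(["self-intro", "project"],
#                       ["self-intro", "project", "salary"],
#                       {"self-intro": 120.0, "project": 300.0})
# -> completion ratio of 2/3 plus the per-topic completion times
```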
For prompt information 6, optionally, the dialogue atmosphere information may be determined according to the emotion information corresponding to the first user and the second user respectively. For example, if both the first user and the second user correspond to positive emotion information, the dialogue atmosphere information may be regarded as "pleasant" or "engaged". For another example, if either party corresponds to negative emotion information, the dialogue atmosphere information may be regarded as "strained" or the like.
In step 203, outputting the prompt information specifically includes:
sending the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device plays the prompt information; and/or
sending the prompt information to the terminal corresponding to the earphone device during the conversation and/or after the conversation ends.
The prompt information is sent to the terminal so that the terminal can play or display it. It should be noted that the prompt information in the embodiments of the present invention may be intended only for the first user, in which case it need not be sent to the second user; of course, prompt information for the second user may also be obtained and sent to the second user.
In an alternative embodiment of the present invention, event keywords may be extracted from the voice data, the content of a memo entry may be created according to the event keywords, and corresponding memo information may thus be created. For example, the events may include: "parking position", "invoice header", and the like.
The embodiment of the invention can support searching the memo information. For example, "Where is my car parked?" can be used to search for the "parking position"; "What is the invoice header?" can be used to search for the "invoice header", and so on.
In an alternative embodiment of the present invention, event keywords, and the trigger-condition keywords associated with them, may be extracted from the voice data; the trigger condition of a reminder item is established according to the trigger-condition keywords and the event content of the reminder item is established according to the event keywords, so that a corresponding reminder item is created, improving the accuracy and intelligence of reminders.
For example, for voice data including "I'll call you back later", the event content of the reminder obtained may be "call back", and the trigger condition may be "later".
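A sketch of extracting an event keyword and its associated trigger-condition keyword to build a reminder item; the keyword patterns are illustrative assumptions rather than the disclosed extraction method:

```python
import re

EVENT_PATTERNS = {"call back": r"\bcall you back\b"}                     # illustrative
TRIGGER_PATTERNS = {"later": r"\blater\b", "tomorrow": r"\btomorrow\b"}  # illustrative

def build_reminder(utterance: str):
    """Assemble a reminder item from an event keyword and a trigger-condition keyword."""
    event = next((name for name, pat in EVENT_PATTERNS.items()
                  if re.search(pat, utterance, re.IGNORECASE)), None)
    if event is None:
        return None
    trigger = next((name for name, pat in TRIGGER_PATTERNS.items()
                    if re.search(pat, utterance, re.IGNORECASE)), None)
    return {"event": event, "trigger": trigger or "unspecified"}

# build_reminder("I'll call you back later") -> {"event": "call back", "trigger": "later"}
```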
In an alternative embodiment of the present invention, the voice data may be searched according to a search request. The search dimensions may include, but are not limited to, at least one of the following: conversation identity, conversation time, conversation place, conversation keywords, and the like. For example, the search request "the recording of last week's discussion with Wang Zong about the xx event" may include: the conversation time "last week", the conversation identity "Wang Zong", and the conversation keyword "xx event". Semantic analysis can be performed on the voice data to obtain the conversation keywords; it will be appreciated that embodiments of the present invention do not limit the specific way in which the conversation keywords are determined.
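A sketch of filtering stored conversation records by the search dimensions listed above; the record fields are assumptions about how the indexed data might be laid out:

```python
def search_records(records: list, identity=None, time_label=None,
                   place=None, keyword=None) -> list:
    """Filter conversation records by identity, time, place and keyword dimensions."""
    results = []
    for rec in records:
        if identity and rec.get("identity") != identity:
            continue
        if time_label and rec.get("time_label") != time_label:
            continue
        if place and rec.get("place") != place:
            continue
        if keyword and keyword not in rec.get("keywords", []):
            continue
        results.append(rec)
    return results

# e.g. search_records(records, identity="Wang Zong", time_label="last week", keyword="xx event")
```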
In summary, according to the voice processing method of the embodiment of the invention, the earphone device can output the prompt information in the process of the conversation and/or after the conversation is finished. The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so that the prompt information can prompt a user of problems in the conversation process, and the user can timely improve the problems in the conversation process, so that the conversation quality of the conversation can be improved; alternatively, the user may be enabled to improve the questions after the session is completed to enhance the session quality of subsequent sessions.
In addition, the prompt information of the embodiment of the present invention can present information related to the conversation, such as the evaluation information of the second user for the first user, the trust information of the second user, the dialogue quality information, and the like, so that the user can understand the state of the conversation, which helps the user make decisions on matters related to the conversation.
For example, in an interview scenario, assuming that the first user is a job seeker and the second user is an interviewer, the evaluation information of the interviewer for the job seeker allows the job seeker to learn more about the interview, which in turn helps the job seeker estimate the probability of a successful interview. If the first user is an interviewer and the second user is a job seeker, the trust information of the job seeker allows the interviewer to assess the credibility of the job seeker, which in turn helps the interviewer evaluate the job seeker more accurately.
As another example, in an interview scenario, the dialogue quality information can help the interviewee understand the quality of the interview and accumulate interview experience, so as to improve the quality of subsequent interviews.
For another example, in a speech-practice scenario, the information related to the conversation can help the user identify shortcomings in the speech, such as speaking too fast in period 1 or stuttering in period 2, so as to improve the quality of subsequent speech practice.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should appreciate that the embodiments of the present invention are not limited by the order of actions described, since some steps may be performed in other orders or simultaneously in accordance with the embodiments of the present invention. Further, those skilled in the art should understand that the embodiments described in the specification are all preferred embodiments, and that the actions involved are not necessarily required by the embodiments of the present invention.
Device embodiment
Referring to fig. 3, a block diagram of a voice processing apparatus according to an embodiment of the present invention is shown. The voice processing apparatus is applied to an earphone storage device and may include:
a receiving module 301, configured to receive voice data of a conversation from the earphone device; the participants of the conversation may include: at least two call users;
a determining module 302, configured to determine prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
a sending module 303, configured to send the prompt information to the earphone device during the conversation and/or after the conversation is finished, so that the earphone device outputs the prompt information.
Optionally, the prompt information may include at least one of the following information:
evaluation information of the second user for the first user;
trust information of the second user;
emotion prompt information;
rhythm prompting information;
dialogue quality information;
dialogue atmosphere information.
Optionally, the above voice processing apparatus may further include:
the evaluation information determining module is used for determining the evaluation information of the second user for the first user according to semantic information corresponding to the second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
Optionally, the above voice processing apparatus may further include:
The trust information determining module is used for determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
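As an illustrative sketch only, the mapping relation between emotion information and trust information could be realized as a simple lookup table; the emotion labels and trust levels below are invented for the example and are not part of the embodiment.

```python
# Hypothetical mapping relation from the second user's emotion information to
# trust information; the actual mapping relation is not specified here.
EMOTION_TO_TRUST = {
    "relaxed": "high trust",
    "calm": "medium trust",
    "nervous": "low trust",
    "evasive": "low trust",
}


def trust_information(second_user_emotion: str) -> str:
    """Look up the trust information corresponding to the emotion information."""
    return EMOTION_TO_TRUST.get(second_user_emotion, "unknown")


print(trust_information("nervous"))  # -> "low trust"
```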
Optionally, the above voice processing apparatus may further include:
the emotion prompt information determination module is used for obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
Optionally, the above voice processing apparatus may further include:
and the rhythm prompting information determining module is used for obtaining corresponding rhythm prompting information if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration.
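The rule described above can be sketched as follows; the topic-matching test (a naive keyword check) and the duration threshold are assumptions of this sketch rather than the actual implementation.

```python
from typing import Optional


def rhythm_prompt(participant_text: str, topic: str,
                  topic_duration_s: float,
                  max_duration_s: float = 300.0) -> Optional[str]:
    """Return rhythm prompt information when the participant's speech does not
    match the current topic and/or the topic has run longer than a preset
    duration (both checks are simplified for illustration)."""
    off_topic = topic.lower() not in participant_text.lower()
    overtime = topic_duration_s > max_duration_s
    if off_topic and overtime:
        return "Off topic and over time: please return to the topic and move on"
    if off_topic:
        return "Off topic: please return to the current topic"
    if overtime:
        return "The topic has exceeded the planned duration: consider moving on"
    return None


print(rhythm_prompt("Let's talk about the weather", "project schedule", 420))
```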
Optionally, the dialogue quality information may include at least one of the following information:
completion proportion information of the theme included in the voice data;
completion time information of the theme included in the voice data;
voice quality information; and
logic information of the voice data.
Optionally, the above voice processing apparatus may further include:
and the dialogue atmosphere information determining module is used for determining dialogue atmosphere information according to emotion information corresponding to the first user and the second user respectively.
Optionally, the emotion information is obtained according to the voice features corresponding to the voice data; and/or
the emotion information is obtained by analyzing somatosensory data of the user, where the somatosensory data is collected by the earphone device.
Optionally, the above voice features may include at least one of the following features: a mood characteristic, a cadence characteristic, and an intensity characteristic.
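Purely as an illustration of how tone, cadence, and intensity features might be combined, the sketch below maps them to an emotion label; the thresholds and labels are invented for the example and are not part of the embodiment.

```python
def estimate_emotion(tone_score: float, cadence_wpm: float, intensity_db: float) -> str:
    """Map voice features (tone, cadence, intensity) to a coarse emotion label;
    all thresholds are hypothetical."""
    if intensity_db > 75 and cadence_wpm > 180:
        return "agitated"      # loud and fast speech
    if tone_score < 0.3 and cadence_wpm < 100:
        return "low-spirited"  # flat tone, slow speech
    return "calm"


print(estimate_emotion(tone_score=0.8, cadence_wpm=190, intensity_db=80))  # -> "agitated"
```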
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described in detail herein.
Fig. 4 is a block diagram illustrating an apparatus 1300 for speech processing according to an example embodiment. For example, apparatus 1300 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, apparatus 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316.
The processing component 1302 generally controls overall operation of the apparatus 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1302 can include one or more modules that facilitate interactions between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operations at the device 1300. Examples of such data include instructions for any application or method operating on the apparatus 1300, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1304 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 1306 provides power to the various components of the device 1300. The power supply components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1300.
The multimedia component 1308 includes a screen between the device 1300 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1308 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1300 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1310 is configured to output and/or input audio signals. For example, the audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1300 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may be further stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1314 includes one or more sensors for providing status assessment of various aspects of the apparatus 1300. For example, the sensor assembly 1314 may detect the on/off state of the device 1300 and the relative positioning of components, such as the display and keypad of the apparatus 1300; the sensor assembly 1314 may also detect a change in position of the apparatus 1300 or one of its components, the presence or absence of user contact with the apparatus 1300, the orientation or acceleration/deceleration of the apparatus 1300, and a change in temperature of the apparatus 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate wired or wireless communication between the apparatus 1300 and other devices. The apparatus 1300 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 1304, including instructions executable by processor 1320 of apparatus 1300 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer readable storage medium, the instructions in which, when executed by a processor of a terminal, cause the terminal to perform a voice processing method, the method comprising: receiving voice data of a conversation from a headset device; the participants of the conversation include: at least two call users; determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data; and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
Fig. 5 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Wherein the memory 1932 and storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations in a server. Still further, a central processor 1922 may be provided in communication with a storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
The embodiment of the invention discloses A1, a voice processing method, which is applied to an earphone storage device, and comprises the following steps:
receiving voice data of a conversation from a headset device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
A2, according to the method of A1, the prompt information comprises at least one of the following information:
evaluation information of the second user for the first user;
trust information of the second user;
emotion prompt information;
rhythm prompting information;
dialogue quality information;
dialogue atmosphere information.
A3, the method of A2, the method further comprising:
determining evaluation information of a second user for the first user according to semantic information corresponding to the second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
A4, the method of A2, the method further comprising:
and determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
A5, the method of A2, the method further comprising:
and under the condition that the emotion information of the first user is preset emotion information, obtaining corresponding emotion prompt information.
A6, the method of A2, the method further comprising:
and if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds the first preset duration, obtaining corresponding rhythm prompting information.
A7, the method according to A2, wherein the dialogue quality information comprises at least one of the following information:
completion proportion information of the theme included in the voice data;
completion time information of the theme included in the voice data;
voice quality information; and
logic information of the voice data.
A8, the method of A2, the method further comprising:
and determining dialogue atmosphere information according to the emotion information corresponding to the first user and the second user respectively.
A9, according to the method of any one of A1 to A8, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
the emotion information is obtained by analyzing somatosensory data of the user, where the somatosensory data is collected by the earphone device.
A10, the method of A9, the speech features comprising at least one of the following features: a mood characteristic, a cadence characteristic, and an intensity characteristic.
The embodiment of the invention discloses a voice processing device B11, which is applied to an earphone storage device, and comprises:
a receiving module for receiving voice data of a conversation from the earphone device; the participants of the conversation include: at least two call users;
the determining module is used for determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and the sending module is used for sending the prompt information to the earphone device in the conversation process and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
B12, the voice processing device according to B11, wherein the prompt information comprises at least one of the following information:
evaluation information of the second user for the first user;
trust information of the second user;
emotion prompt information;
rhythm prompting information;
dialogue quality information;
dialogue atmosphere information.
B13, the speech processing device according to B12, the speech processing device further comprising:
the evaluation information determining module is used for determining the evaluation information of the second user for the first user according to semantic information corresponding to the second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
B14, the speech processing device of B12, the speech processing device further comprising:
the trust information determining module is used for determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
B15, the speech processing device according to B12, further comprising:
the emotion prompt information determination module is used for obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
B16, the speech processing device according to B12, further comprising:
And the rhythm prompting information determining module is used for obtaining corresponding rhythm prompting information if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration.
B17, the speech processing device according to B12, the dialogue quality information includes at least one of the following information:
completion proportion information of the theme included in the voice data;
completion time information of the theme included in the voice data;
voice quality information; and
logic information of the voice data.
B18, the speech processing device according to B12, the speech processing device further comprising:
and the dialogue atmosphere information determining module is used for determining dialogue atmosphere information according to emotion information corresponding to the first user and the second user respectively.
B19, according to any one of the voice processing devices from B11 to B18, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
the emotion information is obtained by analyzing somatosensory data of the user, where the somatosensory data is collected by the earphone device.
B20, the speech processing device of B19, the speech features including at least one of: a mood characteristic, a cadence characteristic, and an intensity characteristic.
The embodiment of the invention discloses a C21, a device for voice processing, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory, and are configured to be executed by one or more processors, and the one or more programs comprise instructions for:
receiving voice data of a conversation from a headset device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
C22, the device according to C21, the prompt information includes at least one of the following information:
evaluation information of the second user for the first user;
trust information of the second user;
emotion prompt information;
rhythm prompting information;
dialogue quality information;
dialogue atmosphere information.
C23, the device of C22, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
Determining evaluation information of a second user for the first user according to semantic information corresponding to the second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
C24, the device of C22, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
and determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
C25, the device of C22, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
and under the condition that the emotion information of the first user is preset emotion information, obtaining corresponding emotion prompt information.
C26, the device of C22, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
and if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds the first preset duration, obtaining corresponding rhythm prompting information.
C27, the apparatus of C22, the dialogue quality information comprising at least one of the following information:
completion proportion information of the theme included in the voice data;
completion time information of the theme included in the voice data;
voice quality information; and
logic information of the voice data.
C28, the device of C22, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
and determining dialogue atmosphere information according to the emotion information corresponding to the first user and the second user respectively.
C29, according to any one of the devices from C21 to C28, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
the emotion information is obtained by analyzing somatosensory data of the user, where the somatosensory data is collected by the earphone device.
C30, the apparatus of C29, the speech features comprising at least one of: a mood characteristic, a cadence characteristic, and an intensity characteristic.
Embodiments of the invention disclose D31, one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a method as described in one or more of A1-a 10.
The foregoing has outlined a speech processing method, a speech processing apparatus and a device for speech processing in detail, wherein specific examples are presented herein to illustrate the principles and embodiments of the present invention and to help understand the method and core concepts thereof; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (25)

1. A voice processing method, applied to an earphone storage device, the method being applied to a job interview scenario or an interview scenario, comprising:
receiving voice data of a conversation from a headset device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and emotion information corresponding to the voice data;
sending the prompt information to the earphone device during the conversation and after the conversation is finished, so that the earphone device outputs the prompt information; the prompt information comprises: evaluation information of the second user for the first user, trust information for the first user, rhythm prompt information, and dialogue quality information; the trust information is obtained according to emotion changes of the first user; the dialogue quality information comprises: completion proportion information of the theme included in the voice data and completion time information of the theme;
Determining evaluation information of a second user for the first user according to semantic information corresponding to the second voice data in the voice data and emotion information of the second user; the second voice data corresponds to the second user; the emotion information is obtained by analyzing somatosensory data of a user, and the somatosensory data comprises: motion information of the user's head and facial expression of the user;
wherein the determining the evaluation information of the second user for the first user includes: semantic evaluation information is obtained by utilizing semantic information corresponding to the second voice data, and emotion evaluation information is obtained by utilizing emotion information of the second user; if the semantic evaluation information is matched with the emotion evaluation information, fusing the semantic evaluation information and the emotion evaluation information to obtain the evaluation information; if the semantic evaluation information is not matched with the emotion evaluation information, obtaining evaluation information according to the emotion evaluation information;
and if the voice data of any participant is not matched with the theme corresponding to the voice data and the duration of the theme corresponding to the voice data exceeds the first preset duration, obtaining corresponding rhythm prompting information.
2. The method of claim 1, wherein the prompt information further comprises at least one of the following:
Emotion prompt information;
dialogue atmosphere information.
3. The method according to claim 2, wherein the method further comprises:
and determining the trust information of the second user for the first user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
4. The method according to claim 2, wherein the method further comprises:
and under the condition that the emotion information of the first user is preset emotion information, obtaining corresponding emotion prompt information.
5. The method of claim 2, wherein the dialogue quality information further comprises at least one of the following:
voice quality information; and
logic information of the voice data.
6. The method according to claim 2, wherein the method further comprises:
and determining dialogue atmosphere information according to the emotion information corresponding to the first user and the second user respectively.
7. The method according to any one of claims 1 to 6, wherein the emotion information is obtained according to a voice feature corresponding to the voice data; and/or
the somatosensory data is collected by the earphone device.
8. The method of claim 7, wherein the speech features include at least one of the following features: a mood characteristic, a cadence characteristic, and an intensity characteristic.
9. A speech processing device, applied to an earphone storage device, the speech processing device being applied to a job interview scenario or an interview scenario, comprising:
a receiving module for receiving voice data of a conversation from the earphone device; the participants of the conversation include: at least two call users;
the determining module is used for determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and emotion information corresponding to the voice data;
the sending module is used for sending the prompt information to the earphone device during the conversation and after the conversation is finished, so as to enable the earphone device to output the prompt information; the prompt information comprises: evaluation information of the second user for the first user, trust information for the first user, rhythm prompt information, and dialogue quality information; the trust information is obtained according to emotion changes of the first user; the dialogue quality information comprises: completion proportion information of the theme included in the voice data and completion time information of the theme;
The evaluation information determining module is used for determining the evaluation information of the second user for the first user according to semantic information corresponding to the second voice data in the voice data and emotion information of the second user; the second voice data corresponds to the second user; the emotion information is obtained by analyzing somatosensory data of a user, and the somatosensory data comprises: motion information of the user's head and facial expression of the user;
wherein the determining the evaluation information of the second user for the first user includes: semantic evaluation information is obtained by utilizing semantic information corresponding to the second voice data, and emotion evaluation information is obtained by utilizing emotion information of the second user; if the semantic evaluation information is matched with the emotion evaluation information, fusing the semantic evaluation information and the emotion evaluation information to obtain the evaluation information; if the semantic evaluation information is not matched with the emotion evaluation information, obtaining evaluation information according to the emotion evaluation information;
and if the voice data of any participant is not matched with the theme corresponding to the voice data and the duration of the theme corresponding to the voice data exceeds the first preset duration, obtaining corresponding rhythm prompting information.
10. The speech processing device of claim 9, wherein the prompt information further comprises at least one of the following:
emotion prompt information;
dialogue atmosphere information.
11. The speech processing apparatus of claim 10 wherein the speech processing apparatus further comprises:
the trust information determining module is used for determining the trust information of the second user for the first user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
12. The speech processing apparatus of claim 10 wherein the speech processing apparatus further comprises:
the emotion prompt information determination module is used for obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
13. The speech processing apparatus of claim 10, wherein the dialogue quality information further comprises at least one of the following:
voice quality information; and
logic information of the voice data.
14. The speech processing apparatus of claim 10 wherein the speech processing apparatus further comprises:
And the dialogue atmosphere information determining module is used for determining dialogue atmosphere information according to emotion information corresponding to the first user and the second user respectively.
15. The voice processing apparatus according to any one of claims 9 to 14, wherein the emotion information is obtained based on voice features corresponding to the voice data; and/or
the somatosensory data is collected by the earphone device.
16. The speech processing apparatus of claim 15 wherein the speech features comprise at least one of the following features: a mood characteristic, a cadence characteristic, and an intensity characteristic.
17. An apparatus for speech processing, the apparatus being applied to a job interview scenario or an interview scenario, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice data of a conversation from a headset device; the participants of the conversation include: at least two call users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and emotion information corresponding to the voice data;
sending the prompt information to the earphone device during the conversation and after the conversation is finished, so that the earphone device outputs the prompt information; the prompt information comprises: evaluation information of the second user for the first user, trust information for the first user, rhythm prompt information, and dialogue quality information; the trust information is obtained according to emotion changes of the first user; the dialogue quality information comprises: completion proportion information of the theme included in the voice data and completion time information of the theme;
determining evaluation information of a second user for the first user according to semantic information corresponding to the second voice data in the voice data and emotion information of the second user; the second voice data corresponds to the second user; the emotion information is obtained by analyzing somatosensory data of a user, and the somatosensory data comprises: motion information of the user's head and facial expression of the user;
wherein the determining the evaluation information of the second user for the first user includes: semantic evaluation information is obtained by utilizing semantic information corresponding to the second voice data, and emotion evaluation information is obtained by utilizing emotion information of the second user; if the semantic evaluation information is matched with the emotion evaluation information, fusing the semantic evaluation information and the emotion evaluation information to obtain the evaluation information; if the semantic evaluation information is not matched with the emotion evaluation information, obtaining evaluation information according to the emotion evaluation information;
And if the voice data of any participant is not matched with the theme corresponding to the voice data and the duration of the theme corresponding to the voice data exceeds the first preset duration, obtaining corresponding rhythm prompting information.
18. The apparatus of claim 17, wherein the prompt information further comprises at least one of the following:
emotion prompt information;
dialogue atmosphere information.
19. The device of claim 17, wherein the device is further configured to be executed by one or more processors, the one or more programs including instructions for:
and determining the trust information of the second user for the first user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
20. The device of claim 19, wherein the device is further configured to be executed by one or more processors, the one or more programs including instructions for:
and under the condition that the emotion information of the first user is preset emotion information, obtaining corresponding emotion prompt information.
21. The apparatus of claim 19, wherein the dialogue quality information further comprises at least one of the following:
voice quality information; and
logic information of the voice data.
22. The device of claim 19, wherein the device is further configured to be executed by one or more processors, the one or more programs including instructions for:
and determining dialogue atmosphere information according to the emotion information corresponding to the first user and the second user respectively.
23. The apparatus according to any one of claims 17 to 22, wherein the emotion information is obtained from a voice feature corresponding to the voice data; and/or
the emotion information is obtained by analyzing somatosensory data of the user, and the somatosensory data is collected by the earphone device.
24. The apparatus of claim 23, wherein the speech features comprise at least one of the following features: a mood characteristic, a cadence characteristic, and an intensity characteristic.
25. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1 to 8.
CN202010507539.1A 2020-06-05 2020-06-05 Voice processing method, device and medium Active CN111696537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010507539.1A CN111696537B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010507539.1A CN111696537B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Publications (2)

Publication Number Publication Date
CN111696537A CN111696537A (en) 2020-09-22
CN111696537B true CN111696537B (en) 2023-10-31

Family

ID=72479545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010507539.1A Active CN111696537B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Country Status (1)

Country Link
CN (1) CN111696537B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331179A (en) * 2020-11-11 2021-02-05 北京搜狗科技发展有限公司 Data processing method and earphone accommodating device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010092042A (en) * 2000-03-20 2001-10-24 조정남 Apparatus and method for communicating with moble phone by using earphone remote contrllor
CN103491251A (en) * 2013-09-24 2014-01-01 深圳市金立通信设备有限公司 Method and terminal for monitoring user calls
CN104091153A (en) * 2014-07-03 2014-10-08 苏州工业职业技术学院 Emotion judgment method applied to chatting robot
CN104616666A (en) * 2015-03-03 2015-05-13 广东小天才科技有限公司 Method and device for improving dialogue communication effect based on speech analysis
CN107301168A (en) * 2017-06-01 2017-10-27 深圳市朗空亿科科技有限公司 Intelligent robot and its mood exchange method, system
CN109663196A (en) * 2019-01-24 2019-04-23 聊城大学 A kind of conductor and musical therapy system
CN110289000A (en) * 2019-05-27 2019-09-27 北京蓦然认知科技有限公司 A kind of audio recognition method, device
CN110389667A (en) * 2018-04-17 2019-10-29 北京搜狗科技发展有限公司 A kind of input method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104427088A (en) * 2013-08-22 2015-03-18 深圳富泰宏精密工业有限公司 Communication notification control system and method

Also Published As

Publication number Publication date
CN111696537A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696538B (en) Voice processing method, device and medium
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN108363706B (en) Method and device for man-machine dialogue interaction
CN107644646B (en) Voice processing method and device for voice processing
US20120163677A1 (en) Automatic identifying
CN113362812B (en) Voice recognition method and device and electronic equipment
KR101891496B1 (en) Interactive ai agent system and method for actively monitoring and joining a dialogue session among users, computer readable recording medium
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
CN111241822A (en) Emotion discovery and dispersion method and device under input scene
CN111696536B (en) Voice processing method, device and medium
US20180054688A1 (en) Personal Audio Lifestyle Analytics and Behavior Modification Feedback
CN108648754B (en) Voice control method and device
CN110990534B (en) Data processing method and device for data processing
CN111128183A (en) Speech recognition method, apparatus and medium
CN112037756A (en) Voice processing method, apparatus and medium
CN112068711A (en) Information recommendation method and device of input method and electronic equipment
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN111696537B (en) Voice processing method, device and medium
CN110970015B (en) Voice processing method and device and electronic equipment
CN113689880B (en) Method, device, electronic equipment and medium for driving virtual person in real time
CN113539261A (en) Man-machine voice interaction method and device, computer equipment and storage medium
CN112988956A (en) Method and device for automatically generating conversation and method and device for detecting information recommendation effect
CN112863499B (en) Speech recognition method and device, storage medium
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN114356068B (en) Data processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant