CN111696536A - Voice processing method, apparatus and medium - Google Patents

Voice processing method, apparatus and medium

Info

Publication number
CN111696536A
CN111696536A (application CN202010507500.XA)
Authority
CN
China
Prior art keywords
information
user
voice data
conversation
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010507500.XA
Other languages
Chinese (zh)
Other versions
CN111696536B (English)
Inventor
王颖
李健涛
张丹
刘宝
张硕
杨天府
梁宵
荣河江
李鹏翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Intelligent Technology Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010507500.XA priority Critical patent/CN111696536B/en
Publication of CN111696536A publication Critical patent/CN111696536A/en
Application granted granted Critical
Publication of CN111696536B publication Critical patent/CN111696536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08B - SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 21/00 - Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B 21/18 - Status alarms
    • G08B 21/24 - Reminder alarms, e.g. anti-loss alarms
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 - Circuits for transducers, loudspeakers or microphones
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • General Physics & Mathematics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Business, Economics & Management (AREA)
  • Hospice & Palliative Care (AREA)
  • Emergency Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention provides a voice processing method, a voice processing apparatus, and an apparatus for voice processing, applied to a server. The method includes: receiving voice data of a conversation collected by an earphone device, where the participants of the conversation include at least two talking users; determining prompt information corresponding to the voice data, where the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data; and sending the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device outputs the prompt information. The embodiment of the invention can improve the conversation quality of the current conversation or of subsequent conversations.

Description

Voice processing method, apparatus and medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, and a machine-readable medium.
Background
As one of the most natural communication methods, voice is widely used in voice processing scenarios such as voice calls, voice-based social networking, karaoke (KTV), live streaming, games, and video recording.
Currently, captured speech is typically used directly in such scenarios: for example, the collected voice is sent to the peer communication terminal, or a captured recording is carried in a video.
In practical applications, a user may be dissatisfied with the captured speech and therefore want to beautify it. For example, some users wish to beautify their voice in order to engage their audience and build confidence.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a speech processing method, a speech processing apparatus, and an apparatus for speech processing that overcome, or at least partially solve, the foregoing problems.
In order to solve the above problem, the present invention discloses a speech processing method, comprising:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
In another aspect, an embodiment of the present invention discloses a speech processing apparatus, including:
a receiving module, configured to receive voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
a determining module, configured to determine prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and a sending module, configured to send the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device outputs the prompt information.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
Also disclosed are one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the foregoing methods.
The embodiment of the invention has the following advantages:
the server side of the embodiment of the invention can output prompt information in the conversation process and/or after the conversation is finished. The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so that the prompt information can prompt a user about problems in a conversation process, the user can improve the problems in the conversation process in time, and the conversation quality of the conversation can be improved; or, the user can improve the problem after the conversation is finished so as to improve the conversation quality of the subsequent conversation.
In addition, the prompt information of the embodiment of the present invention may prompt the relevant information of the dialog, such as the evaluation information of the second user for the first user, the trust information of the second user, or the dialog quality information, so that the user can know the dialog condition to help the user to decide the transaction related to the dialog.
For example, in an interview scenario, if the first user is a job seeker and the second user is an interviewer, the job seeker can obtain the interviewer's evaluation information, learn more about how the interview went, and better judge the probability that the interview will succeed. If the first user is an interviewer and the second user is a job seeker, the trust information of the job seeker lets the interviewer gauge the job seeker's credibility and thus evaluate the job seeker more accurately.
For another example, in an interview scene, the dialogue quality information can help the interviewer to know the interview quality and accumulate interview experience so as to improve the interview quality of subsequent interviews.
For another example, in a speech practice scene, the relevant information of the conversation can help the user recognize shortcomings in the speech, such as speaking too fast in time interval 1 or speaking haltingly in time interval 2, so as to improve the quality of subsequent speech practice.
Drawings
FIG. 1 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a speech processing method of the present invention;
FIG. 3 is a block diagram of a speech processing apparatus according to the present invention;
FIG. 4 is a block diagram of an apparatus 1300 for speech processing of the present invention; and
fig. 5 is a schematic structural diagram of a server according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention can be applied to conversation scenes. The dialog scenario may include a communication-based dialog scenario, such as an operator-network-based dialog scenario or an Internet-based dialog scenario. Alternatively, the dialog scenario may include an in-person dialog scenario, such as a face-to-face interview scenario.
Depending on the domain to which the dialog relates, the dialog scenario may include: interview scenes, business communication scenes, speech practice scenes and the like.
Depending on the type of conversation, a conversation scenario may include: a voice conversation scene, or a video conversation scene, etc., it is understood that the embodiment of the present invention does not limit the specific conversation scene.
The embodiment of the invention provides a voice processing scheme, which can be executed by a server and specifically includes: receiving voice data of a conversation collected by the earphone device; the participants of the conversation specifically include: at least two talking users; determining prompt information corresponding to the voice data; the prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data; and sending the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device outputs the prompt information.
The server side of the embodiment of the invention can receive the voice data of the conversation. The voice data may include: voice data of at least one participant.
In one embodiment of the invention, the participants may include: a first user and a second user. The first user may refer to a home user wearing the headset device. The second user may refer to an opposite user, and the second user may or may not wear the headset device.
Of course, the participants may include, in addition to the first user and the second user: a third user and a fourth user, etc.
The server side of the embodiment of the invention can output prompt information in the conversation process and/or after the conversation is finished. Because the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data, the prompt information can prompt a user about a problem in the conversation process, so that the user can improve the problem in time in the conversation process, and the conversation quality of the conversation can be improved; or, the user can improve the problem after the conversation is finished so as to improve the conversation quality of the subsequent conversation.
The prompt information of the embodiment of the invention can prompt the relevant information of the dialog, such as the evaluation information of the second user aiming at the first user, the trust information of the second user, the dialog quality information and the like, so that the user can know the dialog condition to help the user to decide the relevant affairs of the dialog.
For example, in an interview scenario, if the first user is a job seeker and the second user is an interviewer, the job seeker can obtain the interviewer's evaluation information, learn more about how the interview went, and better judge the probability that the interview will succeed. If the first user is an interviewer and the second user is a job seeker, the trust information of the job seeker lets the interviewer gauge the job seeker's credibility and thus evaluate the job seeker more accurately.
For another example, in an interview scene, the dialogue quality information can help the interviewer to know the interview quality and accumulate interview experience so as to improve the interview quality of subsequent interviews.
For another example, in a speech practice scene, the relevant information of the conversation can help the user recognize shortcomings in the speech, such as speaking too fast in time interval 1 or speaking haltingly in time interval 2, so as to improve the quality of subsequent speech practice.
The prompt information of the embodiment of the present invention may be directed to any party in the conversation. For example, the earphone device of the first user may obtain prompt information for the first user and present it to the first user; or it may obtain prompt information for the second user and present it to the second user.
The earphone device of the embodiment of the present invention may be a headset, such as a bluetooth earphone, a sports earphone, a True Wireless Stereo (TWS) earphone, or an Artificial Intelligence (AI) earphone.
Optionally, the earphone device may include a plurality of microphone elements, a processor, and a loudspeaker.
The plurality of microphone elements may pick up voice data within a preset angle range. The processor is used for determining prompt information corresponding to the voice data.
According to one embodiment, the processor of the headset device may process the voice data to obtain the prompt information.
According to another embodiment, because the physical size of the earphone device is limited, the task of processing the voice data may be handed over to an external device so as to keep the earphone device small. Accordingly, the processor of the earphone device can exchange data with the external device to obtain the prompt information produced by the external device. The loudspeaker is used for playing sound, such as the prompt information.
The external device may include: a terminal and/or an earphone accommodating device. Of course, the external device may also include a server.
Optionally, the terminal may include: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, smart speakers, and the like. It is understood that the embodiment of the present invention does not limit the specific terminal.
The earphone accommodating device can be used to house the earphone device. Optionally, the earphone accommodating device is further configured to supply power to the earphone device. The earphone accommodating device of the embodiment of the invention can also receive voice data from the earphone device and process the voice data to obtain prompt information.
In practical applications, the earphone accommodating device may be an earphone case. The earphone device and the earphone accommodating device can be sold separately or as a set.
In the embodiment of the present invention, the connection mode between the earphone device and the external device may include: a wired connection mode or a wireless connection mode. The connection modes specifically include but are not limited to: a physical connection, a bluetooth connection, an infrared connection, or a WIFI (Wireless Fidelity) connection, etc. It is understood that the embodiment of the present invention does not limit the specific connection manner between the earphone device and the external device.
In an optional embodiment of the present invention, the earphone accommodating device, acting as an external device of the earphone device, may be provided with a processing chip, and the processing chip may process the voice data by using the voice processing method of the embodiment of the present invention.
In another optional embodiment of the present invention, the earphone accommodating device, acting as the external device of the earphone device, hands the task of processing the voice data over to the server. Specifically, the earphone accommodating device exchanges data with the server; for example, it may send the voice data collected by the earphone device to the server so that the server processes the voice data, and it may also send the prompt information obtained from that processing to the earphone device.
Referring to fig. 1, a schematic structural diagram of a speech processing system according to an embodiment of the present invention is shown, which specifically includes: the earphone device 101, the earphone accommodating device 102, the server 103, and the mobile terminal 104.
The earphone device 101 is connected to the earphone accommodating device 102 via Bluetooth, and to the mobile terminal 104 via Bluetooth.
While using the mobile terminal 104, the first user wears the earphone device 101 and can both hear sound and speak through it.
The earphone accommodating device 102 has cellular and wireless networking capabilities and can exchange data with the server 103. For example, the earphone accommodating device 102 may receive voice data collected by the earphone device and send the voice data to the server 103; it may also forward the prompt information produced by the server 103 to the earphone device.
In this embodiment of the present invention, optionally, a first processor and a second processor are disposed on the two sides of the earphone device 101, where the first processor is used for data interaction with the earphone accommodating device 102 and the second processor is used for data interaction with the mobile terminal 104.
For example, during a conversation carried out using the mobile terminal 104, the earphone device 101 may collect voice data of the participants, obtain prompt information corresponding to the voice data in real time, and output the prompt information to the first user.
In the embodiment of the present invention, optionally, the earphone device 101 may play the prompt message, so that the user can improve the problem of the user in the conversation according to the prompt message.
In this embodiment of the present invention, optionally, the earphone device 101 may include a first side and a second side, where the first side is used for playing voice data, and the second side is used for playing prompt information. Of course, the embodiment of the present invention does not limit the specific playing side of the voice data and the prompt message.
Method embodiment one
Referring to fig. 2, a flowchart illustrating steps of a first embodiment of a speech processing method according to the present invention is shown, and is applied to a server, where the method specifically includes the following steps:
step 201, receiving voice data of a conversation collected by an earphone device; the participants of the conversation specifically include: at least two talking users;
step 202, determining prompt information corresponding to the voice data; the prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data;
step 203, sending the prompt information to the earphone device during the conversation and/or after the conversation ends, so that the earphone device outputs the prompt information.
In step 201, the earphone device may collect voice data generated by a participant by using a microphone element, and send the voice data to a server. The earphone device and the server can be in direct communication or indirect communication.
In step 202, the server may process the voice data to obtain a prompt message.
The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so that the prompt information can prompt a user about problems occurring in the conversation process.
Semantic information is one of the forms in which information is expressed, and refers to information with a definite meaning that can remove uncertainty about an object. In the embodiment of the invention, semantic analysis can be performed on the voice data to obtain corresponding semantic information. Available semantic analysis methods may include a keyword extraction method, a sentence-component analysis method, or a machine learning method; the embodiment of the present invention does not limit the specific semantic analysis method.
In the embodiment of the present invention, optionally, a speech recognition method may be used to convert speech data into a dialog text, and perform semantic analysis on the dialog text to obtain corresponding semantic information.
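As an illustration only (this code is not part of the patent disclosure), the following Python sketch shows one possible way to turn conversation audio into dialog text and extract simple keyword-style semantic information; the fake_asr stub stands in for a real speech-recognition engine, and the stopword list and keyword logic are hypothetical.

    import re
    from collections import Counter

    def fake_asr(voice_data: bytes) -> str:
        # Placeholder: a real system would call a speech-recognition service here.
        return "thank you for coming in, could you describe the project you led last year"

    STOPWORDS = {"the", "you", "for", "could", "in", "a", "last", "to"}

    def extract_semantic_keywords(dialog_text, top_k=5):
        # Very rough keyword extraction as one possible form of "semantic information".
        words = re.findall(r"[a-z]+", dialog_text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
        return [w for w, _ in counts.most_common(top_k)]

    text = fake_asr(b"\x00\x01")             # raw audio bytes from the earphone device
    print(extract_semantic_keywords(text))   # e.g. ['thank', 'coming', 'describe', ...]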
In this embodiment of the present invention, optionally, the first dialog text and the second dialog text in the dialog text may be identified according to the dialog identity information, and the first semantic information and the second semantic information corresponding to the first dialog text and the second dialog text, respectively, may be determined. The first dialog text and the second dialog text may correspond to different dialog identity information, for example, the first dialog text corresponds to a first user, the second dialog text corresponds to a second user, and the like.
In the embodiment of the present invention, optionally, the dialog identity information may be determined by using a voiceprint recognition method. Voiceprint recognition identifies the speaker of a given utterance from voice parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics. Because different users have different voiceprints, voiceprint recognition can be used to determine the dialog identity information of each user.
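As a further illustration (again not part of the patent text), a minimal sketch of assigning dialog identity by comparing a voice segment's voiceprint embedding against enrolled user embeddings follows; the enrolled vectors, the cosine-similarity rule, and the 0.7 threshold are all assumptions, and a real system would use a proper voiceprint model.

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    ENROLLED = {                      # hypothetical enrolled voiceprint embeddings
        "first_user":  [0.9, 0.1, 0.2],
        "second_user": [0.1, 0.8, 0.3],
    }

    def identify_speaker(segment_embedding, threshold=0.7):
        best_id, best_score = "unknown", 0.0
        for user_id, ref in ENROLLED.items():
            score = cosine(segment_embedding, ref)
            if score > best_score:
                best_id, best_score = user_id, score
        return best_id if best_score >= threshold else "unknown"

    print(identify_speaker([0.88, 0.12, 0.25]))   # -> first_user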
In an optional embodiment of the present invention, if the semantic information indicates that the voice data of any participant does not match the topic corresponding to the voice data, corresponding prompt information may be output to that participant.
A topic is the central idea that the voice data is meant to express, and generally refers to its main content. A topic analysis method may be employed to determine the topic corresponding to the voice data. It is to be understood that the voice data may include at least one topic.
For example, speech data is analyzed by time, and different times may correspond to different topics. For another example, the voice data is analyzed according to the conversation identity information, and different conversation identity information can correspond to different topics.
For example, in an interview scenario, the interviewer sets several topics and plans to guide the conversation according to the set topics; if, during the actual conversation, the interviewer's voice data does not match the topics the interviewer has set, corresponding prompt information can be output to the interviewer so that the interviewer can switch topics as needed.
For another example, in an interview scenario, an interviewer sets a plurality of topics and plans to guide a conversation according to the set topics; however, in the actual conversation process, if the voice data of the job seeker is not matched with the theme set by the interviewer, the corresponding prompt information can be output to the job seeker, so that the job seeker can adjust the voice content of the job seeker according to the actual situation. Or, corresponding prompt information can be output to the interviewer, so that the interviewer can switch themes according to requirements.
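A minimal sketch of the topic-mismatch check described above, assuming that both the utterance and the planned topic have already been reduced to keyword sets (the overlap threshold of 0.3 is purely illustrative):

    def topic_match_score(utterance_keywords, topic_keywords):
        if not topic_keywords:
            return 0.0
        return len(set(utterance_keywords) & set(topic_keywords)) / len(topic_keywords)

    def topic_prompt(utterance_keywords, topic_keywords, threshold=0.3):
        score = topic_match_score(utterance_keywords, topic_keywords)
        if score < threshold:
            return "The current answer seems off-topic; consider switching or refocusing the topic."
        return None

    # Utterance about salary/vacation versus a planned topic about the project:
    print(topic_prompt({"salary", "vacation"}, {"project", "architecture", "team"}))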
In the embodiment of the invention, emotion refers to psychological experiences such as joy, anger, sorrow, happiness, and fear, which reflect a person's attitude toward objective things. Emotions have positive and negative properties. Things that meet a person's needs give rise to positive experiences, such as happiness and satisfaction; things that do not meet a person's needs give rise to negative experiences, such as anger, hatred, and sadness.
In an alternative embodiment of the invention, the emotion information may include positive emotions, which are constructive and active, or negative emotions, which are destructive and passive. Negative emotions may include, but are not limited to: urgency, anxiety, tension, anger, depression, sadness, pain, and boredom. Positive emotions may include, but are not limited to: happiness, optimism, confidence, enjoyment, and relaxation. Optionally, the emotion information may further include neutral emotions, such as calmness.
In the embodiment of the present invention, optionally, the emotion information may be obtained according to a voice feature corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user.
The speech features may characterize aspects of speech. The speech features include at least one of the following features: mood features, rhythm features, and intensity features.
For example, in a stressful situation, the normal vibration of the vocal organs is suppressed, and at this time, the vibration of the vocal organs cannot be artificially controlled during speaking, so that the emotional information of the user can be obtained by monitoring the voice characteristics of the user.
The embodiment of the invention can determine the emotion information of the user by utilizing the mapping relation between the voice characteristics and the emotion information.
It should be noted that, in the embodiment of the present invention, the mapping relationship may be represented by a data table, that is, a field corresponding to the mapping relationship may be stored in the data table. Alternatively, the mapping between the input data and the output data may be characterized by a data analyzer. Correspondingly, the method may further include: training the training data to obtain a data analyzer; the data analyzer may be used to characterize a mapping relationship between input data and output data.
In an alternative embodiment of the present application, the mathematical model may be trained based on training data to obtain the data analyzer.
The training data in the embodiment of the present invention may include: and the dialogue data can be dialogue data obtained in a voice dialogue scene or a video dialogue scene so as to improve the matching degree between the training data and the voice dialogue scene or the video dialogue scene.
In the embodiment of the present invention, optionally, the dialogue data may be distinguished according to the field. The field specifically refers to a specific range. The embodiment of the invention can obtain different fields according to different application scenes of the conversation. For example, domains may include: interview field, business communication field, social field, speech practice field, etc.
In this embodiment of the present invention, optionally, the dialog data may include: the dialog data corresponding to the first user, of course, may include: the dialog data corresponding to the users other than the first user.
A mathematical model is a scientific or engineering model constructed using mathematical logic and mathematical language: a mathematical structure that expresses, exactly or approximately, the characteristics of a target system or the quantitative dependencies among its variables, described by means of mathematical symbols. A mathematical model may be one equation or a set of algebraic, differential, integral, or statistical equations, or a combination of these, through which the interrelationships or causal relationships among the system variables are described quantitatively or qualitatively. Besides models described by equations, there are models described by other mathematical tools, such as algebra, geometry, topology, and mathematical logic. A mathematical model describes the behavior and characteristics of a system rather than its actual structure. The mathematical model may be trained by machine learning or deep learning methods; machine learning methods may include linear regression, decision trees, random forests, and the like, and deep learning methods may include Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), Gated Recurrent Units (GRU), and the like.
Assuming the unit of the intensity feature is decibels, preset threshold ranges corresponding to emotion information can be defined over decibel values. For example, a person's speaking volume is usually around 50 decibels, so the preset threshold ranges may be set to 0-40 decibels, 41-60 decibels, and 61-80 decibels, where 0-40 decibels corresponds to emotion information such as "low", "tense", or "depressed", 41-60 decibels corresponds to "neutral", and 61-80 decibels corresponds to "excited", "angry", or "irritable".
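The table-style mapping described above can be sketched as follows, using the illustrative decibel ranges from this example; in practice the ranges would need calibration per device and per user:

    DB_EMOTION_TABLE = [
        ((0, 40),  ["low", "tense", "depressed"]),
        ((41, 60), ["neutral"]),
        ((61, 80), ["excited", "angry", "irritable"]),
    ]

    def emotion_from_intensity(db_level):
        for (low, high), emotions in DB_EMOTION_TABLE:
            if low <= db_level <= high:
                return emotions
        return ["unknown"]

    print(emotion_from_intensity(35))   # ['low', 'tense', 'depressed']
    print(emotion_from_intensity(72))   # ['excited', 'angry', 'irritable']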
Somatosensory, or somatic sensation, is a general term for touch, pressure, temperature, pain and proprioception (sensations related to muscle and joint position and movement, body posture and movement, and facial expression). The somatosensory data of the user may include: at least one of body temperature data, pulse data, image data, and limb data.
In the embodiment of the invention, the somatosensory data can be acquired according to an earphone device. Optionally, the earphone device may be externally or internally provided with a sensor, so as to collect somatosensory data of the user through the sensor.
For example, a motion sensor is provided inside the earphone device to collect motion information of the head of the user. Examples of the action information may include: shaking head, nodding head, etc.
As another example, an image sensor is provided inside the earphone device to obtain the user's expression information from images. Examples of the expression information may include: smiling, frowning, and pouting.
The embodiment of the invention can determine the emotion information of the user by utilizing the mapping relation between the somatosensory data and the emotion information.
Taking body temperature data as an example of somatosensory data: changes in emotion drive changes in body temperature as part of the body's regulation, so the user's emotion information can be obtained by monitoring the user's body temperature data. For example, a person's normal body temperature ranges from 36 ℃ to 37 ℃; when the user is excited or angry the body temperature rises correspondingly, and when the user is low or depressed it falls correspondingly. The preset threshold ranges can therefore be set to 35.5-35.9 ℃, 36-37 ℃, and 37.1-37.5 ℃, where 35.5-35.9 ℃ corresponds to emotion information such as "low", "tense", or "depressed", 36-37 ℃ corresponds to "neutral", and 37.1-37.5 ℃ corresponds to "agitated", "angry", or "irritable".
Taking body sensing data as an example of pulse data, because pulse beat and human emotion are closely related, when a human is excited or angry, the frequency of pulse is accelerated due to the change of heart, when the human is in a sleep state or is in a steady emotion state, the pulse is basically in a slow and rhythmic beat state, and the like, so that the emotion information of the user can be obtained by monitoring the pulse data of the user. For example, the preset threshold range may be set to include 50-60 times/min, 61-100 times/min and 101-130 times/min, wherein the mood information corresponding to the preset threshold range of 50-60 times/min is "mood down" or "depression"; the corresponding emotion information of the preset threshold range of 61-100 times/minute is neutral; the emotional state information corresponding to the preset threshold range of 101-130 times/minute is "excited" or "tensed" and the like.
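A minimal sketch combining the body-temperature and pulse examples above into a single threshold lookup; the ranges are the illustrative figures from the text, not clinically validated values:

    def emotion_from_body_temp(temp_c):
        if 35.5 <= temp_c <= 35.9:
            return "low / depressed"
        if 36.0 <= temp_c <= 37.0:
            return "neutral"
        if 37.1 <= temp_c <= 37.5:
            return "agitated / angry"
        return "out of range"

    def emotion_from_pulse(bpm):
        if 50 <= bpm <= 60:
            return "low / depressed"
        if 61 <= bpm <= 100:
            return "neutral"
        if 101 <= bpm <= 130:
            return "excited / tense"
        return "out of range"

    print(emotion_from_body_temp(37.3), emotion_from_pulse(112))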
In the embodiment of the invention, under the condition that the emotion information of the first user is the preset emotion information, the corresponding emotion prompt information can be obtained to prompt the user to adjust the emotion to obtain a better emotion state.
The preset emotion information may be obtained according to historical voice data of the first user, and the historical voice data may be voice data before the call.
In an alternative embodiment of the present invention, the user may designate certain historical voice data, which may include voice data whose emotion the user is dissatisfied with; the embodiment of the invention can then take the emotion information the user is dissatisfied with, determined from the designated historical voice data, as the preset emotion information. For example, historical voice data associated with emotion information the user is dissatisfied with may be received and analyzed to obtain the corresponding preset emotion information.
In another optional embodiment of the present invention, after a conversation is finished, evaluation information of a user for the conversation may be received, and target historical voice data may be obtained from historical voice data corresponding to the conversation according to the evaluation information, so as to obtain preset emotion information according to emotion information corresponding to the target historical voice data. For example, the evaluation information may include: "satisfactory" or "unsatisfactory", etc., the historical speech data corresponding to the evaluation information of "unsatisfactory" may be set as the target historical speech data. It can be understood that the specific preset emotion information and the corresponding determination manner thereof are not limited in the embodiments of the present invention.
The prompt message of the embodiment of the invention can comprise at least one of the following messages:
prompt information 1: evaluation information of a second user for a first user;
prompt information 2: trust information of the second user;
prompt information 3: emotion prompt information;
prompt information 4: rhythm prompt information;
prompt information 5: dialogue quality information;
prompt information 6: dialogue atmosphere information.
For the prompt information 1, the rating information of the second user for the first user may characterize the satisfaction degree of the second user for the first user. The evaluation information can meet the information requirements of the user in conversation scenes such as interview scenes, business communication scenes and the like, and can facilitate the planning of the transactions related to the conversation scenes by the user.
In an optional embodiment of the present invention, the method may further include: determining evaluation information of a second user aiming at the first user according to semantic information corresponding to second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
Optionally, in this embodiment of the present invention, the evaluation information may include: neutral evaluation information, positive evaluation information, negative evaluation information, and the like.
According to one embodiment, the evaluation information corresponding to the semantic information can be determined by using the mapping relation between the semantic information and the evaluation information.
Optionally, the semantic information may include keywords. For example, if the semantic information includes positive keywords such as "good", "nice", or "satisfied", positive evaluation information can be obtained. As another example, if the semantic information includes negative keywords such as "sorry" or "regret", negative evaluation information can be obtained.
Alternatively, the evaluation information may be determined from the degree of matching between the second semantic information corresponding to the second voice data and the first semantic information corresponding to the first voice data. For example, if the matching degree is below a matching degree threshold, the evaluation information may be negative evaluation information; if the matching degree is above the threshold, the evaluation information may be positive evaluation information.
According to another embodiment, the evaluation information corresponding to the emotion information can be determined by using the mapping relation between the emotion information and the evaluation information. Generally, negative emotion information corresponds to negative evaluation information, positive emotion information corresponds to positive evaluation information, and neutral emotion information corresponds to neutral evaluation information.
It should be noted that the semantic information corresponding to the second user may be at least one type of semantic information, and the emotion information of the second user may be at least one type of emotion information. Or, the embodiment of the invention can comprehensively utilize various semantic information or various emotion information to obtain corresponding evaluation information; for example, a plurality of semantic information or a plurality of emotion information may be fused, and corresponding evaluation information may be obtained according to the obtained fusion result.
The embodiment of the invention can comprehensively utilize the second semantic information and the emotion information of the second user to determine the evaluation information of the second user for the first user. Specifically, semantic evaluation information may be obtained using the second semantic information, emotion evaluation information may be obtained using emotion information of the second user, and evaluation information may be obtained according to the semantic evaluation information and the emotion evaluation information.
According to one embodiment, if the semantic evaluation information and the emotion evaluation information are matched, the semantic evaluation information and the emotion evaluation information are fused to obtain evaluation information.
According to another embodiment, if the semantic rating information and the emotional rating information do not match, the rating information is obtained according to the emotional rating information. Because the emotion authenticity is usually higher than the language authenticity, the embodiment of the invention can consider that the emotion evaluation information has higher priority than the semantic evaluation information, and abandons the semantic evaluation information and retains the emotion evaluation information under the condition that the emotion evaluation information and the semantic evaluation information are not matched.
For example, in an interview scenario, the interviewer asks questions and the job seeker answers them; if, after hearing the job seeker's answer to a certain question, the interviewer says something containing "good" but the corresponding head movement is "shaking the head", the interviewer's evaluation information in this case can be considered negative evaluation information.
For another example, in a business communication or social scenario, if the voice sent by the second user includes "let's keep in touch" or "I will call you in a few days" but the corresponding head movement is "shaking the head" and the facial expression is "frowning" or "pouting", the evaluation information of the second user in this case may be considered negative evaluation information.
It should be noted that, in addition to the evaluation information of the second user for the first user, the prompt information may further include reason information corresponding to the evaluation information, and the reason information may include the second user's voice information and the corresponding emotion information. For example, in an interview scenario, the evaluation information is negative evaluation information and the reason information includes: the interviewer made a "shaking the head" gesture while saying "good". For another example, in a business communication scenario, the evaluation information is positive evaluation information and the reason information includes: the other party smiled throughout the conversation.
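A minimal sketch of combining semantic evaluation and emotion evaluation in the way described above, where the emotion evaluation wins when the two disagree; the keyword lists and labels are assumptions for illustration only:

    POSITIVE_KEYWORDS = {"good", "nice", "satisfied"}
    NEGATIVE_KEYWORDS = {"sorry", "regret", "unfortunately"}

    def semantic_evaluation(keywords):
        if keywords & NEGATIVE_KEYWORDS:
            return "negative"
        if keywords & POSITIVE_KEYWORDS:
            return "positive"
        return "neutral"

    def emotion_evaluation(emotion):
        return {"positive": "positive", "negative": "negative"}.get(emotion, "neutral")

    def combined_evaluation(keywords, emotion):
        sem, emo = semantic_evaluation(keywords), emotion_evaluation(emotion)
        return sem if sem == emo else emo   # emotion outranks semantics on mismatch

    # The interviewer says "good" but shakes their head (negative emotion inferred):
    print(combined_evaluation({"good"}, "negative"))   # -> negative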
For the reminder information 2, the trust information of the second user may characterize the trustworthiness of the second user. The trust information may include: trusted, untrusted, neutral, and the like.
In the embodiment of the present invention, optionally, the trust information of the second user may be determined according to the emotion information of the second user and the mapping relationship between the emotion information and the trust information.
The embodiment of the invention can establish the mapping relation between the emotion information and the trust information in advance. For example, positive emotions correspond to credibility, negative emotions correspond to unreliability, neutral emotions correspond to neutrality, and the like. The mapping relationship between the emotion information and the trust information may be set by a user, or the dialogue corpus may be analyzed to obtain the mapping relationship between the emotion information and the trust information.
In an application example of the invention, in an interview scenario, while the job seeker answers a question posed by the interviewer, the job seeker's emotion changes, for example from first emotion information to second emotion information, where the second emotion information is tension or irritability; the probability that the job seeker is lying is then relatively high, so the trust information of the job seeker can be considered untrusted. The somatosensory data corresponding to tension may include: a facial expression such as a smile, tightly clenched hands, touching a body part, and the like.
It should be noted that, besides the trust information, the prompt information may also include reason information corresponding to the trust information, which may include the second user's voice information and its corresponding emotion information; for example, the job seeker suddenly became nervous and irritable when answering the question "xxx".
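A minimal sketch of deriving trust information from an emotion change while a question is being answered, as in the interview example above; the rule and the emotion labels are illustrative assumptions:

    def trust_from_emotion_change(emotion_before, emotion_after):
        suspicious = {"tense", "irritable", "anxious"}
        if emotion_after in suspicious and emotion_before not in suspicious:
            return "untrusted"
        if emotion_after in {"calm", "confident", "relaxed"}:
            return "trusted"
        return "neutral"

    print(trust_from_emotion_change("calm", "tense"))   # -> untrusted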
For the prompt information 3, the corresponding emotion prompt information may be obtained to prompt the first user to adjust the emotion under the condition that the emotion information of the first user is the preset emotion information.
For cue information 4, the tempo cue information may include: completion information of the subject in the course of the conversation.
Optionally, if the voice data of any participant is not matched with the theme corresponding to the voice data, and/or the duration of the theme corresponding to the voice data exceeds a first preset duration, obtaining corresponding rhythm prompt information.
If the voice data of any participant does not match the corresponding topic, the progress of that topic is affected, so corresponding rhythm prompt information can be output. For example, in an interview scenario, if the voice data for a certain topic does not match that topic, the interviewer may be prompted to change the topic. For another example, if the job seeker's voice data for a certain topic does not match the topic, the job seeker may be prompted to answer the question from a different angle.
If the duration of the topic corresponding to the voice data exceeds the first preset duration, the user can be prompted to speed up the conversation. For example, in an interview scene, the interviewer sets a first preset duration for a topic; when the topic lasts longer than the first preset duration, the interviewer can be prompted to speed up.
Optionally, the rhythm prompt information may include: speech rate information of the first user, etc. For example, a speech rate threshold may be set by the first user, and if the speech rate information of the first user does not match the speech rate threshold, corresponding rhythm prompt information may be provided.
Optionally, the rhythm prompt information may include language continuity information of the first user, and the like. For example, the first user may stumble over words within a sentence, or pause for a long time between one sentence and the next; in such cases the corresponding problem can be pointed out and an encouraging message, such as "keep going", can be provided.
In this embodiment of the present invention, optionally, the data analyzer may be used to determine language continuity information of the first user, and the input data of the data analyzer may be: the speech data of the first user, the output data of the data analyzer may be language continuity information of the first user.
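A minimal sketch of generating rhythm prompt information from a user-set speech-rate threshold and a per-topic time budget, as described above; the numbers and prompt wording are hypothetical:

    def rhythm_prompts(words_per_minute, rate_limit, topic_elapsed_s, topic_budget_s):
        prompts = []
        if words_per_minute > rate_limit:
            prompts.append("Your speech rate is above the set threshold; consider slowing down.")
        if topic_elapsed_s > topic_budget_s:
            prompts.append("The current topic has exceeded its planned duration; consider moving on.")
        return prompts

    print(rhythm_prompts(words_per_minute=190, rate_limit=160,
                         topic_elapsed_s=420, topic_budget_s=300))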
For the prompt 5, the dialog quality information may be used to characterize the quality of the dialog. The quality of the conversation can help the user to accumulate conversation experience and overcome conversation problems, and further the conversation quality of subsequent conversations is improved.
Optionally, the dialog quality information may include at least one of the following information:
the voice data contains completion proportion information of a theme; for example, the ratio between the number of completed topics and the number of all topics;
the voice data contains completion time information of a topic; for example, completion time information for one or more topics;
voice quality information; and
logical information of the voice data.
The voice quality information may include speech rate information, language continuity information, and the like. For example: "you stumbled at xx minutes xx seconds, and your speech was fluent the rest of the time"; or "your speech rate was too fast in time interval 1, too slow in time interval 3, and normal in the other intervals", and so on.
The logical information of the voice data may characterize the relevance or coherence of the voice data during the conversation. The logical information may include: the correlation between sentences, the correlation between the first voice data and the second voice data, the correlation between topics, and the like. For example, if the interviewer jumps directly to the second topic without any transition after finishing the first topic, the logic may be considered poor. For another example, if there is no correlation between consecutive sentences in the job seeker's answer, the logic may be considered poor.
In practical applications, the logical information may be determined using a data analyzer. It can be understood that the embodiment of the present invention does not limit the specific determination manner corresponding to the logical information.
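A minimal sketch assembling dialogue quality information of the kinds listed above (topic completion ratio, per-topic completion time, a simple fluency note); the record structure and field names are assumptions rather than the patent's own format:

    from dataclasses import dataclass

    @dataclass
    class TopicRecord:
        name: str
        completed: bool
        duration_s: float

    def dialog_quality(topics, stall_count):
        done = [t for t in topics if t.completed]
        return {
            "completion_ratio": len(done) / len(topics) if topics else 0.0,
            "completion_times": {t.name: t.duration_s for t in done},
            "voice_quality": "fluent" if stall_count == 0 else f"{stall_count} stall(s) detected",
        }

    records = [TopicRecord("project experience", True, 310.0),
               TopicRecord("salary expectation", False, 0.0)]
    print(dialog_quality(records, stall_count=1))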
As for prompt information 6, optionally, the dialogue atmosphere information may be determined according to the emotion information corresponding to the first user and the second user, respectively. For example, if both the first user and the second user correspond to positive emotion information, the dialogue atmosphere information may be considered "pleasant" or "harmonious". For another example, if any participant corresponds to negative emotion information, the dialogue atmosphere information may be considered "not harmonious".
In step 203, the outputting the prompt information specifically includes:
during and/or after the conversation is finished, the prompt information is sent to the earphone accommodating device corresponding to the earphone device, so that the earphone accommodating device sends the prompt information to the earphone device; and/or
And sending the prompt information to a terminal corresponding to the earphone device during the conversation and/or after the conversation is finished.
The prompt information may also be sent to the terminal so that the terminal can play or display it. It should be noted that the prompt information in the embodiment of the present invention may be specific to the first user and not sent to the second user; of course, prompt information for the second user may also be obtained and sent to the second user.
In an optional embodiment of the present invention, event keywords may be extracted from the voice data, the event content of a reminder may be established according to the event keywords, and corresponding memo information may be created. For example, the events may include: "parking position", "invoice title", and the like.
The embodiment of the invention can support searching the memo information. For example, "Where did I park my car?" can be used to search for "parking position"; "What is the invoice title?" can be used to search for "invoice title"; and so on.
In an alternative embodiment of the present invention, an event keyword and a trigger condition keyword associated with the event keyword may be extracted from voice data; and establishing a triggering condition of the reminding according to the triggering condition keyword, establishing event content of the reminding according to the event keyword, and establishing a corresponding reminding so as to improve the accuracy and intelligence of the reminding.
For example, for voice data including "I will call you back later", the event content of the resulting reminder may be "call back" and the trigger condition may be "later".
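A minimal sketch of extracting an event keyword and an associated trigger-condition keyword from a transcript line and building a reminder record; the pattern rules are crude illustrative assumptions, and a production system would rely on proper semantic parsing:

    import re

    TRIGGER_PATTERNS = {"later": r"\blater\b", "tomorrow": r"\btomorrow\b"}
    EVENT_PATTERNS = {"call back": r"\b(call|phone)\s+you\s+back\b",
                      "send invoice": r"\binvoice\b"}

    def build_reminder(utterance):
        event = next((name for name, pat in EVENT_PATTERNS.items()
                      if re.search(pat, utterance, re.I)), None)
        trigger = next((name for name, pat in TRIGGER_PATTERNS.items()
                        if re.search(pat, utterance, re.I)), None)
        if event is None:
            return None
        return {"event": event, "trigger": trigger or "unspecified"}

    print(build_reminder("I will call you back later"))   # {'event': 'call back', 'trigger': 'later'}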
In an alternative embodiment of the present invention, a voice search may be performed on the voice data upon a search request. The dimensions of the voice search may include, but are not limited to, at least one of the following: conversation identity, conversation time, conversation location, conversation keyword, and the like. For example, a search request such as "the recording about the xx event discussed with Manager Wang last week" may include: the conversation time "last week", the conversation identity "Manager Wang", and the conversation keyword "xx event". The voice data may be subjected to semantic analysis to obtain the conversation keywords; it is understood that the embodiment of the present invention does not limit the specific way of determining the conversation keywords.
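A minimal sketch of filtering stored conversation records by the search dimensions listed above (conversation identity, time, and keyword); the record fields and names are assumptions introduced for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class ConversationRecord:
        speaker: str
        day: str                                  # e.g. "2020-06-01"
        keywords: set = field(default_factory=set)

    def search(records, speaker=None, day=None, keyword=None):
        hits = []
        for r in records:
            if speaker and r.speaker != speaker:
                continue
            if day and r.day != day:
                continue
            if keyword and keyword not in r.keywords:
                continue
            hits.append(r)
        return hits

    records = [ConversationRecord("Manager Wang", "2020-06-01", {"xx event"})]
    print(search(records, speaker="Manager Wang", keyword="xx event"))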
In summary, according to the voice processing method of the embodiment of the present invention, the earphone device may output the prompt information during the conversation process and/or after the conversation is finished. The prompt information can be obtained according to semantic information and/or emotion information corresponding to the voice data, so that the prompt information can prompt a user about problems in a conversation process, the user can improve the problems in the conversation process in time, and the conversation quality of the conversation can be improved; or, the user can improve the problem after the conversation is finished so as to improve the conversation quality of the subsequent conversation.
In addition, the prompt information of the embodiment of the present invention may present information related to the conversation, such as the evaluation information of the second user for the first user, the trust information of the second user, or the dialog quality information, so that the user can understand the state of the conversation and make better decisions on matters related to the conversation.
For example, in an interview scenario, if the first user is a job seeker and the second user is an interviewer, the job seeker can obtain the interviewer's evaluation information for the job seeker, so that the job seeker can learn more about the interview and better judge the probability of a successful interview. If the first user is an interviewer and the second user is a job seeker, the trust information of the job seeker enables the interviewer to assess the credibility of the job seeker, and thus helps the interviewer evaluate the job seeker more accurately.
For another example, in an interview scene, the dialogue quality information can help the interviewer to know the interview quality and accumulate interview experience so as to improve the interview quality of subsequent interviews.
For another example, in a speech practice scenario, the information related to the conversation can help the user identify shortcomings in the speech, such as the speech rate in time interval 1 being too fast or the delivery in time interval 2 being halting, so as to improve the quality of subsequent speech practice.
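As an assumed way of producing such feedback (the 160 words-per-minute threshold and the timestamped-word input format are illustrative only), a too-fast interval can be flagged by computing words per minute over fixed windows of the recognized transcript:

```python
# Illustrative sketch: flag time windows whose speaking rate exceeds a threshold.
def flag_fast_intervals(words, window_s=60, max_wpm=160):
    """words: list of (timestamp_seconds, word) pairs from speech recognition."""
    if not words:
        return []
    end = max(t for t, _ in words)
    flagged = []
    start = 0.0
    while start <= end:
        count = sum(1 for t, _ in words if start <= t < start + window_s)
        wpm = count * 60.0 / window_s
        if wpm > max_wpm:
            flagged.append((start, start + window_s, wpm))
        start += window_s
    return flagged
```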
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 3, a block diagram of a voice processing apparatus according to an embodiment of the present invention is shown, and the apparatus is applied to a server, and specifically may include:
a receiving module 301, configured to receive voice data of a conversation collected by an earphone device; the participants of the above-described conversation may include: at least two talking users;
a determining module 302, configured to determine a prompt message corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
a sending module 303, configured to send the prompt information to the earphone device during a conversation and/or after the conversation is ended, so that the earphone device outputs the prompt information.
Optionally, the emotion information is obtained according to a voice feature corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user, and the somatosensory data is acquired according to the earphone device.
Optionally, the voice feature may include at least one of the following features: mood features, rhythm features, and intensity features.
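As a hedged illustration of the mood, rhythm, and intensity features mentioned above (the use of librosa and the specific feature choices are assumptions, not the feature extractor of the embodiment), pitch, energy, and onset-rate statistics can be computed from an audio file; mapping such statistics to emotion labels would still require a separate classifier, which is omitted here:

```python
import librosa
import numpy as np

def prosodic_features(path):
    """Rough mood/rhythm/intensity cues from an audio file (illustrative only)."""
    y, sr = librosa.load(path, sr=None)
    # Intensity cue: root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)[0]
    # Mood/tone cue: fundamental frequency statistics (pyin pitch tracker).
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    # Rhythm cue: rate of acoustic onsets per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = len(y) / sr
    return {
        "mean_intensity": float(np.mean(rms)),
        "mean_pitch_hz": float(np.nanmean(f0)),
        "onsets_per_second": len(onsets) / duration if duration else 0.0,
    }
```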
Optionally, the prompt message may include at least one of the following messages:
the evaluation information of the second user aiming at the first user;
trust information of the second user;
mood alert information;
rhythm prompt information;
dialog quality information;
dialogue atmosphere information.
Optionally, the apparatus may further include:
the evaluation information determining module is used for determining the evaluation information of the second user aiming at the first user according to the semantic information corresponding to the second voice data in the voice data and/or the emotion information of the second user; the second voice data corresponds to the second user.
Optionally, the apparatus may further include:
and the trust information determining module is used for determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
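A minimal sketch of the mapping-based trust determination follows; the emotion labels, trust levels, and the majority-vote aggregation rule are assumptions made for illustration:

```python
# Assumed mapping between emotion labels and trust information.
EMOTION_TO_TRUST = {
    "calm": "high trust",
    "hesitant": "medium trust",
    "nervous": "low trust",
}

def trust_info(emotions):
    """emotions: list of emotion labels detected for the second user."""
    levels = [EMOTION_TO_TRUST.get(e, "unknown") for e in emotions]
    # Simple aggregation: report the most frequent trust level (assumption).
    return max(set(levels), key=levels.count) if levels else "unknown"

print(trust_info(["calm", "calm", "nervous"]))
```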
Optionally, the apparatus may further include:
and the emotion prompt information determining module is used for obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
Optionally, the apparatus may further include:
and the rhythm prompt information determining module is used for obtaining corresponding rhythm prompt information if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration.
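A sketch of the two rhythm-prompt triggers is shown below; the keyword-overlap notion of "matching the topic" and the 10-minute first preset duration are illustrative assumptions:

```python
# Illustrative sketch of the rhythm-prompt conditions.
FIRST_PRESET_DURATION_S = 600  # assumed threshold: 10 minutes

def rhythm_prompt(utterance_keywords, topic_keywords, topic_duration_s):
    off_topic = not (set(utterance_keywords) & set(topic_keywords))
    too_long = topic_duration_s > FIRST_PRESET_DURATION_S
    if off_topic and too_long:
        return "Off topic and over time on this topic; consider moving on."
    if off_topic:
        return "The current remarks do not match the topic under discussion."
    if too_long:
        return "This topic has run longer than planned; consider the next one."
    return None
```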
Optionally, the dialog quality information may include at least one of the following information:
the voice data comprises completion proportion information of the theme;
the voice data includes completion time information of the topic;
voice quality information; and
logical information of the voice data.
Optionally, the apparatus may further include:
and the conversation atmosphere information determining module is used for determining the conversation atmosphere information according to the emotion information respectively corresponding to the first user and the second user.
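One possible way (an assumption, not the rule prescribed by the embodiment) to combine both users' emotion information into conversation atmosphere information is to assign each emotion label a valence score and average the scores of both sides:

```python
# Assumed valence scores per emotion label; the averaging rule is illustrative.
VALENCE = {"happy": 1.0, "calm": 0.5, "neutral": 0.0,
           "impatient": -0.5, "angry": -1.0}

def dialogue_atmosphere(first_user_emotions, second_user_emotions):
    scores = [VALENCE.get(e, 0.0)
              for e in first_user_emotions + second_user_emotions]
    avg = sum(scores) / len(scores) if scores else 0.0
    if avg > 0.3:
        return "relaxed atmosphere"
    if avg < -0.3:
        return "tense atmosphere"
    return "neutral atmosphere"

print(dialogue_atmosphere(["calm", "happy"], ["impatient"]))
```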
For the device embodiment, since it is substantially similar to the method embodiment, the description is relatively brief; for relevant details, reference may be made to the description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an apparatus 1300 for speech processing according to an example embodiment. For example, apparatus 1300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 4, the apparatus 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316.
The processing component 1302 generally controls overall operation of the device 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the method described above. Further, the processing component 1302 can include one or more modules that facilitate interaction between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operation at the device 1300. Examples of such data include instructions for any application or method operating on device 1300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1304 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply component 1306 provides power to the various components of device 1300. Power components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 1300.
The multimedia component 1308 includes a screen that provides an output interface between the device 1300 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1308 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 1300 is in an operational mode, such as a capture mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 1310 is configured to output and/or input audio signals. For example, audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when apparatus 1300 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may further be stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1314 includes one or more sensors for providing various aspects of state assessment for the device 1300. For example, the sensor assembly 1314 may detect an open/closed state of the device 1300 and the relative positioning of components, such as the display and keypad of the apparatus 1300; the sensor assembly 1314 may also detect a change in position of the apparatus 1300 or a component of the apparatus 1300, the presence or absence of user contact with the apparatus 1300, the orientation or acceleration/deceleration of the apparatus 1300, and a change in temperature of the apparatus 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate communications between the apparatus 1300 and other devices in a wired or wireless manner. The apparatus 1300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 also includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 1304 comprising instructions, executable by the processor 1320 of the apparatus 1300 to perform the method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of a terminal, enable the terminal to perform a method of speech processing, the method comprising: receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users; determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data; and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished, so that the earphone device outputs the prompt information.
Fig. 5 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a sequence of instructions operating on the server. Further, a central processor 1922 may be arranged to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The embodiment of the invention discloses A1, a voice processing method applied to a server side, the method comprising the following steps:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
A2, according to the method A1, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user, and the somatosensory data is acquired according to the earphone device.
A3, the method according to A2, the voice features including at least one of: mood features, rhythm features, and intensity features.
A4, the method according to A1, wherein the prompt message includes at least one of the following messages:
the evaluation information of the second user aiming at the first user;
trust information of the second user;
mood alert information;
rhythm prompt information;
dialog quality information;
dialogue atmosphere information.
A5, the method of A4, the method further comprising:
determining evaluation information of a second user aiming at the first user according to semantic information corresponding to second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
A6, the method of A4, the method further comprising:
and determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
A7, the method of A4, the method further comprising:
and obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
A8, the method of A4, the method further comprising:
and if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration, obtaining corresponding rhythm prompt information.
A9, the method according to A4, wherein the dialogue quality information includes at least one of the following information:
the voice data contains completion proportion information of a theme;
the voice data contains completion time information of a topic;
voice quality information; and
logical information of the voice data.
A10, the method of A4, the method further comprising:
and determining conversation atmosphere information according to the emotion information respectively corresponding to the first user and the second user.
The embodiment of the invention discloses B11, a voice processing device applied to a server side, the device comprising:
the receiving module is used for receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
the determining module is used for determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and the sending module is used for sending the prompt information to the earphone device in the conversation process and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
B12, according to the device of B11, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user, and the somatosensory data is acquired according to the earphone device.
B13, the apparatus according to B12, the speech features including at least one of the following: mood features, rhythm features, and intensity features.
B14, the device according to B11, the prompt message includes at least one of the following messages:
the evaluation information of the second user aiming at the first user;
trust information of the second user;
mood alert information;
rhythm prompt information;
dialog quality information;
dialogue atmosphere information.
B15, the apparatus of B14, the apparatus further comprising:
the evaluation information determining module is used for determining the evaluation information of the second user aiming at the first user according to the semantic information corresponding to the second voice data in the voice data and/or the emotion information of the second user; the second voice data corresponds to the second user.
B16, the apparatus of B14, the apparatus further comprising:
and the trust information determining module is used for determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
B17, the apparatus of B14, the apparatus further comprising:
and the emotion prompt information determining module is used for obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
B18, the apparatus of B14, the apparatus further comprising:
and the rhythm prompt information determining module is used for obtaining corresponding rhythm prompt information if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration.
B19, the apparatus according to B14, the dialogue quality information includes at least one of the following information:
the voice data contains completion proportion information of a theme;
the voice data contains completion time information of a topic;
voice quality information; and
logical information of the voice data.
B20, the apparatus of B14, the apparatus further comprising:
and the conversation atmosphere information determining module is used for determining the conversation atmosphere information according to the emotion information respectively corresponding to the first user and the second user.
The embodiment of the invention discloses C21, an apparatus for speech processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors comprise instructions for:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
C22, according to the device of C21, the emotion information is obtained according to the voice characteristics corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user, and the somatosensory data is acquired according to the earphone device.
C23, the apparatus according to C22, the speech features including at least one of: mood features, rhythm features, and intensity features.
C24, the device according to C21, the prompt message includes at least one of the following messages:
the evaluation information of the second user aiming at the first user;
trust information of the second user;
mood alert information;
rhythm prompt information;
dialog quality information;
dialogue atmosphere information.
C25, the device of C24, the device also configured to execute the one or more programs by one or more processors including instructions for:
determining evaluation information of a second user aiming at the first user according to semantic information corresponding to second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
C26, the device of C24, the device also configured to execute the one or more programs by one or more processors including instructions for:
and determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
C27, the device of C24, the device also configured to execute the one or more programs by one or more processors including instructions for:
and obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
C28, the device of C24, the device also configured to execute the one or more programs by one or more processors including instructions for:
and if the voice data of any participant is not matched with the theme corresponding to the voice data and/or the duration of the theme corresponding to the voice data exceeds a first preset duration, obtaining corresponding rhythm prompt information.
C29, the apparatus according to C24, the dialogue quality information comprising at least one of the following information:
the voice data contains completion proportion information of a theme;
the voice data contains completion time information of a topic;
voice quality information; and
logical information of the voice data.
C30, the device of C24, the device also configured to execute the one or more programs by one or more processors including instructions for:
and determining conversation atmosphere information according to the emotion information respectively corresponding to the first user and the second user.
Embodiments of the invention disclose D31, one or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a method as described in one or more of A1-A10.
The foregoing has described in detail a speech processing method, a speech processing apparatus and a speech processing apparatus provided by the present invention, and the present disclosure has applied specific examples to explain the principles and embodiments of the present invention, and the descriptions of the foregoing examples are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A voice processing method is applied to a server side, and the method comprises the following steps:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
2. The method according to claim 1, wherein the emotion information is obtained according to a voice feature corresponding to the voice data; and/or
The emotion information is obtained by analyzing the somatosensory data of the user, and the somatosensory data is acquired according to the earphone device.
3. The method of claim 2, wherein the speech features comprise at least one of: mood features, rhythm features, and intensity features.
4. The method of claim 1, wherein the prompt message comprises at least one of:
the evaluation information of the second user aiming at the first user;
trust information of the second user;
mood alert information;
rhythm prompt information;
dialog quality information;
dialogue atmosphere information.
5. The method of claim 4, further comprising:
determining evaluation information of a second user aiming at the first user according to semantic information corresponding to second voice data in the voice data and/or emotion information of the second user; the second voice data corresponds to the second user.
6. The method of claim 4, further comprising:
and determining the trust information of the second user according to the emotion information of the second user and the mapping relation between the emotion information and the trust information.
7. The method of claim 4, further comprising:
and obtaining corresponding emotion prompt information under the condition that the emotion information of the first user is preset emotion information.
8. A speech processing apparatus, applied to a server, the apparatus comprising:
the receiving module is used for receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
the determining module is used for determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and the sending module is used for sending the prompt information to the earphone device in the conversation process and/or after the conversation is finished so as to enable the earphone device to output the prompt information.
9. An apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
receiving voice data of a conversation collected by the earphone device; the participants of the conversation include: at least two talking users;
determining prompt information corresponding to the voice data; the prompt information is obtained according to semantic information and/or emotion information corresponding to the voice data;
and sending the prompt information to the earphone device during the conversation and/or after the conversation is finished so that the earphone device outputs the prompt information.
10. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-7.
CN202010507500.XA 2020-06-05 2020-06-05 Voice processing method, device and medium Active CN111696536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010507500.XA CN111696536B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010507500.XA CN111696536B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Publications (2)

Publication Number Publication Date
CN111696536A true CN111696536A (en) 2020-09-22
CN111696536B CN111696536B (en) 2023-10-27

Family

ID=72479613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010507500.XA Active CN111696536B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Country Status (1)

Country Link
CN (1) CN111696536B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331179A (en) * 2020-11-11 2021-02-05 北京搜狗科技发展有限公司 Data processing method and earphone accommodating device
CN113241077A (en) * 2021-06-09 2021-08-10 思必驰科技股份有限公司 Voice entry method and device for wearable device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103491251A (en) * 2013-09-24 2014-01-01 深圳市金立通信设备有限公司 Method and terminal for monitoring user calls
CN104091153A (en) * 2014-07-03 2014-10-08 苏州工业职业技术学院 Emotion judgment method applied to chatting robot
CN104616666A (en) * 2015-03-03 2015-05-13 广东小天才科技有限公司 Method and device for improving dialogue communication effect based on speech analysis
CN107301168A (en) * 2017-06-01 2017-10-27 深圳市朗空亿科科技有限公司 Intelligent robot and its mood exchange method, system
CN108734096A (en) * 2018-04-11 2018-11-02 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN109215683A (en) * 2018-08-10 2019-01-15 维沃移动通信有限公司 A kind of reminding method and terminal
CN110289000A (en) * 2019-05-27 2019-09-27 北京蓦然认知科技有限公司 A kind of audio recognition method, device
CN110389667A (en) * 2018-04-17 2019-10-29 北京搜狗科技发展有限公司 A kind of input method and device
US20190333514A1 (en) * 2018-08-06 2019-10-31 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for dialoguing based on a mood of a user
CN110880324A (en) * 2019-10-31 2020-03-13 北京大米科技有限公司 Voice data processing method and device, storage medium and electronic equipment
CN110895931A (en) * 2019-10-17 2020-03-20 苏州意能通信息技术有限公司 VR (virtual reality) interaction system and method based on voice recognition



Also Published As

Publication number Publication date
CN111696536B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN108363706B (en) Method and device for man-machine dialogue interaction
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN111696538B (en) Voice processing method, device and medium
US11468894B2 (en) System and method for personalizing dialogue based on user's appearances
US8144939B2 (en) Automatic identifying
WO2018018482A1 (en) Method and device for playing sound effects
CN107644646B (en) Voice processing method and device for voice processing
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN111954063B (en) Content display control method and device for video live broadcast room
US20180054688A1 (en) Personal Audio Lifestyle Analytics and Behavior Modification Feedback
CN110990534B (en) Data processing method and device for data processing
CN111696536B (en) Voice processing method, device and medium
CN108648754B (en) Voice control method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN111696537B (en) Voice processing method, device and medium
CN112988956A (en) Method and device for automatically generating conversation and method and device for detecting information recommendation effect
CN114356068B (en) Data processing method and device and electronic equipment
CN111696566B (en) Voice processing method, device and medium
CN114155849A (en) Virtual object processing method, device and medium
EP3288035B1 (en) Personal audio analytics and behavior modification feedback
CN113409766A (en) Recognition method, device for recognition and voice synthesis method
CN108364631B (en) Speech synthesis method and device
CN113674731A (en) Speech synthesis processing method, apparatus and medium
CN111739528A (en) Interaction method and device and earphone
CN113301352A (en) Automatic chat during video playback

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210706

Address after: 100084 Room 802, 8th floor, building 9, yard 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing Sogou Intelligent Technology Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

GR01 Patent grant
GR01 Patent grant