CN111696566B - Voice processing method, device and medium - Google Patents


Info

Publication number
CN111696566B
CN111696566B (application CN202010508206.0A)
Authority
CN
China
Prior art keywords
voice signal
voice
audio
processing
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010508206.0A
Other languages
Chinese (zh)
Other versions
CN111696566A (en)
Inventor
王颖
李健涛
张丹
刘宝
张硕
杨天府
梁宵
荣河江
李鹏翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Intelligent Technology Co Ltd
Original Assignee
Beijing Sogou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Intelligent Technology Co Ltd filed Critical Beijing Sogou Intelligent Technology Co Ltd
Priority to CN202010508206.0A priority Critical patent/CN111696566B/en
Publication of CN111696566A publication Critical patent/CN111696566A/en
Application granted granted Critical
Publication of CN111696566B publication Critical patent/CN111696566B/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise

Abstract

An embodiment of the present invention provides a voice processing method, a voice processing apparatus, and a device for voice processing. The method is applied to an earphone storage device and specifically comprises the following steps: receiving a first voice signal from an earphone device; determining, based on the first voice signal, a second voice signal corresponding to the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by a user; and transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal, or playing and/or displaying the second voice signal on the earphone storage device. Embodiments of the present invention can improve the clarity and quality of the voice signal, thereby helping users engage their listeners and enhance their confidence.

Description

Voice processing method, device and medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a speech processing method, a speech processing apparatus, and a machine-readable medium.
Background
As one of the most natural modes of communication, voice is widely used in scenarios such as voice conversation, voice social interaction, karaoke (KTV), live streaming, games, and video recording.
Currently, collected speech is typically used directly in these scenarios. For example, the collected voice is sent to the communication peer; as another example, the collected recording is carried in a video.
In practical applications, a user may be dissatisfied with the collected speech, in which case the user will want to beautify it. For example, some users wish to engage their listeners and enhance their confidence by beautifying their speech.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a speech processing method, a speech processing apparatus, and a device for speech processing that overcome, or at least partially solve, the foregoing problems. Embodiments of the present invention can improve the clarity and quality of speech signals, thereby helping users engage their listeners and enhance their confidence.
In order to solve the above problems, the present invention discloses a voice processing method applied to an earphone storage device, the method comprising:
receiving a first voice signal from an earphone device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by the user; and
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal, or playing and/or displaying the second voice signal on the earphone storage device.
In another aspect, an embodiment of the present invention discloses a voice processing apparatus applied to an earphone storage device, the apparatus comprising:
a receiving module, configured to receive a first voice signal from an earphone device;
a determining module, configured to determine a second voice signal corresponding to the first voice signal based on the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by the user; and
an output module, configured to send the second voice signal to the earphone device so that the earphone device outputs the second voice signal, or to play and/or display the second voice signal on the earphone storage device.
In yet another aspect, an embodiment of the present invention discloses a device for voice processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving a first voice signal from an earphone device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by the user; and
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal, or playing and/or displaying the second voice signal on the earphone storage device.
Embodiments of the present invention also disclose one or more machine-readable media storing instructions that, when executed by one or more processors, cause a device to perform the aforementioned method.
The embodiment of the invention has the following advantages:
The earphone storage device of the embodiment of the present invention can provide a second voice signal obtained by performing beautification processing on the first voice signal collected by the earphone device. Because the beautification processing filters user noise from the first voice signal, the clarity and quality of the voice signal can be improved, which in turn helps users engage their listeners and enhance their confidence.
The earphone storage device of the embodiment of the present invention can beautify the first voice signal in real time, so it can be applied to voice processing scenarios with high real-time requirements, such as voice conversation, karaoke, and live-streaming scenarios.
Drawings
FIG. 1 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a first embodiment of a speech processing method according to the present invention;
FIG. 3 is a flowchart illustrating steps of a second embodiment of a speech processing method according to the present invention;
FIG. 4 is a flowchart illustrating the steps of a third embodiment of a speech processing method of the present invention;
FIG. 5 is a block diagram of a speech processing apparatus of the present invention;
FIG. 6 is a block diagram of an apparatus 1300 for speech processing according to the present invention; and
Fig. 7 is a schematic structural diagram of a server according to the present invention.
Detailed Description
In order that the above objects, features, and advantages of the present invention may be more readily understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The embodiment of the present invention can be applied to voice processing scenarios, which may include: voice conversation, voice social interaction, karaoke, live streaming, games, video recording, etc.
The embodiment of the present invention provides a voice processing scheme that can be executed by an earphone storage device and specifically comprises: receiving a first voice signal from an earphone device; determining, based on the first voice signal, a second voice signal corresponding to the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by a user; and transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal, or playing and/or displaying the second voice signal on the earphone storage device.
The earphone storage device can provide a second voice signal obtained by performing beautification processing on the first voice signal collected by the earphone device. Because the beautification processing filters user noise from the first voice signal, the clarity and quality of the voice signal can be improved, helping users engage their listeners and enhance their confidence.
The earphone storage device of the embodiment of the present invention can beautify the first voice signal in real time, so it can be applied to voice processing scenarios with high real-time requirements, such as voice conversation, karaoke, and live-streaming scenarios. The voice conversation may include: a network-based voice call, an operator-based voice call, etc.
The earphone device of the embodiment of the present invention may be a headset, such as a Bluetooth earphone, a sports earphone, or a true wireless stereo (TWS, True Wireless Stereo) earphone, and may also be called an artificial intelligence (AI, Artificial Intelligence) earphone.
Optionally, the earphone device may comprise a plurality of microphone array elements, a processor, and a speaker.
The plurality of microphone array elements can pick up the first voice signal within a preset angle range. The processor is used for determining the second voice signal corresponding to the first voice signal; it can also interact with an external device to obtain a second voice signal processed by the external device. The speaker is used for playing sound, such as the second voice signal.
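The patent does not specify how the microphone array elements combine their signals to favor a preset angle range; a common technique for this is delay-and-sum beamforming. The numpy sketch below is illustrative only, and the integer steering delays are hypothetical values:

```python
import numpy as np

def delay_and_sum(frames: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Delay-and-sum beamformer over time-domain microphone frames.

    frames: (n_mics, n_samples) samples, one row per array element.
    delays: (n_mics,) integer steering delays in samples; hypothetical
            values chosen for the preset pickup direction.
    """
    n_mics = frames.shape[0]
    out = np.zeros(frames.shape[1])
    for channel, d in zip(frames, delays):
        # Advance each channel so the target-direction wavefront aligns.
        out += np.roll(channel, -int(d))
    return out / n_mics
```

With the channels steered toward the talker, coherent speech adds constructively while off-axis sound partially cancels.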
The external device may include: a terminal and/or an earphone storage device. Of course, the external device may also include a server.
Optionally, the terminal may include: smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart televisions, wearable devices, smart speakers, and the like. It will be appreciated that embodiments of the present invention are not limited to a particular terminal.
The earphone storage device can be used for storing the earphone device. Optionally, the earphone storage device is further used for supplying power to the earphone device.
According to one embodiment, a voice processing module may be disposed in the earphone storage device, the voice processing module being configured to beautify the first voice signal to obtain the second voice signal.
According to another embodiment, no voice processing module is disposed in the earphone storage device; instead, the voice processing operation is delegated to the server, and the server beautifies the first voice signal to obtain the second voice signal.
Optionally, a display screen may be disposed in the earphone storage device, the display screen being configured to display information related to the second voice signal. The related information may include at least one of: the waveform of the second voice signal, the text corresponding to the second voice signal, which beautification processes were performed on the first voice signal, the effect after beautification, and the like.
Alternatively, a sound playing device such as a speaker may be provided in the earphone storage device to play the second voice signal.
In the embodiment of the present invention, the connection between the earphone device and the external device may include: a physical connection, a Bluetooth connection, an infrared connection, a WIFI (Wireless Fidelity) connection, etc. It will be appreciated that embodiments of the present invention are not limited to a specific connection between the earphone device and the external device.
In the embodiment of the present invention, the external device may optionally interact with the server. For example, the external device can send the first voice signal collected by the earphone device to the server so that the server beautifies the first voice signal, and the external device can transmit the processed second voice signal back to the earphone device.
Referring to fig. 1, a schematic structural diagram of a speech processing system according to an embodiment of the present invention is shown, which specifically includes: an earphone device 101, an earphone storage device 102, a server 103, and a mobile terminal 104.
The earphone device 101 is connected to the earphone storage device 102 via Bluetooth, and the earphone device 101 is connected to the mobile terminal 104 via Bluetooth.
While using the mobile terminal 104, the user wears the earphone device 101 and can listen and speak through the earphone device 101.
The earphone storage device 102 has mobile networking and wireless networking capabilities and can interact with the server 103. For example, the earphone storage device 102 may receive the first voice signal collected by the earphone device and send the first voice signal to the server 103, and the earphone storage device 102 may transmit the second voice signal processed by the server 103 to the earphone device. Of course, the earphone storage device 102 may itself process the first voice signal to obtain the second voice signal.
In this embodiment of the present invention, optionally, a first processor and a second processor are respectively disposed on the two sides of the earphone device 101, where the first processor is used for data interaction with the earphone storage device 102, and the second processor is used for data interaction with the mobile terminal 104.
For example, while conducting a voice conversation or live streaming with the mobile terminal 104, the user may produce a first voice signal through the earphone device 101, and the earphone storage device 102 may determine a second voice signal corresponding to the first voice signal in real time and send the second voice signal to the communication peer. Because the communication peer receives a second voice signal with higher clarity and better quality, the user's confidence can be enhanced and listeners can be better engaged.
As another example, when the user sends a voice message through a social application such as WeChat on the mobile terminal 104, assuming the earphone device 101 collects a first voice message generated by the user, the earphone storage device 102 may process the first voice message to obtain a second voice message and send the second voice message to the communication peer.
In this embodiment of the present invention, optionally, the earphone device 101 may play the second voice signal, so that the user obtains a comparison experience of the first voice signal and the second voice signal.
In this embodiment of the present invention, optionally, the earphone device 101 may include a first side and a second side, where the first side is used for playing the first voice signal or the voice signal sent by the communication peer, and the second side is used for playing the second voice signal. Depending on the processing power of the earphone device, the delay between the first voice signal and the second voice signal may be on the order of milliseconds.
Method Embodiment 1
Referring to fig. 2, a flowchart illustrating the steps of a first embodiment of the voice processing method of the present invention is shown. The method is applied to an earphone storage device and may specifically include the following steps:
step 201, receiving a first voice signal from an earphone device;
step 202, determining a second voice signal corresponding to the first voice signal based on the first voice signal, wherein the second voice signal is obtained by performing beautification processing on the first voice signal, the beautification processing comprising filtering user noise from the first voice signal, the user noise characterizing noise generated by a user;
step 203, sending the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal on the earphone storage device.
In step 201, the earphone device may collect a first voice signal generated by the user using its microphone array elements and send the first voice signal to the earphone storage device.
In step 202, the earphone storage device may beautify the first voice signal to obtain the second voice signal; alternatively, the earphone storage device may send the first voice signal to an external device, such as a server or a terminal, so that the external device processes the first voice signal to obtain the second voice signal.
For example, in an alternative embodiment of the present invention, determining the second voice signal corresponding to the first voice signal specifically includes: transmitting the first voice signal to an external device; and receiving the second voice signal returned by the external device.
The beautification processing of the embodiment of the present invention filters user noise from the first voice signal, so the clarity and quality of the voice signal can be improved, helping users engage their listeners and enhance their confidence. In this way, the user's individual vocal style can be preserved while the voice is beautified.
In the embodiment of the present invention, optionally, the user noise specifically includes at least one of: a breathing sound, a cough, a voice tremor, and an accent. It can be appreciated that, according to practical application requirements, those skilled in the art may treat other sounds generated by the user, such as footsteps, as user noise; any noise generated by the user falls within the scope of user noise.
The embodiment of the present invention provides the following technical solutions for determining the second voice signal corresponding to the first voice signal:
technical scheme A1,
In technical solution A1, determining the second voice signal corresponding to the first voice signal includes: determining a preset voiceprint feature corresponding to user noise; and filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal to obtain the second voice signal.
The embodiment of the present invention can collect user noise samples in advance and extract their voiceprint features as preset voiceprint features. Optionally, the user noise samples may be classified, and corresponding preset voiceprint features may be determined for each of several user noise categories. In this way, during beautification, the sound signal corresponding to a preset voiceprint feature can be filtered from the first voice signal to obtain a voice signal that does not include user noise, that is, the second voice signal. Thus, the user's individual vocal style can be preserved while the voice is beautified.
The embodiment of the present invention does not limit the specific types of preset voiceprint features. For example, the types may include: Mel-frequency cepstral coefficients (MFCC, Mel-Frequency Cepstrum Coefficients), fundamental frequency parameters, filter banks (Fbank, Filter Banks), and the like.
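Technical solution A1 can be sketched as a frame-gating step. This is illustrative only, not the patent's implementation: the frame's magnitude spectrum stands in for a real voiceprint feature such as MFCC or Fbank, and frames too similar to a stored noise template are suppressed:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def filter_user_noise(frames: np.ndarray, noise_template: np.ndarray,
                      threshold: float = 0.9) -> np.ndarray:
    """Suppress frames matching a preset noise voiceprint (solution A1 sketch).

    frames: (n_frames, frame_len) time-domain frames.
    noise_template: feature vector extracted from a pre-collected
    user-noise sample; the magnitude spectrum is used here as a
    stand-in for a real voiceprint feature.
    """
    cleaned = frames.copy()
    for i, frame in enumerate(frames):
        feature = np.abs(np.fft.rfft(frame))
        if cosine_sim(feature, noise_template) >= threshold:
            cleaned[i] = 0.0  # frame matches the noise voiceprint: filter it out
    return cleaned
```

A production system would extract the same feature type (e.g. MFCC) for both the template and the incoming frames, and cross-fade rather than hard-zero the suppressed frames.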
Technical solution A2
In technical solution A2, determining the second voice signal corresponding to the first voice signal includes: determining a target voice category corresponding to the first voice signal; and processing the first voice signal according to a first voice parameter corresponding to the target voice category, so that a second voice parameter corresponding to the resulting second voice signal matches the first voice parameter.
The embodiment of the present invention can collect voice samples in advance and classify them. The voice samples may be screened to obtain higher-clarity voices that do not include user noise.
The voice categories may include: female voice, male voice, child voice, etc. The female voices may include: a magnetic female voice, a sweet and innocent female voice, etc.; the male voices may include: a magnetic male voice, a husky male voice, etc. It will be appreciated that those skilled in the art may classify the voice samples according to practical application requirements, and the embodiment of the present invention is not limited to specific voice categories.
During beautification, the embodiment of the present invention can first determine the target voice category corresponding to the first voice signal, that is, which voice category the first voice signal belongs to; then, the first voice parameter of the target voice category can be used as a reference for the first voice signal. For example, the first voice signal may be adjusted so that the second voice parameter corresponding to the adjusted second voice signal matches the first voice parameter. Because the second voice signal has voice parameters matched to the target voice category, and the voice samples of that category have been screened for higher-clarity voices without user noise, obtaining the second voice signal according to the first voice parameter of the target voice category can likewise improve the clarity and quality of the voice signal, helping users engage their listeners and enhance their confidence while preserving their individual vocal style.
In an alternative embodiment of the present invention, the target voice category may be specified by the user, so that the voice category preferred by the user is used in determining the second voice signal, improving the match between the second voice signal and the user's requirements.
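A minimal sketch of solution A2's parameter matching, assuming the reference parameter of the target voice category is a single RMS level; a real system would match richer parameters such as pitch, formants, or spectral tilt:

```python
import numpy as np

def match_voice_parameters(signal: np.ndarray, target_rms: float) -> np.ndarray:
    """Adjust the signal so its measured parameter matches the reference
    parameter of the target voice category (solution A2 sketch).

    Only RMS level is matched here for illustration.
    """
    rms = np.sqrt(np.mean(signal ** 2))
    if rms < 1e-12:
        return signal.copy()  # silent input: nothing to scale
    return signal * (target_rms / rms)
```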
In another alternative embodiment of the present invention, the beautification processing may further include sound effect processing; that is, the second speech signal may be subjected to sound effect processing, which can be used to enhance the sound effect of the speech signal.
The sound effect processing may include, but is not limited to, at least one of the following: surround processing, channel equalization processing, and reverberation processing. Surround processing can improve the spatial perception of the speech signal. Channel equalization processing can improve the magnetism and thickness of the speech signal, and thus its appeal. Reverberation processing can improve the fullness and roundness of the speech signal, and different reverberation processing can place the user in different spaces and venues. Examples of venues include: KTV, a recording studio, a concert hall, etc.; the embodiment of the present invention can provide the corresponding reverberation processing according to the venue specified by the user.
In one embodiment of the present invention, the second speech signal may be subjected to surround processing using the head-related transfer function (HRTF, Head-Related Transfer Function) technique. The HRTF technique can calculate the loudness and pitch of sound arriving from different directions or positions, thereby producing the effect of stereo spatial sound localization.
In one embodiment of the present invention, the channel equalization processing may determine a target frequency band corresponding to the second speech signal, and then adjust the frequency parameters of the second speech signal according to a preset frequency corresponding to the target frequency band. Assuming the target frequency band A is 20-60 Hz, sound around 20 Hz in band A can sound hollow, and sound around 60 Hz can suffer from low-frequency resonance. It will be appreciated that those skilled in the art may perform whatever channel equalization processing is required by the practical application, and the embodiment of the present invention does not limit the specific channel equalization procedure.
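A hedged sketch of channel equalization over one target band, implemented as an FFT-domain gain; the band edges and gain value are illustrative, not taken from the patent:

```python
import numpy as np

def equalize_band(signal: np.ndarray, sample_rate: int,
                  band: tuple, gain: float) -> np.ndarray:
    """Scale the magnitude of one frequency band in the FFT domain.

    band: (low_hz, high_hz); e.g. attenuating 20-60 Hz counters the
    hollow/resonant low end described for target band A.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    spectrum[in_band] *= gain  # boost (>1) or attenuate (<1) the band
    return np.fft.irfft(spectrum, n=len(signal))
```

A real equalizer would process the signal in overlapping windows and use smooth band edges to avoid audible artifacts.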
The embodiment of the present invention can generate reverberant sound in the following ways. One reverberation processing method convolves the voice signal with the unit impulse response of the space to be simulated, taking the result of the convolution as the output signal of the system. Another method generates reverberant sound with a simple cascade or nesting of comb filters and all-pass filters, using the characteristics of the filters to generate the reverberant signal. It will be appreciated that embodiments of the present invention are not limited to a particular manner of reverberation processing.
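The comb/all-pass cascade described above is essentially a Schroeder reverberator. The following sketch uses classic Schroeder-style delay lengths and gains, which are assumptions for illustration rather than values from the patent:

```python
import numpy as np

def comb(x: np.ndarray, delay: int, g: float) -> np.ndarray:
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x: np.ndarray, delay: int, g: float) -> np.ndarray:
    """All-pass filter: y[n] = -g*x[n] + x[n - delay] + g*y[n - delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x: np.ndarray,
                     comb_delays=(1116, 1188, 1277, 1356),
                     g: float = 0.8) -> np.ndarray:
    """Parallel comb filters feeding an all-pass filter, as in the
    filter-cascade reverberation described in the text."""
    wet = sum(comb(x, d, g) for d in comb_delays) / len(comb_delays)
    return allpass(wet, 225, 0.5)
```

Mutually prime comb delays spread the echoes so no single periodicity dominates, which is what gives the tail a diffuse, room-like character.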
Alternatively, the beautification processing of the embodiment of the present invention may be associated with a trigger condition: if the trigger condition is met, the beautification processing is performed; otherwise, it is not.
Alternatively, the trigger condition may be that the environment is a preset environment, where the preset environment characterizes an environment in which beautification processing is desired.
The preset environment may be determined by those skilled in the art according to practical application requirements; for example, it may include: an outdoor environment, the environment of a singing application (APP), an environment in which the background music of a song is detected, and the like. The environment may be determined by means of sound detection and/or image recognition.
For example, if user A is live streaming outdoors, the embodiment of the present invention may determine through sound detection and/or image recognition that user A is in an outdoor environment, and thus automatically trigger the beautification processing.
As another example, when user B opens the "singing bar" APP, the embodiment of the present invention detects the background music of a song, determines that the user is about to sing, and may automatically trigger the beautification processing.
Performing the beautification processing only when the environment is a preset environment, and skipping it otherwise, reduces the resources consumed by beautification.
Of course, the beautification processing of the embodiment of the present invention may alternatively be performed in any environment; the embodiment of the present invention does not limit whether the beautification processing has a trigger condition.
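The trigger logic above can be sketched as a simple dispatcher; the environment labels and the shape of the `beautify` callback are hypothetical:

```python
# Illustrative trigger environments; a real preset set would be configured.
PRESET_ENVIRONMENTS = {"outdoor", "singing_app", "song_background"}

def maybe_beautify(signal, environment, beautify):
    """Run the beautification pipeline only when the detected environment
    is one of the preset trigger environments; otherwise pass the signal
    through unchanged, saving processing resources."""
    if environment in PRESET_ENVIRONMENTS:
        return beautify(signal)
    return signal
```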
In step 203, the earphone storage device may send the second voice signal to the earphone device, so that the earphone device plays the second voice signal and the user can compare the first voice signal with the second voice signal.
The earphone storage device may also send the second voice signal to the terminal corresponding to the earphone device, so that the terminal can apply the second voice signal to a voice processing scenario. In one case, the terminal sends the second voice signal to the communication peer, which suits scenarios such as voice conversation, live streaming, voice social interaction, and games. In another case, in a video recording scenario, the terminal synthesizes the second voice signal with the recorded video picture.
The earphone storage device may also play and/or display the second voice signal so that the user obtains information related to the second voice signal.
In summary, according to the voice processing method of the embodiment of the present invention, the earphone storage device may provide the second voice signal after the beautifying processing for the first voice signal collected by the earphone device. The beautifying processing filters the user noise in the first voice signal, so that the definition and the quality of the voice signal can be improved, and further the purposes of playing audience and enhancing confidence can be achieved for the user.
The earphone storage device of the embodiment of the invention can beautify the first voice signal in real time, so that it can be applied to voice processing scenes with high real-time requirements, such as voice dialogue scenes, karaoke scenes, live broadcast scenes, and the like.
Method embodiment II
Referring to fig. 3, a flowchart illustrating steps of a second embodiment of a voice processing method of the present invention is shown. The method is applied to an earphone storage device and may specifically include the following steps:
step 301, receiving a first voice signal from a headset device;
step 302, determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by beautifying the first voice signal; the beautifying treatment specifically comprises the following steps: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user;
step 303, transmitting the second voice signal to the earphone device, so that the earphone device outputs the second voice signal; or, the earphone storage device plays and/or displays the second voice signal;
with respect to the first embodiment of the method shown in fig. 2, the method of this embodiment may further include:
step 304, determining a first audio corresponding to a first keyword in the first voice signal or the voice signal of the opposite communication terminal;
step 305, outputting the first audio.
The embodiment of the invention adopts a semantic analysis method to determine the first keyword in the first voice signal or in the voice signal of the opposite communication terminal, and automatically determines the first audio corresponding to the first keyword. In this way, an accompaniment effect of the first audio can be provided while the voice signal carrying the first keyword is played, which can further increase the interest of the voice processing process. The semantic analysis method may include: sentence component analysis methods, machine learning methods, and the like; it will be appreciated that embodiments of the present invention are not limited to a particular semantic analysis method.
For example, during a voice conversation, if either party speaks the first keyword "the thunder is so loud", the earphone storage device of either party may automatically acquire thunder audio and provide it to both parties of the conversation. As another example, during a voice conversation, if either party speaks the first keyword "I know the truth", audio of the classic Detective Conan line "there is only one truth" may be provided to both parties of the conversation. As another example, in a live broadcast scene, if the host says the first keyword "awkward", audio of "a big crow flies over" may be provided.
In the embodiment of the present invention, optionally, the first speech signal may be first converted into the first text by using a speech recognition technology, and then the first keyword is acquired from the first text. Of course, the semantic analysis technology may also be used to directly obtain the voice signal corresponding to the first keyword from the first voice signal.
In the embodiment of the present invention, optionally, a mapping relationship between the keywords and the audio may be stored, so that the mapping relationship may be searched according to the first keywords to obtain the first audio.
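The stored mapping relationship between keywords and audio can be sketched as a simple lookup table; the keyword strings and audio file names below are invented for illustration, not taken from the patent.

```python
# Assumed mapping relationship between keywords and audio resources.
KEYWORD_AUDIO_MAP = {
    "thunder": "thunder_clap.wav",
    "awkward": "crow_flying_over.wav",
}

def find_first_audio(first_keyword):
    # Search the mapping according to the first keyword to obtain the
    # first audio; return None when no audio is registered for it.
    return KEYWORD_AUDIO_MAP.get(first_keyword)
```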
In the embodiment of the present invention, optionally, a second keyword in the first audio is matched with the first keyword, where the second keyword is derived from a preset work.
The matching of the second keyword with the first keyword may include: identical characters, similar semantics, related semantics, and the like. The kinds of preset works may include: written works such as novels, poems, prose, papers, shorthand records, and number games; oral works such as lectures, speeches, and sermons; musical works with or without lyrics; dramatic or musical-drama works; works of fine art such as pantomime, dance, painting, calligraphy, prints, and sculpture; works of applied art; works of architectural art; photographic works; cinematographic works; and the like.
In step 305, the earphone storage device may output the first audio to the earphone device or to the terminal corresponding to the earphone device, so that the earphone device or the terminal processes the first audio. Alternatively, the earphone storage device may play and/or display the first audio.
In an alternative embodiment of the present invention, the first audio may be output according to an output operation on the first keyword. Outputting the first keyword may include: playing the first keyword in the voice signal, the user uttering the first keyword, and the like. The embodiment of the invention may play the first audio while the first keyword is being output or after the first keyword has been output, so as to achieve a matching effect between the first keyword and the first audio.
In summary, the voice processing method of the embodiment of the invention adopts a semantic analysis method to determine the first keyword in the first voice signal or in the voice signal of the opposite communication terminal, and automatically determines the first audio corresponding to the first keyword. In this way, an accompaniment effect of the first audio can be provided while the voice signal carrying the first keyword is played, which can further increase the interest of the voice processing process.
Method example III
Referring to fig. 4, a flowchart illustrating steps of a third embodiment of a voice processing method of the present invention is shown. The method is applied to an earphone storage device and may specifically include the following steps:
step 401, receiving a first voice signal from a headset device;
step 402, determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by beautifying the first voice signal; the beautifying treatment specifically comprises the following steps: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user;
step 403, transmitting the second voice signal to the earphone device, so that the earphone device outputs the second voice signal; or, the earphone storage device plays and/or displays the second voice signal;
with respect to the first embodiment of the method shown in fig. 2, the method of this embodiment may further include:
step 404, determining a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
step 405, outputting the second audio.
In the embodiment of the present invention, the language unit corresponding to the first text may be a sentence, a phrase, or the like. The embodiment of the invention may convert the first text into the second text, so as to optimize or enrich the expression corresponding to the first text and obtain a second text with better expressive power.
Alternatively, the language style of the second text may be: a humorous style, a lively style, a literary style, or the like, so as to increase the interest of voice processing. The language style may be specified by the user to meet the user's needs.
For example, if the first text is "I feel so sad", the corresponding second text may include: "baby's heart is bitter, but baby won't say", "baby is a little sad, feelings slightly hurt", "my heart aches, my chest feels tight, my eyes are dry, and I feel like crying", etc.
Optionally, the text subject of the first text is the same as the text subject of the second text. Optionally, according to the text theme of the first text, a search may be performed in a mapping relationship between the text theme, the language style and the text, so as to obtain the second text.
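The mapping relationship between text theme, language style, and text can be sketched as a keyed lookup; the themes, styles, and example texts below are assumptions for illustration, not the patent's data.

```python
# Assumed mapping from (text theme, language style) to a second text.
THEME_STYLE_TEXT = {
    ("sadness", "humorous"): "baby's heart is bitter, but baby won't say",
    ("sadness", "literary"): "a quiet sorrow settles over me",
}

def find_second_text(theme, style):
    # Search the mapping according to the text theme of the first text and
    # the (possibly user-specified) language style to obtain the second text.
    return THEME_STYLE_TEXT.get((theme, style))
```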
After the second text is obtained, the second text may be converted into the second audio using TTS (Text To Speech) technology, according to the timbre of the user or the timbre of a third party designated by the user. It will be appreciated that the desired second audio may be obtained in accordance with the speech synthesis parameters.
Alternatively, the speech synthesis parameters may include: at least one of a timbre parameter, a pitch parameter, and a loudness parameter.
The timbre parameter may refer to the distinguishing characteristics of different sounds in terms of waveform; generally, different sounding bodies correspond to different timbres. Therefore, according to the timbre parameter, second audio matching the timbre of a target sounding body may be obtained. The target sounding body may be the user, or may be designated by the user; for example, the target sounding body may be a designated media personality. In practical application, the timbre parameter of the target sounding body may be obtained from audio of the target sounding body of a preset length.
The pitch parameter may characterize the tone and is measured in terms of frequency. The loudness parameter, also known as sound intensity or volume, may refer to the strength of the sound and is measured in decibels (dB).
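As one common convention (an assumption, since the patent does not fix the exact measure), the loudness parameter in decibels can be computed as the RMS level of the samples relative to full scale:

```python
import math

def loudness_dbfs(samples):
    # Root-mean-square level of the samples, expressed in dB relative to
    # full scale (a sample magnitude of 1.0 gives 0 dBFS).
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")
```

A full-scale square wave measures 0 dBFS; halving the amplitude lowers the level by about 6 dB.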
It will be appreciated that the above determination of the second audio based on the relationship between the first text and the second text is merely an alternative embodiment. In fact, the embodiment of the invention may also determine the second audio according to the relationship between the first voice segment and the second voice segment. The second voice segment may have the same or similar semantics as the first voice segment but with an optimized expression, for example an expanded language style, such as providing multiple language styles for the user to select from.
In step 405, the earphone storage device may output the second audio to the earphone device or to the terminal corresponding to the earphone device, so that the earphone device or the terminal processes the second audio. Alternatively, the earphone storage device may play and/or display the second audio.
The second audio may function as a substitute for the third speech signal corresponding to the first text or the first speech segment. The third speech signal may correspond to part or all of the first speech signal.
In the case that the third voice signal corresponds to all of the first voice signal, the second audio may replace the entire first voice signal or second voice signal. In this case, the second audio may be used in a voice processing scene; for example, the second audio may be sent to the opposite communication terminal.
In the case that the third voice signal corresponds to part of the first voice signal, the remaining part of the first voice signal or the second voice signal may be combined with the second audio in order, according to the location of the first text or the first voice fragment, and the combined audio may be applied to the voice processing scene.
For example, suppose the first voice signal includes: text A, text B, and text C. If text A is converted into text A', the second audio corresponding to text A' and the voice signals corresponding to text B and text C may be combined in order. If text B is converted into text B', the voice signal corresponding to text A, the second audio corresponding to text B', and the voice signal corresponding to text C may be combined in order.
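The sequential combination in the text-A/text-B/text-C example can be sketched as follows, with strings standing in for audio clips and labels marking which segment was converted; the data layout is an illustrative assumption.

```python
def combine(segments, converted):
    # segments: ordered list of (label, audio) pairs making up the signal.
    # converted: mapping from a label to its second audio; a converted
    # label's clip is replaced in place, all other clips are kept in order.
    return [converted.get(label, audio) for label, audio in segments]
```

With segments A, B, C and only A converted, the result is A's second audio followed by the original B and C clips.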
According to the embodiment of the invention, the second audio can be played in the process of outputting the voice signal corresponding to the first text or after the voice signal corresponding to the first text is output, so that the user obtains the comparison experience of the first voice signal corresponding to the first text and the second audio.
In summary, according to the voice processing method of the embodiment of the invention, the first text in the first voice signal is converted into the second text, and the second audio corresponding to the second text is output; alternatively, the first voice segment in the first voice signal is converted into the second voice segment in the second audio. In a voice processing scene, an ordinary sentence originally uttered by the user can be converted into an interesting sentence, which can enhance the interest of the voice processing process. Alternatively, a word or phrase originally uttered by the user can be converted into an expression in a language style preferred by the user, so as to improve expression quality.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should appreciate that the embodiments of the present invention are not limited by the described order of actions, as some steps may be performed in another order or simultaneously in accordance with the embodiments of the present invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions involved are not necessarily required by the embodiments of the present invention.
Device embodiment
Referring to fig. 5, a block diagram illustrating a voice processing apparatus according to an embodiment of the present invention may specifically include:
a receiving module 501 for receiving a first voice signal from a headset device;
a determining module 502, configured to determine a second speech signal corresponding to the first speech signal based on the first speech signal; the second voice signal is obtained by beautifying the first voice signal; the beautifying process may include: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user;
an output module 503, configured to send the second voice signal to the earphone device, so that the earphone device outputs the second voice signal; or, the earphone storage device plays and/or displays the second voice signal.
Alternatively, the user noise may include: at least one of har, tremolo and accent.
Optionally, the determining module may include:
the first determining module is used for determining preset voiceprint features corresponding to user noise;
the first processing module is used for filtering the sound signal corresponding to the preset voiceprint characteristic from the first voice signal so as to obtain a second voice signal.
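A hedged sketch of the first processing module: frames whose feature vector matches the preset voiceprint feature of the user noise are filtered out of the first voice signal. Cosine similarity and the 0.9 threshold are illustrative choices, not the patent's stated algorithm.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_user_noise(frames, noise_print, threshold=0.9):
    # frames: list of (feature_vector, audio_chunk) pairs of the first
    # voice signal. Keep only chunks whose features do NOT match the
    # preset voiceprint feature of the user noise.
    return [chunk for feat, chunk in frames
            if cosine(feat, noise_print) < threshold]
```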
Optionally, the determining module may include:
the second determining module is used for determining a target voice category corresponding to the first voice signal;
the second processing module is used for processing the first voice signal according to a first voice parameter corresponding to the target voice category, and the second voice parameter corresponding to the obtained second voice signal is matched with the first voice parameter.
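A hedged sketch of the second processing module, simplified to a single voice parameter: the first voice signal is scaled so that its loudness (RMS) matches the first voice parameter of the target voice category. Restricting the match to loudness alone is an illustrative simplification.

```python
import math

def rms(samples):
    # Root-mean-square amplitude of the signal.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_loudness(samples, target_rms):
    # Apply a uniform gain so the second voice signal's loudness parameter
    # matches the first voice parameter of the target voice category.
    current = rms(samples)
    if current == 0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]
```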
Optionally, the beautifying process may further include: sound effect processing; the above-described sound effect processing may include at least one of the following: surround processing, channel equalization processing, and reverberation processing.
Optionally, the apparatus may further include:
a third determining module, configured to determine a first audio corresponding to a first keyword in the first voice signal or a voice signal of a communication opposite end;
the first audio output module is used for outputting the first audio; and matching a second keyword in the first audio with the first keyword, wherein the second keyword is derived from a preset work.
Optionally, the apparatus may further include:
a fourth determining module, configured to determine a second audio corresponding to the first speech signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
And the second audio output module is used for outputting the second audio.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 6 is a block diagram illustrating an apparatus 1300 for speech processing according to an example embodiment. For example, apparatus 1300 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 6, apparatus 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316.
The processing component 1302 generally controls overall operation of the apparatus 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1302 can include one or more modules that facilitate interactions between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operations at the device 1300. Examples of such data include instructions for any application or method operating on the apparatus 1300, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1304 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 1306 provides power to the various components of the device 1300. The power supply components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1300.
The multimedia component 1308 includes a screen that provides an output interface between the apparatus 1300 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1308 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 1300 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1310 is configured to output and/or input audio signals. For example, the audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1300 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may be further stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1314 includes one or more sensors for providing status assessments of various aspects of the apparatus 1300. For example, the sensor assembly 1314 may detect the on/off state of the apparatus 1300 and the relative positioning of components, such as the display and keypad of the apparatus 1300; the sensor assembly 1314 may also detect a change in position of the apparatus 1300 or of a component of the apparatus 1300, the presence or absence of user contact with the apparatus 1300, the orientation or acceleration/deceleration of the apparatus 1300, and a change in temperature of the apparatus 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate wired or wireless communication between the apparatus 1300 and other devices. The apparatus 1300 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 1304, including instructions executable by processor 1320 of apparatus 1300 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium is provided whose instructions, when executed by a processor of a terminal, cause the terminal to perform a voice processing method, the method comprising: receiving a first voice signal from an earphone device; determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by beautifying the first voice signal; the beautifying processing comprises: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user; and transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal by the earphone storage device.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Wherein the memory 1932 and storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations in a server. Still further, a central processor 1922 may be provided in communication with a storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
The embodiment of the invention discloses A1, a voice processing method, which is applied to an earphone storage device, and comprises the following steps:
receiving a first voice signal from a headset device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by carrying out beautifying processing on the first voice signal; the beautifying treatment comprises the following steps: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user;
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal.
A2, the method of A1, the user noise comprising: at least one of har, tremolo and accent.
A3, determining a second voice signal corresponding to the first voice signal according to the method of A1 or A2, including:
determining preset voiceprint characteristics corresponding to user noise;
and filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal to obtain a second voice signal.
A4, determining a second voice signal corresponding to the first voice signal according to the method of A1 or A2, including:
Determining a target voice class corresponding to the first voice signal;
and processing the first voice signal according to a first voice parameter corresponding to the target voice category, wherein a second voice parameter corresponding to the obtained second voice signal is matched with the first voice parameter.
A5, the method according to A1 or A2, wherein the beautifying treatment further comprises: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
A6, the method of A1 or A2, the method further comprising:
determining a first audio corresponding to a first keyword in the first voice signal or the voice signal of the communication opposite terminal;
outputting the first audio; and a second keyword in the first audio is matched with the first keyword, and the second keyword is derived from a preset work.
A7, the method of A1 or A2, the method further comprising:
determining a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
Outputting the second audio.
The embodiment of the invention discloses B8, a voice processing device, which is applied to an earphone storage device and comprises:
a receiving module for receiving a first voice signal from the earphone device;
the determining module is used for determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by carrying out beautifying processing on the first voice signal; the beautifying treatment comprises the following steps: filtering user noise in the first voice signal; the user noise characterizes noise generated by a user;
an output module, configured to send the second voice signal to an earphone device, so that the earphone device outputs the second voice signal; or, the earphone storage device plays and/or displays the second voice signal.
B9, the apparatus of B8, the user noise comprising: at least one of har, tremolo and accent.
B10, the apparatus of B8 or B9, the determining module comprising:
the first determining module is used for determining preset voiceprint features corresponding to user noise;
the first processing module is used for filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal so as to obtain a second voice signal.
B11, the apparatus of B8 or B9, the determining module comprising:
the second determining module is used for determining a target voice category corresponding to the first voice signal;
the second processing module is used for processing the first voice signal according to the first voice parameter corresponding to the target voice category, and the second voice parameter corresponding to the obtained second voice signal is matched with the first voice parameter.
B12, the device of B8 or B9, the beautification treatment further comprising: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
B13, the apparatus of B8 or B9, the apparatus further comprising:
a third determining module, configured to determine a first audio corresponding to a first keyword in the first voice signal or a voice signal of a communication opposite end;
the first audio output module is used for outputting the first audio; and a second keyword in the first audio is matched with the first keyword, and the second keyword is derived from a preset work.
B14, the apparatus of B8 or B9, the apparatus further comprising:
a fourth determining module, configured to determine a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
and the second audio output module is used for outputting the second audio.
The embodiment of the invention discloses C15, a device for voice processing, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for:
receiving a first voice signal from an earphone device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by performing beautification processing on the first voice signal; the beautification processing includes: filtering user noise from the first voice signal; the user noise characterizes noise generated by a user;
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal.
C16, the apparatus of C15, the user noise comprising: at least one of har, tremolo and accent.
C17, the apparatus according to C15 or C16, the determining a second speech signal corresponding to the first speech signal, including:
determining preset voiceprint characteristics corresponding to user noise;
and filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal to obtain a second voice signal.
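The voiceprint-based filtering of C17 can be illustrated with a toy sketch. The patent does not specify an algorithm; here, purely as an assumption, the preset voiceprint feature is modeled as a set of DFT bin indices associated with the user noise, which are zeroed out of each frame (a real system would use learned voiceprint embeddings and adaptive filtering):

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform (illustrative only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning real samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def filter_voiceprint(frame, noise_bins):
    """Remove the spectral bins tied to the preset 'voiceprint feature'
    of the user noise (hypothetical model: the feature is a set of
    DFT bin indices)."""
    spectrum = dft(frame)
    n = len(spectrum)
    for k in noise_bins:
        spectrum[k] = 0j
        spectrum[(n - k) % n] = 0j  # zero the mirror bin so output stays real
    return idft(spectrum)
```

Applied frame by frame, a tone occupying the filtered bins is removed while the rest of the signal passes through unchanged.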
C18, the apparatus according to C15 or C16, the determining the second speech signal corresponding to the first speech signal includes:
determining a target voice category corresponding to the first voice signal;
and processing the first voice signal according to a first voice parameter corresponding to the target voice category, wherein a second voice parameter corresponding to the obtained second voice signal matches the first voice parameter.
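C18's category-and-parameter matching can be sketched as follows: classify the signal by its measured fundamental frequency, then return the pitch-shift ratio that would make the processed signal's pitch (the second voice parameter) match the category's reference pitch (the first voice parameter). The frequency ranges and reference pitches are hypothetical values, not taken from the patent:

```python
# Hypothetical category table: fundamental-frequency ranges in Hz and a
# reference pitch serving as the "first voice parameter" per category.
CATEGORIES = {
    "male":   {"range": (85, 180),  "target_pitch": 120.0},
    "female": {"range": (165, 255), "target_pitch": 210.0},
    "child":  {"range": (250, 400), "target_pitch": 300.0},
}

def classify(f0):
    """Map a measured fundamental frequency to a target voice category."""
    for name, cfg in CATEGORIES.items():
        lo, hi = cfg["range"]
        if lo <= f0 <= hi:
            return name
    return "female" if f0 > 180 else "male"  # fallback heuristic

def match_parameter(f0):
    """Return (category, pitch-shift ratio) so that the processed signal's
    pitch matches the category's reference pitch."""
    category = classify(f0)
    ratio = CATEGORIES[category]["target_pitch"] / f0
    return category, ratio
```

A signal already at the reference pitch yields a ratio of 1.0 (no shift); others are scaled toward the category reference.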
C19, the apparatus of C15 or C16, the beautification processing further including: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
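Two of the sound effects named in C19 can be sketched in a few lines. The delay/decay values and the equal-mean-amplitude criterion are illustrative choices, not specified by the patent:

```python
def add_reverb(samples, delay, decay):
    """Minimal feedback-delay reverberation: each output sample mixes in
    a decayed copy of the output `delay` samples earlier."""
    out = list(samples)
    for i in range(delay, len(out)):
        out[i] += decay * out[i - delay]
    return out

def balance_channels(left, right):
    """Channel equalization sketch: rescale the right channel so both
    channels have the same mean absolute amplitude."""
    def level(ch):
        return sum(abs(s) for s in ch) / len(ch) or 1.0  # guard against /0
    gain = level(left) / level(right)
    return left, [s * gain for s in right]
```

An impulse fed through `add_reverb` produces a decaying echo tail; `balance_channels` equalizes a quieter channel up to the louder one.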
C20, the device of C15 or C16, wherein the one or more programs further comprise instructions for:
determining a first audio corresponding to a first keyword in the first voice signal or the voice signal of the communication opposite terminal;
outputting the first audio, wherein a second keyword in the first audio matches the first keyword, and the second keyword is derived from a preset work.
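The keyword-triggered audio of C20 amounts to a lookup: find a keyword in the recognized speech and return a clip from a preset work whose matching second keyword it contains. The table entries and clip names below are hypothetical, and a production system would apply speech recognition plus fuzzier matching than a naive substring test:

```python
# Hypothetical library mapping keywords to clips from preset works
# (song/film lines); the names are illustrative, not from the patent.
PRESET_WORKS = {
    "rain": ("rainy_day_song.clip", "listening to the rain"),
    "hello": ("greeting_movie.clip", "well hello there"),
}

def find_first_audio(speech_text):
    """Scan recognized text for a first keyword; return the preset-work
    clip whose second keyword matches it, or None if nothing matches."""
    for keyword, (clip, line) in PRESET_WORKS.items():
        if keyword in speech_text.lower():  # naive substring match
            return clip, line
    return None
```

This runs on either the local user's speech or the far end's speech, matching the two trigger sources named in C20.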
C21, the device of C15 or C16, wherein the one or more programs further comprise instructions for:
determining a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
outputting the second audio.
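C21's text matching between the first voice signal and the second audio can likewise be sketched as transcript containment. The entries are illustrative, and a real system would also match voice fragments acoustically rather than by exact text:

```python
PRESET_AUDIO = [
    # (transcript, clip id) -- illustrative entries, not from the patent
    ("twinkle twinkle little star", "twinkle.clip"),
    ("happy birthday to you", "birthday.clip"),
]

def find_second_audio(first_text):
    """Return the clip whose transcript (the 'second text') matches the
    recognized 'first text' -- here, simple containment either way."""
    t = first_text.lower().strip()
    for transcript, clip in PRESET_AUDIO:
        if t in transcript or transcript in t:
            return clip
    return None
```

For example, a user humming or reciting the opening words of a known song would retrieve the full recording for playback.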
Embodiments of the invention disclose D22, one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a method as described in one or more of A1-A7.
The foregoing has described in detail a speech processing method, a speech processing apparatus, and a device for speech processing. Specific examples are presented herein to illustrate the principles and embodiments of the present invention and to aid understanding of the method and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the present invention; in view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (16)

1. A voice processing method, applied to an earphone storage device, the method comprising:
receiving a first voice signal from an earphone device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by performing beautification processing on the first voice signal; the beautification processing includes: filtering user noise from the first voice signal; the user noise characterizes noise generated by a user;
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal;
determining a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
outputting the second audio;
the determining the second voice signal corresponding to the first voice signal includes:
determining a target voice category corresponding to the first voice signal;
processing the first voice signal according to a first voice parameter corresponding to the target voice category, wherein a second voice parameter corresponding to the obtained second voice signal matches the first voice parameter; wherein the target voice category comprises: a female voice, a male voice, or a child voice (Tong Yin).
2. The method of claim 1, wherein the user noise comprises: at least one of har, tremolo and accent.
3. The method according to claim 1 or 2, wherein said determining a second speech signal corresponding to the first speech signal comprises:
determining preset voiceprint characteristics corresponding to user noise;
and filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal to obtain a second voice signal.
4. The method according to claim 1 or 2, wherein the beautification processing further comprises: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
5. The method according to claim 1 or 2, characterized in that the method further comprises:
determining a first audio corresponding to a first keyword in the first voice signal or the voice signal of the communication opposite terminal;
outputting the first audio, wherein a second keyword in the first audio matches the first keyword, and the second keyword is derived from a preset work.
6. A voice processing device, characterized in that it is applied to an earphone storage device, the voice processing device comprising:
a receiving module for receiving a first voice signal from the earphone device;
the determining module is used for determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by performing beautification processing on the first voice signal; the beautification processing includes: filtering user noise from the first voice signal; the user noise characterizes noise generated by a user;
an output module, configured to send the second voice signal to an earphone device, so that the earphone device outputs the second voice signal; or, the earphone storage device plays and/or displays the second voice signal;
a fourth determining module, configured to determine a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
The second audio output module is used for outputting the second audio;
the determining module includes:
the second determining module is used for determining a target voice category corresponding to the first voice signal;
the second processing module is configured to process the first voice signal according to a first voice parameter corresponding to the target voice category, wherein a second voice parameter corresponding to the obtained second voice signal matches the first voice parameter, and the target voice category comprises: a female voice, a male voice, or a child voice (Tong Yin).
7. The apparatus of claim 6, wherein the user noise comprises: at least one of har, tremolo and accent.
8. The apparatus of claim 6 or 7, wherein the determining module comprises:
the first determining module is used for determining preset voiceprint features corresponding to user noise;
the first processing module is used for filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal so as to obtain a second voice signal.
9. The apparatus of claim 6 or 7, wherein the beautification processing further comprises: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
10. The apparatus according to claim 6 or 7, characterized in that the apparatus further comprises:
a third determining module, configured to determine a first audio corresponding to a first keyword in the first voice signal or a voice signal of a communication opposite end;
the first audio output module is used for outputting the first audio; and a second keyword in the first audio is matched with the first keyword, and the second keyword is derived from a preset work.
11. An apparatus for speech processing comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving a first voice signal from an earphone device;
determining a second voice signal corresponding to the first voice signal based on the first voice signal; the second voice signal is obtained by performing beautification processing on the first voice signal; the beautification processing includes: filtering user noise from the first voice signal; the user noise characterizes noise generated by a user;
transmitting the second voice signal to the earphone device so that the earphone device outputs the second voice signal; or playing and/or displaying the second voice signal;
determining a second audio corresponding to the first voice signal; the first text corresponding to the first voice signal is matched with the second text corresponding to the second audio, or the first voice fragment corresponding to the first voice signal is matched with the second voice fragment in the second audio;
outputting the second audio;
the determining the second voice signal corresponding to the first voice signal includes:
determining a target voice category corresponding to the first voice signal;
processing the first voice signal according to a first voice parameter corresponding to the target voice category, wherein a second voice parameter corresponding to the obtained second voice signal matches the first voice parameter; wherein the target voice category comprises: a female voice, a male voice, or a child voice (Tong Yin).
12. The apparatus of claim 11, wherein the user noise comprises: at least one of har, tremolo and accent.
13. The apparatus according to claim 11 or 12, wherein said determining a second speech signal corresponding to said first speech signal comprises:
determining preset voiceprint characteristics corresponding to user noise;
and filtering the sound signal corresponding to the preset voiceprint feature from the first voice signal to obtain a second voice signal.
14. The apparatus of claim 11 or 12, wherein the beautification processing further comprises: sound effect processing; the sound effect processing includes at least one of the following: surround processing, channel equalization processing, and reverberation processing.
15. The device of claim 11 or 12, wherein the one or more programs further comprise instructions for:
determining a first audio corresponding to a first keyword in the first voice signal or the voice signal of the communication opposite terminal;
outputting the first audio; and a second keyword in the first audio is matched with the first keyword, and the second keyword is derived from a preset work.
16. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-5.
CN202010508206.0A 2020-06-05 2020-06-05 Voice processing method, device and medium Active CN111696566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010508206.0A CN111696566B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium


Publications (2)

Publication Number Publication Date
CN111696566A CN111696566A (en) 2020-09-22
CN111696566B true CN111696566B (en) 2023-10-13

Family

ID=72479643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010508206.0A Active CN111696566B (en) 2020-06-05 2020-06-05 Voice processing method, device and medium

Country Status (1)

Country Link
CN (1) CN111696566B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331179A (en) * 2020-11-11 2021-02-05 北京搜狗科技发展有限公司 Data processing method and earphone accommodating device

Citations (9)

Publication number Priority date Publication date Assignee Title
CN101860774A (en) * 2010-05-31 2010-10-13 中山大学 Voice equipment and method capable of automatically repairing sound
CN103236263A (en) * 2013-03-27 2013-08-07 东莞宇龙通信科技有限公司 Method, system and mobile terminal for improving communicating quality
CN106371711A (en) * 2015-07-20 2017-02-01 联想(北京)有限公司 Information input method and electronic equipment
CN106464939A (en) * 2016-07-28 2017-02-22 北京小米移动软件有限公司 Method and device for playing sound effect
CN106920559A (en) * 2017-03-02 2017-07-04 奇酷互联网络科技(深圳)有限公司 The optimization method of conversation voice, device and call terminal
CN106992005A (en) * 2017-03-16 2017-07-28 维沃移动通信有限公司 A kind of pronunciation inputting method and mobile terminal
CN108900945A (en) * 2018-09-29 2018-11-27 上海与德科技有限公司 Bluetooth headset box and audio recognition method, server and storage medium
CN111131966A (en) * 2019-12-26 2020-05-08 上海传英信息技术有限公司 Mode control method, headphone system, and computer-readable storage medium
CN210579152U (en) * 2019-12-06 2020-05-19 歌尔科技有限公司 TWS earphone charging box and TWS earphone

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
EP2555189B1 (en) * 2010-11-25 2016-10-12 Goertek Inc. Method and device for speech enhancement, and communication headphones with noise reduction




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210706

Address after: 100084 Room 802, 8th floor, building 9, yard 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing Sogou Intelligent Technology Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

GR01 Patent grant