CN112911062B - Voice processing method, control device, terminal device and storage medium - Google Patents


Info

Publication number
CN112911062B
Authority
CN
China
Prior art keywords: voice, user, voice signal, signal, text information
Prior art date
Legal status
Active
Application number
CN201911214593.0A
Other languages
Chinese (zh)
Other versions
CN112911062A (en
Inventor
颜虹
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201911214593.0A
Publication of CN112911062A
Application granted
Publication of CN112911062B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a voice processing method, a control device, a terminal device and a storage medium. The voice processing method comprises: obtaining a first voice signal of a user, converting the first voice signal into text information, converting the text information into a second voice signal according to voice parameters preset by the user, and outputting the second voice signal. In the embodiments of the present application, when speech is disturbed by jolting, airflow or shortness of breath, or when the talker does not want to expose the current state out of privacy considerations, the voice signal can be converted into text information and the text information then converted into a voice signal in the normal state, which improves the quality of the voice signal and the experience of voice communication.

Description

Voice processing method, control device, terminal device and storage medium
Technical Field
The present invention relates to the field of communications, and in particular, to a voice processing method, a control device, a terminal device, and a storage medium.
Background
With the rapid development of science and technology and of mobile networks, voice communication between people is becoming more and more convenient and takes many forms, such as making a call with a mobile phone, making a voice call with communication software, or directly sending voice messages. However, during voice communication the following situations may arise: speech is disturbed by jolting, airflow or shortness of breath, and the talker has to speak louder, which reduces call quality; or, out of privacy considerations, the caller does not want to expose the current state and hopes that the call sounds the same as it normally would. The prior art offers no solution for these situations.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
In one aspect, the embodiments of the present application provide a voice processing method, a control device, a terminal device and a storage medium, which can improve the quality of voice signals and the experience of voice communication.
In another aspect, an embodiment of the present application provides a method for processing speech, including:
acquiring a first voice signal of a user;
converting the first voice signal into text information;
and converting the text information into a second voice signal according to voice parameters preset by a user, and outputting the second voice signal.
In another aspect, the embodiments of the present application further provide a control device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when running the computer program, performs the above voice processing method.
On the other hand, the embodiment of the application also provides a terminal device, which comprises the control device.
In yet another aspect, embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions for performing the above-described speech processing method.
The embodiments of the present application comprise: acquiring a first voice signal of a user, converting the first voice signal into text information, converting the text information into a second voice signal according to voice parameters preset by the user, and finally outputting the second voice signal. Based on this technical scheme, when speech is disturbed by jolting, airflow or shortness of breath, or when, out of privacy considerations, a caller does not want to expose the current state, the voice signal can be converted into text information and the text information then converted into a voice signal in the normal state, which improves the quality of the voice signal and the experience of voice communication.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technical solutions of the present application and constitute a part of this specification. Together with the embodiments, they illustrate the technical solutions of the present application and do not constitute a limitation of them.
Fig. 1 is a schematic structural diagram of a terminal device to which a voice processing method provided in an embodiment of the present application is applied;
FIG. 2 is a flowchart of a method for processing speech according to an embodiment of the present application;
fig. 3 is a flowchart of setting preset voice parameters in a voice processing method according to an embodiment of the present application;
FIG. 4 is a flowchart of a voice processing method according to an embodiment of the present application, after obtaining a first voice parameter and a second voice parameter, converting the text information into a second voice signal, and outputting the second voice signal;
FIG. 5 is a flowchart of acquiring a first voice signal of a user in a voice processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be understood that in the description of the embodiments of the present application, "a plurality" (or "multiple") means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it. Terms such as "first" and "second", if used, serve only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
The embodiment of the application provides a voice processing method, a control device, terminal equipment and a storage medium. The voice processing method is applied to the user terminal device 100. The control device is a control center of the terminal device 100, and may be the processor 101, or may be a peripheral circuit related to the processor 101 and a control function thereof.
The terminal device 100 is a device that can install various communication applications, or has a communication function. For example, smart phones, tablet computers, PCs, various wearable devices (headphones, watches, etc.), in-vehicle devices, etc. Fig. 1 shows a schematic diagram of one possible configuration of a terminal device 100. As shown in fig. 1, the terminal device 100 may include: an audio module 110, a mobile data network module 109, a WLAN module 108, a bluetooth module 107, a radio frequency module 106, a display screen 105, an input module 104, a sensing module 103, a memory 102, a processor 101, etc.
The audio module 110 may include a speaker, receiver, earphone or microphone, etc. for capturing or playing voice signals. The audio module 110 may further include an equalizer, and may process the voice signal.
The mobile data network module 109 and the WLAN module 108 may be used to connect to a network, and the mobile data network module 109 may also be used to conduct a call.
Bluetooth module 107 may be used to connect bluetooth devices. Such as a connection of a wearable device to a cell phone, a connection of a wireless headset to a cell phone, etc.
The radio frequency module 106 may be used for receiving and transmitting information or signals during a call; received information is passed to the processor 101 for processing, and signals generated by the processor 101 are transmitted.
The display screen 105 may be used to present a graphical user interface (GUI) for human-machine interaction. The graphical user interface contains various controls, application interfaces, and the like. The display screen 105 may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The input module 104 is used to receive numeric or character information input by a user and to generate key signal inputs related to user settings and function control of the terminal device 100. The input module 104 may include a touch panel, a pen sensor, a physical keyboard, function keys, an input sensing module, and the like.
The display screen 105 and the touch panel together may be referred to as a touch display screen, used to collect the user's touch operations on or near it (such as operations performed with a finger, stylus, or any other suitable object or accessory) and to drive the corresponding connection device according to a preset program. It may also display information entered by the user or provided to the user, as well as the various menus of the mobile phone. The touch display screen may be implemented in various types, such as resistive, capacitive, infrared light sensing, and ultrasonic, and the embodiments of the present application are not limited thereto.
The sensing module 103 may include various sensors, such as an acceleration sensor, a vibration sensor, a noise sensor, a gyroscope, a GPS, etc., which may measure the posture information or environmental parameters in which the terminal device 100 is located.
The processor 101 is the control center of the terminal device 100. It connects the components using various interfaces and lines, and performs the various functions of the terminal device 100 and processes data by running or executing software programs and/or modules stored in the memory 102 and calling data stored in the memory 102, thereby realizing the various services based on the terminal device 100. Optionally, the processor 101 may comprise one or more processing units. Optionally, the processor 101 may integrate an application processor, which primarily handles the operating system, user interface, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 101.
Memory 102 may be used to store software programs and modules. The processor 101 executes the various functional applications and data processing of the terminal device 100 by running the software programs and modules stored in the memory 102. In addition, the memory 102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
In addition, the terminal device 100 may further include a functional module such as a camera and a power management module, which will not be described in detail herein.
It should be understood that the structure of the terminal device 100 shown in fig. 1 does not constitute a limitation of the terminal device 100; the terminal device 100 provided in the embodiments of the present application may include more or fewer components than illustrated, may combine some components, or may arrange the components differently.
Fig. 2 is a flowchart of a voice processing method according to an embodiment of the present application. As shown in fig. 2, the method includes, but is not limited to, the steps of:
step 101: acquiring a first voice signal of a user;
step 102: converting the first voice signal into text information;
step 103: and converting the text information into a second voice signal according to voice parameters preset by a user, and outputting the second voice signal.
In this embodiment, the first voice signal of the user may be acquired in various ways. For example, the voice signal may be captured directly through the microphone of a mobile phone, or, when the phone is connected to a wireless Bluetooth headset, through the microphone of the headset; the captured signal is taken as the first voice signal. The first voice signal is thus an unprocessed original voice signal, while the converted second voice signal is a voice signal of better quality.
After the first voice signal is obtained, it is converted into corresponding text information. A speech recognition technique from the prior art can be used, for example resolving the first voice signal into smaller voice units and then converting them into corresponding text by means of an acoustic model and a deep-learning data model; the principle is not repeated here. In this embodiment, the conversion of the text information into the second voice signal is performed locally on the terminal device 100, which presupposes that the terminal device 100 has the corresponding hardware support, i.e. the corresponding hardware type and processing speed. With fast, high-performance hardware, the conversion into the second voice signal can be completed quickly with low time delay, meeting the requirements of a real-time call.
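As a purely illustrative sketch (not the patent's implementation), the three steps of Fig. 2 can be expressed as a pipeline in which hypothetical stand-in functions take the place of real ASR and TTS engines; all names and parameter values here are assumptions for illustration:

```python
# Sketch of steps 101-103: first voice signal -> text -> second voice signal.
# speech_to_text and text_to_speech are stubs standing in for real engines.
from dataclasses import dataclass

@dataclass
class VoiceParams:
    """Voice parameters preset by the user (hypothetical fields)."""
    fundamental_freq_hz: float = 120.0
    speaking_rate: float = 1.0

def speech_to_text(first_voice_signal: bytes) -> str:
    """Step 102: convert the first voice signal into text (stub)."""
    return first_voice_signal.decode("utf-8")  # stand-in for a real ASR engine

def text_to_speech(text: str, params: VoiceParams) -> bytes:
    """Step 103: synthesize a second voice signal from text (stub)."""
    header = f"[f0={params.fundamental_freq_hz}Hz]".encode("utf-8")
    return header + text.encode("utf-8")       # stand-in for a real TTS engine

def process_voice(first_voice_signal: bytes, params: VoiceParams) -> bytes:
    text = speech_to_text(first_voice_signal)  # step 102
    return text_to_speech(text, params)        # step 103
```

In a real device the two stubs would be replaced by calls to the on-device recognition and synthesis engines; only the control flow is meant to mirror the method.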
In one embodiment of the present application, step 103 is specifically: acquiring the voice parameters preset by the user locally, converting the text information into a second voice signal according to the preset voice parameters, and outputting the second voice signal. Specifically, the user can save the voice parameters locally in advance through a software interface or the like. Taking a mobile phone as an example, when voice processing is needed, the voice parameters saved in advance can be read out quickly and the text information converted into the second voice signal according to them. If the usage scenario is an earphone, the voice parameters can be stored directly on the earphone through a software interface, or the voice parameters stored on the mobile phone can be sent to the earphone via Bluetooth or the like, and the text information then converted into the second voice signal according to the preset voice parameters. Communication between a mobile phone and a wearable device is well established in the prior art and will not be described in detail here.
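A minimal sketch of how such a local preset might be stored and read back quickly; the file location and field names are illustrative assumptions, not part of the patent:

```python
# Hypothetical local persistence of the user's preset voice parameters,
# as might sit behind the software interface mentioned above.
import json
import os
import tempfile

def save_voice_params(path, params):
    """Persist the preset voice parameters (e.g. set via a settings UI)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f)

def load_voice_params(path):
    """Read back the locally stored preset voice parameters."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# example round trip (illustrative field names)
preset = {"fundamental_freq_hz": 118.5, "eq_gain_db": {"low": 3.0}}
path = os.path.join(tempfile.gettempdir(), "voice_params.json")
save_voice_params(path, preset)
```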
Fig. 3 is a flowchart of setting preset voice parameters in a voice processing method according to an embodiment of the present application. Referring to fig. 3, based on the above embodiment, preset voice parameters are set by:
step 201: acquiring a third voice signal recorded by a user in a reference environment;
step 202: analyzing the third voice signal to obtain voice parameters corresponding to the user.
Specifically, the reference environment refers to the normal environment in which a user performs voice communication. It is generally quiet, and the user is in a stable state, for example sitting in a room; the user's voice is then stable and clear, the surrounding noise is low, and the quality of voice communication is good. In the reference environment, the user pre-records a section of voice, namely the third voice signal. Its content may be arbitrary; its main function is to serve as the object of analysis from which the voice parameters corresponding to the user are obtained. It should be noted that the criterion for the reference environment is not unique: the user can freely select the most suitable scene as the reference environment according to their actual situation, and have it analyzed to obtain the preset voice parameters.
In an embodiment, analyzing the third voice signal to obtain the voice parameters corresponding to the user may comprise extracting acoustic features of the third voice signal to obtain first voice parameters corresponding to the user. The first voice parameters may include, but are not limited to, fundamental frequency, duration, and energy features, each defined per grammar unit such as a phoneme or syllable: the fundamental-frequency feature is the hertz value of the voice data corresponding to each grammar unit, the duration feature is the length in time of that voice data, and the energy feature is its amplitude. The specific extraction methods are available in the prior art and are not repeated here.
On this basis, an equalizer can be used to adjust and identify the high-, medium- and low-frequency bands of the third voice signal to obtain second voice parameters corresponding to the user. The second voice parameters may include, but are not limited to, the frequency band of the voice signal, its frequency points, its gain, and its quality factor. For example, if the usage scenario is a mobile phone, the equalizer may be disposed on the mobile phone, which performs the analysis of the third voice signal; if the usage scenario is an earphone, the equalizer may be disposed on the earphone, which analyzes the signal itself, or the equalizer may be disposed on the mobile phone, which performs the analysis and sends the result to the earphone via Bluetooth or the like. The choice depends on the data processing capacity of the earphone.
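To make the first voice parameters concrete, here is a toy sketch of extracting duration, energy (RMS), and a fundamental-frequency estimate from a mono sample frame. The zero-crossing pitch estimate is a crude stand-in for a real pitch tracker, chosen only so the example stays self-contained:

```python
# Illustrative extraction of duration, energy and f0 from one audio frame.
import math

def analyze_frame(samples, sample_rate):
    duration_s = len(samples) / sample_rate
    energy = math.sqrt(sum(s * s for s in samples) / len(samples))  # RMS amplitude
    # crude f0 estimate: a periodic signal crosses zero twice per period
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    f0_hz = crossings / (2 * duration_s)
    return {"duration_s": duration_s, "energy": energy, "f0_hz": f0_hz}

# synthetic test signal: a 200 Hz sine, 1 second at 8 kHz
sr = 8000
sine = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
```

A production system would use a proper pitch tracker and per-grammar-unit segmentation; this only shows what kind of quantities the first voice parameters are.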
Fig. 4 is a flowchart of a voice processing method according to an embodiment of the present application, after obtaining a first voice parameter and a second voice parameter, converting the text information into a second voice signal, and outputting the second voice signal. Referring to fig. 4, in an embodiment, after obtaining the first voice parameter and the second voice parameter, the text information is then converted into a second voice signal, and the second voice signal is output, which specifically includes the following steps:
step 301: inputting the first voice parameters into a voice synthesis model, and converting the text information into a second voice signal by utilizing the voice synthesis model;
step 302: and inputting the second voice parameters to an equalizer, and optimizing the second voice signals by using the equalizer.
The speech synthesis model may be built from a large amount of user speech data collected in advance; its construction is prior art and is not described here. Inputting the first voice parameters into the speech synthesis model allows the text information converted from the first voice signal to be converted into the second voice signal. On this basis, prosodic features can be introduced: a pre-trained prosody model performs prosodic analysis on the text information, making the generated second voice signal more real and natural. The prosodic features mainly include identifying the grammar units corresponding to the text, prosodic words, prosodic phrases, prosodic clauses, accents (grammar units that are grammatically stressed), and focuses (grammar units that the user deliberately emphasizes). The prosody model can be constructed from a large amount of voice and text data collected in advance; the specific construction method is the same as in the prior art and is not described here.
The equalizer can separately adjust the amplification of the various frequency components of the electrical signal; by adjusting signals of different frequencies it compensates for defects of the loudspeaker and the sound field and modifies various sound sources. Inputting the second voice parameters to the equalizer allows the generated second voice signal to be optimized, improving its quality and the experience of voice communication.
A practical example follows. The user records, in advance, their normal speech in a quiet environment on the mobile phone, and the corresponding voice parameters are generated from it. During a call, the phone captures the user's speech through the microphone, converts it into text information, and finally regenerates new speech by combining the text information with the quiet-environment voice parameters; this new speech is transmitted to the other party. Whatever environment the user is actually in, the other party hears the user's quiet-environment voice, which effectively improves call quality.
Based on the technical scheme of the embodiment of the application, when the sound during speaking is interfered due to the influence of jolt, airflow or shortness of breath, or due to privacy, when a caller does not want to expose the current state, the voice signal can be converted into the text information, and then the text information is converted into the voice signal in the normal state, so that the quality of the voice signal can be improved, and the experience of voice communication is improved.
In an embodiment, converting the text information into the second voice signal according to the voice parameters preset by the user may instead be done by sending the text information to a cloud server, obtaining from the cloud server the second voice signal corresponding to the text information, and outputting it. The second voice signal is produced by the cloud server converting the text information according to the user's preset voice parameters; that is, the conversion is not performed locally by the terminal device 100. The cloud server may provide an interface, and after it converts the text information into the second voice signal, the terminal device 100 may connect to the interface over the network to download and store the second voice signal. This conversion mode reduces the hardware requirements on the terminal device 100, which only needs a networking function: if the terminal device 100 is a mobile phone, the second voice signal can be downloaded over the mobile network or a WiFi network; if it is an earphone or a watch, it can be downloaded over the mobile network by adding a cellular network module. Thanks to the high bandwidth of 5G networks, the upload of the text information and the download of the second voice signal can be completed quickly with low time delay, meeting the requirements of a real-time call.
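The cloud variant can be sketched as a simple HTTP exchange: the terminal uploads the text (plus some reference to the user's preset parameters) and downloads the synthesized audio. The endpoint URL and JSON field names below are entirely hypothetical, invented for illustration:

```python
# Hedged sketch of the terminal-side request in the cloud-conversion mode.
# The URL and body fields are made-up examples, not a real service API.
import json
import urllib.request

CLOUD_TTS_URL = "https://tts.example.com/v1/synthesize"  # hypothetical endpoint

def build_synthesis_request(text, voice_params_id):
    """Build (but do not send) the upload request carrying the text
    information and an identifier for the user's preset voice parameters."""
    body = json.dumps(
        {"text": text, "voice_params_id": voice_params_id}
    ).encode("utf-8")
    return urllib.request.Request(
        CLOUD_TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and saving the response body would complete the download step; that part is omitted since the service is hypothetical.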
In an embodiment, the text information may also be converted into the second voice signal by acquiring the user's preset voice parameters from the cloud server and then performing the conversion locally according to them and outputting the second voice signal. In this case the preset voice parameters are stored not locally on the terminal device 100 but on the cloud server, which effectively prevents their loss through a failure of the terminal device 100 or similar causes that would force the user to preset them again. After the voice parameters are acquired, the conversion of the text information into the second voice signal is still completed locally on the terminal device 100, so this mode requires that the terminal device 100 have a networking function and also places certain requirements on its hardware.
Fig. 5 is a flowchart of acquiring a first voice signal of a user in a voice processing method according to an embodiment of the present application. Referring to fig. 5, on the basis of the above embodiment, the method for acquiring the first voice signal of the user specifically includes the following steps:
step 401: judging whether voice processing is needed or not;
step 402: when the voice processing is needed, a first voice signal of the user is acquired.
Adding this judgment step improves the flexibility of voice processing, as follows:
In an embodiment, determining whether voice processing is required may comprise acquiring the motion state of the user terminal device 100 and determining that voice processing is required when the motion state is a set state. Here, the set state refers to a state in which the user is busy, that is, a state that affects voice communication. Illustratively, the motion state of the user terminal device 100 is its posture information, such as vibration, tilting, or movement. Since the motion state of the terminal device 100 generally changes with the motion state of the user, the latter can be inferred from the former. For example, if the terminal device 100 is determined to be vibrating, the user may be in a bumpy state, such as running or sitting in a car; if it is tilted, the user may be exercising, for instance doing yoga; if it is moving, the user may be in a riding or driving scenario. It will be appreciated that the above is only exemplary, and the motion state may include other similar parameters. All the user states listed above are busy states. By judging the motion state of the user terminal device 100, the motion state of the user can be easily inferred, thereby realizing automatic voice processing.
Illustratively, the posture information of the terminal device 100 may include one or more of: the vibration amplitude of the terminal device 100, the vibration frequency of the terminal device 100, the tilt angle of the terminal device 100's center of gravity, and the speed change of the terminal device 100. Correspondingly, the vibration amplitude and vibration frequency may be measured by a vibration sensor, the tilt angle of the center of gravity by a gyroscope, and the speed change by GPS. The judgment mode in this embodiment automatically senses whether voice processing should be performed.
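As an illustrative sketch only, the mapping from posture information to a coarse user motion state might look like the following. The numeric thresholds are invented for illustration; the patent names the sensors (vibration sensor, gyroscope, GPS) but specifies no concrete values:

```python
# Assumed thresholds for illustration; real values would be tuned
# per device and are not prescribed by the patent.

def infer_user_state(vib_amplitude: float, vib_freq_hz: float,
                     tilt_deg: float, speed_mps: float) -> str:
    """Map terminal posture information to a coarse user motion state."""
    if vib_amplitude > 0.5 and vib_freq_hz > 2.0:
        return "bumpy"       # e.g. running or riding in a car
    if abs(tilt_deg) > 45.0:
        return "exercising"  # e.g. a yoga posture tilts the device
    if speed_mps > 3.0:
        return "moving"      # e.g. cycling or driving, via GPS speed
    return "idle"

def is_busy(state: str) -> bool:
    """Any non-idle state counts as a busy (set) state."""
    return state != "idle"
```

In the earphone or watch case mentioned later in the text, the same logic would simply run on that device's sensors.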
In an embodiment, whether voice processing is required may instead be determined by acquiring the user's parameter setting: when the setting is "voice processing on", voice processing is judged to be required. The user's parameter setting means that, before a voice communication, the user presets a corresponding instruction indicating that the communication should be voice-processed. For example, the user specifies in advance, in a settings menu provided by the terminal device 100, that the next voice communication requires voice processing; the parameter is then set to "voice processing on", so that when the next communication starts, the setting is read and the terminal device 100 performs voice processing automatically according to the user's intention. This mode implements a reservation function for voice processing. The judgment mode in this embodiment lets the user choose whether voice processing is performed.
In an embodiment, whether voice processing is required may instead be determined by acquiring a trigger signal from the user: when the trigger signal is received, voice processing is judged to be required. The trigger signal is a user instruction; illustratively, it may be a key-press instruction entered by the user or a voice instruction spoken by the user. For example, when the user starts a voice communication, an operation interface is presented on which the user selects whether the communication should be voice-processed; the choice may be indicated to the terminal device 100 by a touch key, a physical key, or voice control. The judgment mode in this embodiment likewise lets the user choose whether voice processing is performed.
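The two user-selectable modes — a parameter preset in advance (the reservation function) and a real-time trigger signal — can be sketched together as follows; the class and method names are assumptions for illustration:

```python
# Illustrative sketch of the two user-selectable judgment modes.
# Field and method names are assumptions, not part of the patent.

class VoiceProcessingSwitch:
    def __init__(self) -> None:
        self.preset_on = False         # set in advance via a settings menu
        self.trigger_received = False  # key press or voice instruction

    def reserve_next_call(self) -> None:
        """Reservation function: preset voice processing to 'on'."""
        self.preset_on = True

    def on_trigger(self) -> None:
        """Real-time trigger: touch key, physical key, or voice control."""
        self.trigger_received = True

    def needs_processing(self) -> bool:
        """Either user choice suffices to require voice processing."""
        return self.preset_on or self.trigger_received
```
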
In an embodiment, whether voice processing is required may instead be determined by acquiring information about the environment of the user terminal device 100: when the environment information matches a set state, voice processing is judged to be required. The environment information may be the noise level of the surroundings, collected correspondingly by a noise sensor. The set state here is a noisy environment: the environment counts as noisy only when the noise reaches a certain decibel level, which the user may set freely according to actual conditions. The judgment mode in this embodiment automatically senses whether voice processing should be performed.
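The environment-based judgment reduces to a threshold comparison. In the sketch below, the 70 dB default is an assumption, since the patent leaves the threshold to the user:

```python
# Illustrative sketch of the environment-based judgment mode.
# The default threshold is assumed; the patent lets the user set it.

def env_needs_processing(noise_db: float, threshold_db: float = 70.0) -> bool:
    """The environment counts as noisy, and voice processing is judged
    to be required, only once the measured noise level reaches the
    user-configured threshold."""
    return noise_db >= threshold_db
```
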
The following description uses a practical example. Suppose the user takes a phone call while running; the user's voice fluctuates sharply during the call. The mobile phone or earphone senses, through its vibration sensor, that the user is probably in a bumpy state, so it automatically converts the user's speech into text information, regenerates new speech from that text combined with the voice parameters recorded in a quiet environment, and transmits the result to the other party. Even though the user is talking while running, the other party hears speech as it would sound under normal conditions, which effectively improves call quality.
Similarly, when the user rides a motorcycle or bicycle, GPS positioning can identify the speed of the terminal device 100; on sensing that the user may be moving quickly, the user's voice is processed automatically during the call. As another example, when the user does yoga or gymnastics, a gyroscope can identify the tilt angle of the terminal device 100, from which it is inferred that the user may be exercising, and the user's voice is again processed automatically during the call. It will be appreciated that these detection methods do not apply to a mobile phone in every situation: during exercise the user may not carry the phone, communicating instead through an earphone or a watch, in which case the earphone or watch is the sensing terminal device 100.
Alternatively, for privacy protection, the user may not want to expose his or her current state. In that case, voice processing can be enabled for the communication through a preset or a keyboard instruction, so that whatever environment the user is in, the other party hears speech as recorded in a quiet environment. This both guarantees communication quality and protects the user's privacy.
In an embodiment based on the voice processing method above, in order to improve security, determining whether voice processing is required further includes verifying the user's identity.
Specifically, the user's identity may be verified in the following ways:
requiring the user to input a password, for which a password input interface needs to be provided; the password is preset by the user and may take the form of a numeric password, a pattern password, and the like;
requiring the user to perform fingerprint recognition, which can be implemented by the fingerprint recognition module of the terminal device 100;
requiring the user to perform face recognition, which can be implemented by the camera of the terminal device 100.
The above ways of identifying the user require a terminal device 100 with a graphical operation interface, i.e. one equipped with a display 105, such as a mobile phone or a PC. Not all terminal devices 100 are provided with a display screen 105, however, so the user's identity may also be verified as follows:
identifying the user's voiceprint information. The original user records and stores voiceprint information in advance for matching and verification; when the identity must be verified in real time, the stored voiceprint information of the original user is retrieved for comparison. Specifically, the user may be prompted by voice to speak a random passage; once the passage is captured, it is compared with the original user's voiceprint information to confirm the user's identity and judge whether the user has permission to perform voice processing. This voiceprint-based verification suits not only mobile phones and PCs but also terminal devices 100 with no or only a small user operation interface, such as earphones and watches, which improves the convenience of identification. Voiceprint extraction and comparison follow the prior art and are not repeated here.
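A minimal sketch of the voiceprint check follows. The comparison is reduced to cosine similarity over placeholder feature vectors, and the 0.8 threshold is an assumption; real voiceprint extraction, as the text notes, follows the prior art:

```python
# Illustrative sketch only: embeddings and threshold are placeholders,
# not a real voiceprint algorithm. Zero-length vectors are not handled.
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify_voiceprint(enrolled: list, live: list,
                      threshold: float = 0.8) -> bool:
    """True when the live random-passage sample matches the voiceprint
    the original user enrolled in advance."""
    return cosine_similarity(enrolled, live) >= threshold
```
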
It will be appreciated that the verification of the identity of the user is not limited to the above, and that other similar identity verification methods, such as iris recognition, etc., may be chosen by those skilled in the art and are not listed here.
In an embodiment, the identity verification step may also be placed where the user's voice parameters are acquired, rather than only where it is judged whether voice processing is required; the specific identification methods are the same as above and are not repeated here.
Illustratively, the voice processing method of the present application, applied to the user terminal device 100, may be used in a dial-up call scenario (communication through a mobile base station), a voice call scenario (communication through the Internet), or a voice message scenario. In the dial-up call and voice call scenarios, a call interface is entered before the first voice signal of the user is acquired, and outputting the second voice signal means outputting it into the voice call; in the voice message scenario, a voice input interface is entered before the first voice signal is acquired, and outputting the second voice signal means sending it as a voice message.
Thus the voice processing method is not limited to real-time voice calls; it also covers sending voice messages. A practical example follows. The user records, in advance, normal speech in a quiet environment into the mobile phone, from which the corresponding voice parameters are generated. When sending a voice message, the phone captures the user's speech through the microphone, converts it into text information, and regenerates new speech from that text combined with the quiet-environment voice parameters; the new speech is transmitted to the other party as a voice message. Whatever environment the user is in, the recipient hears speech as recorded in a quiet environment, which effectively improves the quality of voice messages. This voice-message scenario can be applied to existing instant-messaging software such as WeChat and QQ.
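The end-to-end flow described above — first voice signal to text, then text plus the preset voice parameters to the second voice signal — can be sketched with stubbed recognition and synthesis calls. All names are illustrative; the patent does not prescribe any particular recognizer or synthesizer:

```python
# Illustrative end-to-end sketch of the claimed pipeline with stubs.

def speech_to_text(first_voice_signal: bytes) -> str:
    """Stub recognizer: converts the first voice signal to text."""
    return "hello"

def synthesize(text: str, voice_params: dict) -> str:
    """Stub synthesizer driven by the user's preset voice parameters."""
    return f"tts({text}, pitch={voice_params['pitch']})"

def process_voice_message(first_voice_signal: bytes,
                          voice_params: dict) -> str:
    """First voice signal -> text information -> second voice signal."""
    text = speech_to_text(first_voice_signal)
    return synthesize(text, voice_params)  # the second voice signal
```

The returned value stands in for the second voice signal that would be played into the call or sent as a voice message.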
It should also be appreciated that various implementations of the methods provided by the embodiments of the present application may be arbitrarily combined to achieve different technical effects.
Fig. 6 shows a terminal device 100 provided by an embodiment of the present application. The terminal device 100 includes: a memory 102, a processor 101, and a computer program stored on the memory 102 and executable on the processor 101; when run, the computer program performs the voice processing method described above.
The processor 101 and the memory 102 may be connected by a bus or other means.
The memory 102, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs, such as those implementing the voice processing method described in the embodiments of the present application. The processor 101 implements the above voice processing method by running the non-transitory software programs and instructions stored in the memory 102.
The memory 102 may include a program storage area and a data storage area: the program storage area may store an operating system and at least one application program required for a function, while the data storage area may store data used in performing the voice processing method described above. In addition, the memory 102 may include high-speed random access memory and non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 102 may optionally include memory located remotely from the processor 101 and connected to the terminal device 100 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the above-described speech processing methods are stored in the memory 102 and when executed by the one or more processors 101, perform the above-described speech processing methods, e.g., perform method steps 101 through 103 depicted in fig. 2, method steps 201 through 202 depicted in fig. 3, method steps 301 through 302 depicted in fig. 4, and method steps 401 through 402 depicted in fig. 5.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions for executing the voice processing method.
In an embodiment, the computer-readable storage medium stores computer-executable instructions which, when executed by one or more processors 101 (for example, by one processor 101 in the terminal device 100), cause the one or more processors 101 to perform the above-described voice processing method, for example method steps 101 to 103 depicted in Fig. 2, method steps 201 to 202 depicted in Fig. 3, method steps 301 to 302 depicted in Fig. 4, and method steps 401 to 402 depicted in Fig. 5.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (12)

1. A voice processing method applied to a user terminal device, comprising:
acquiring a first voice signal of a user;
converting the first voice signal into text information;
converting the text information into a second voice signal according to voice parameters preset by a user, and outputting the second voice signal;
the preset voice parameters are set through the following steps:
acquiring a third voice signal recorded by a user in a reference environment;
analyzing the third voice signal to obtain voice parameters corresponding to the user;
the step of analyzing the third voice signal to obtain voice parameters corresponding to the user includes the following steps:
extracting acoustic characteristics of the third voice signal to obtain a first voice parameter corresponding to the user;
performing high-frequency, mid-frequency, and low-frequency band adjustment and recognition on the third voice signal by using an equalizer, to obtain a second voice parameter corresponding to the user;
wherein converting the text information into a second voice signal according to the voice parameters preset by the user and outputting the second voice signal comprises the following steps:
inputting the first voice parameters into a voice synthesis model, and converting the text information into a second voice signal by utilizing the voice synthesis model;
and inputting the second voice parameters to an equalizer, and optimizing the second voice signals by using the equalizer.
2. The method for processing voice according to claim 1, wherein the step of converting the text information into a second voice signal according to voice parameters preset by a user and outputting the second voice signal comprises one of the following steps:
acquiring a voice parameter preset by a user in a local area, converting the text information into a second voice signal according to the preset voice parameter, and outputting the second voice signal;
the text information is sent to a cloud server, a second voice signal corresponding to the text information from the cloud server is obtained, and the second voice signal is output, wherein the second voice signal is formed by converting the text information by the cloud server according to voice parameters preset by a user;
and acquiring voice parameters preset by a user from a cloud server, converting the text information into a second voice signal according to the preset voice parameters, and outputting the second voice signal.
3. The method for processing voice according to claim 1, wherein the step of obtaining the first voice signal of the user comprises the steps of:
judging whether voice processing is needed or not;
when the voice processing is needed, a first voice signal of the user is acquired.
4. A method of speech processing according to claim 3, wherein said determining whether speech processing is required comprises one of:
acquiring the motion state of user terminal equipment, and judging that voice processing is required when the motion state of the user terminal equipment is a set state;
acquiring parameter settings of a user, and judging that voice processing is required when the user parameter settings are voice processing starting;
acquiring a trigger signal of a user, and judging that voice processing is required when the trigger signal of the user is received;
and acquiring environment information of the user terminal equipment, and judging that voice processing is required when the environment information is in a set state.
5. The method for voice processing according to claim 4, wherein the step of acquiring the motion state of the user terminal device comprises:
acquiring gesture information of user terminal equipment;
and judging the motion state of the terminal equipment according to the gesture information.
6. The speech processing method of claim 5 wherein the gesture information comprises at least one of:
vibration amplitude of the user terminal equipment;
the vibration frequency of the user terminal equipment;
the gravity center inclination angle of the user terminal equipment;
speed change condition of user terminal equipment.
7. The method of claim 4, wherein the step of obtaining the trigger signal of the user comprises one of:
acquiring a keyboard trigger instruction input by a user;
and acquiring a voice trigger instruction of the user.
8. The method for processing voice according to claim 1, further comprising, before the step of obtaining the first voice signal of the user:
entering a call interface or a voice input interface.
9. The method of claim 1, wherein said outputting said second speech signal comprises one of:
outputting the second voice signal to a voice call process;
and outputting the second voice signal as a voice message.
10. A control apparatus, characterized by comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which computer program, when run, performs the speech processing method according to any one of claims 1 to 9.
11. A terminal device comprising a control apparatus as claimed in claim 10.
12. A computer readable storage medium storing computer executable instructions for performing the speech processing method of any one of claims 1 to 9.
CN201911214593.0A 2019-12-02 2019-12-02 Voice processing method, control device, terminal device and storage medium Active CN112911062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911214593.0A CN112911062B (en) 2019-12-02 2019-12-02 Voice processing method, control device, terminal device and storage medium

Publications (2)

Publication Number Publication Date
CN112911062A 2021-06-04
CN112911062B 2023-06-23

Family

ID=76103591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911214593.0A Active CN112911062B (en) 2019-12-02 2019-12-02 Voice processing method, control device, terminal device and storage medium

Country Status (1)

Country Link
CN (1) CN112911062B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134465A (en) * 2022-05-27 2022-09-30 青岛海尔科技有限公司 Text display method and device, storage medium and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448665A (en) * 2016-10-28 2017-02-22 努比亚技术有限公司 Voice processing device and method
CN107818787A (en) * 2017-10-31 2018-03-20 努比亚技术有限公司 A kind of processing method of voice messaging, terminal and computer-readable recording medium
CN109120790A (en) * 2018-08-30 2019-01-01 Oppo广东移动通信有限公司 Call control method, device, storage medium and wearable device

Also Published As

Publication number Publication date
CN112911062A (en) 2021-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant