CN112349266A - Voice editing method and related equipment - Google Patents


Info

Publication number
CN112349266A
Authority
CN
China
Prior art keywords
information
sound
voice
attribute information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910735271.4A
Other languages
Chinese (zh)
Inventor
赖国锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL Digital Technology Co Ltd
Original Assignee
Shenzhen TCL Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL Digital Technology Co Ltd filed Critical Shenzhen TCL Digital Technology Co Ltd
Priority to CN201910735271.4A
Publication of CN112349266A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice editing method, which comprises the following steps: performing voice recognition on received first sound information to acquire first sound attribute information and first text information of the first sound information; inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion; filtering sensitive information from the first text information to obtain second text information; and performing voice synthesis on the second sound attribute information and the second text information to obtain second sound information. By editing the voice information sent by the two communicating parties, the method can adjust the emotional tone carried in the voice information and filter out the sensitive information it contains, so that both parties can complete their business without interference from emotional factors, improving work efficiency and reducing the waiting time for event processing. In addition, the invention also discloses a voice editing apparatus and related equipment.

Description

Voice editing method and related equipment
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a voice editing method and related devices.
Background
In the prior art, common ways of voice communication include face-to-face conversation, telephone calls and network voice chat. When two parties communicate by voice, for example over the telephone or through a network voice chat, the exchange may not go smoothly: one party may lose control of his or her emotions, the voice may be too loud or too quiet, or the speech may contain one or more pieces of sensitive information. In the prior art, however, the voice information sent by the two parties is forwarded directly to the other side without being edited or filtered, so emotionally charged words and tone in the voice information can cause the two parties to quarrel, the matters that were originally planned cannot be handled and completed successfully, and transaction efficiency is low. Furthermore, if the voice communication takes place between a client and a human customer-service agent, events are handled inefficiently, the user waits a long time, the customer-service process may become non-standard, and the loss of clients may be serious, all of which inconveniences both parties.
The prior art therefore still needs to be improved.
Disclosure of Invention
In view of the above defects in the prior art, the present invention provides a voice editing method and related equipment, to overcome the drawback that, in the prior art, the voice information sent by the two communicating parties is not analyzed, so that sensitive information contained in the voice information may prevent the two parties from handling their business smoothly and thus lower work efficiency.
In a first aspect, an embodiment of the present invention provides a voice editing method, including:
receiving first sound information, performing voice recognition on the first sound information, and acquiring first sound attribute information of the first sound information and first text information contained in the first sound information;
inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion, wherein the sound conversion model is trained on the basis of the corresponding relation between input sound attribute information and target conversion sound attribute information; the target conversion sound attribute information is sound attribute information obtained by adjusting the sound channel parameters of the input sound attribute information according to a preset sound channel parameter range;
sensitive information filtering is carried out on the first text information to obtain second text information;
and performing voice synthesis on the second sound attribute information and the second text information to obtain second sound information.
Optionally, the step of performing speech recognition on the first sound information includes:
inputting the first voice information into a trained voice recognition model to obtain the first voice attribute information and the first text information; the voice recognition model is trained based on correspondence between input voice information, voice attribute information corresponding to the voice information, and text information included in the voice information.
Optionally, the sound conversion model includes: a voice analysis layer, a parameter prediction layer and an information conversion layer;
the step of inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion comprises:
inputting the first sound attribute information into a voice analysis layer to obtain a sound channel parameter corresponding to the first sound attribute information output by the voice analysis layer;
inputting the sound channel parameters corresponding to the first sound attribute information into a parameter prediction layer, and outputting the adjusted sound channel parameters after the first sound attribute information is adjusted by the parameter prediction layer according to a preset sound channel parameter range;
and inputting the adjusted channel parameters into an information conversion layer to obtain the second sound attribute information output by the information conversion layer.
Optionally, the step of inputting the channel parameter corresponding to the first sound attribute information into a parameter prediction layer, and outputting the adjusted channel parameter obtained after the parameter prediction layer adjusts the first sound attribute information according to a preset channel parameter range, includes:
the parameter prediction layer receives the sound channel parameters and extracts gender identifiers contained in the sound channel parameters;
and comparing the gender identification with a pre-stored answering party gender identification, if the gender identifications are the same, adjusting the gender of the first sound attribute information within the preset sound channel parameter range, and outputting an adjusted sound channel parameter, wherein the adjusted sound channel parameter is the sound channel parameter of the first sound attribute information after the gender of the first sound attribute information is changed.
Optionally, the step of inputting the channel parameter corresponding to the first sound attribute information into a parameter prediction layer, and outputting the adjusted channel parameter obtained after the parameter prediction layer adjusts the first sound attribute information according to a preset channel parameter range, includes:
and the parameter prediction layer receives the sound channel parameters, extracts prosody parameters and audio parameters contained in the sound channel parameters, adjusts the prosody parameters and the audio parameters to be within the preset sound channel parameter range, and outputs the adjusted sound channel parameters.
Optionally, the step of filtering the sensitive information of the first text information includes:
and inputting the first text information into a trained information filtering model to obtain second text information after filtering, wherein the information filtering model is trained on the basis of the corresponding relation between the text information and the marked sensitive information.
Optionally, the step of inputting the first text information into the trained information filtering model to obtain the filtered second text information further includes:
inputting the second text information into a deep semantic model to obtain integrated third text information; the deep semantic model is trained on the basis of the corresponding relation between the text information and the matching information; the matching information is text information whose semantic matching degree with the text information exceeds a preset threshold value;
inputting the third text information as the second text information into the speech synthesis model.
Optionally, the step of performing speech synthesis on the second sound attribute information and the second text information includes:
and inputting the second sound attribute information and the second text information into a speech synthesis model to obtain output second sound information, wherein the speech synthesis model is trained based on the corresponding relation among the sample sound attribute information, the second text information and the sample synthesis audio, and the sample synthesis audio is audio generated according to the sample sound attribute information and the second text information.
In a second aspect, an embodiment of the present invention provides a voice editing apparatus, including:
the voice recognition module is used for receiving first voice information, performing voice recognition on the first voice information, and acquiring voice attribute information of the first voice information and first text information contained in the first voice information;
the attribute information conversion module is used for inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion, and the sound conversion model is trained on the basis of the corresponding relation between the input sound attribute information and the target conversion sound attribute information; the target conversion sound attribute information is sound attribute information obtained by adjusting the sound channel parameters of the input sound attribute information according to a preset sound channel parameter range;
the text filtering module is used for filtering the sensitive information of the first text information to obtain second text information;
and the voice synthesis module is used for carrying out voice synthesis on the second sound attribute information and the second text information to obtain second sound information.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method.
Compared with the prior art, the embodiment of the invention has the following advantages:
according to the method provided by the embodiment of the invention, the sound information sent by both communication parties is edited to obtain the first sound attribute information and the first text information of the sound information, the first sound attribute information is converted by utilizing a trained sound conversion model, the sound channel parameter of the first sound attribute information is adjusted to be within the preset sound channel parameter range, and the sensitive information contained in the first text information is filtered, so that the sound information is edited based on two aspects of sound attribute and text information. In the embodiment, discontented emotion intonation possibly carried in the voice information is adjusted, and sensitive information possibly contained in the voice information is filtered, so that the intonation of the filtered voice information after adjustment is smooth and does not have sensitive information with bias tendency, a good communication environment is created for two communication parties, the two communication parties can smoothly complete event processing, the work efficiency is improved, the waiting time of the two communication parties is reduced, and a good basis is provided for the two communication parties to communicate again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of an exemplary application scenario in an embodiment of the present invention;
FIG. 2 is a flow chart of the steps of a method for speech editing according to an embodiment of the present invention;
FIG. 3 is a block diagram of another exemplary application scenario in an embodiment of the present invention;
fig. 4 is a schematic block diagram of a speech editing apparatus in an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing voice communication usually transmits the voice information of the two parties directly to the other side through a medium. When the two parties disagree, one or both of them may vent their dissatisfaction or utter words containing sensitive information, and that dissatisfaction is perceived directly by the other side. The escalating conflict may cause the whole exchange to end in failure, so that not only is the expected outcome not achieved, but adverse effects such as the loss of customers may also follow. A way of effectively defusing such conflicts during communication is therefore needed to improve the efficiency with which the two parties handle their business.
To solve the above problem, in the embodiment of the present invention, when either party sends voice information to the other, the voice information is received first and speech recognition is performed on it to obtain the sound attribute information and the text information of the sound information. The sound attribute information may include attributes such as the timbre, volume, intensity and pitch of the sound information, and the text information is the textual content contained in the sound information. The two parts are then processed separately: volume, timbre and pitch values that exceed the preset amplitude range in the sound attribute information are adjusted back into the preset parameter range, and sensitive words contained in the text information are deleted. Voice information better suited to communication is thus obtained, and the edited voice information is sent to the other side, achieving a better outcome for the exchange.
For example, the embodiment of the present invention may be applied to the scenario shown in FIG. 1. In this scenario, a server 102 with a voice editing function is placed between the client terminal 101 and the customer service terminal 103 and edits the voice information uttered by both. After receiving voice information from the client terminal 101 or the customer service terminal 103, the server 102 performs speech recognition on it, extracts its sound attribute information and text information, inputs the sound attribute information into a trained sound conversion model to obtain converted sound attribute information, filters the text information to remove sensitive information, finally synthesizes the converted sound attribute information and the filtered text information into edited voice information, and transmits the edited voice information to the receiving party's terminal.
It should be noted that the above application scenarios are only presented to facilitate understanding of the present invention, and the embodiments of the present invention are not limited in any way in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Exemplary method
Referring to fig. 2, a method for speech editing in an embodiment of the present invention is shown. In this embodiment, the method may include, for example, the steps of:
step S201, receiving first sound information, performing speech recognition on the first sound information, and acquiring first sound attribute information of the first sound information and first text information included in the first sound information.
The first sound attribute information is attribute information such as the volume, pitch and timbre of the first sound information. Volume corresponds to sound intensity; it is determined by the vibration amplitude of the sound wave and can also be understood as the amplitude of the sound signal. Pitch refers to how high or low a sound is and depends on the frequency of the sound wave; male voices are generally lower in pitch and female voices higher. Timbre, i.e. sound quality, is the basic characteristic that distinguishes one sound from another, for example the sounds emitted by different objects, and it is the most important factor in recognizing a human voice. The first text information is obtained by recognizing the vocabulary content contained in the first sound information and converting it into text.
In this embodiment, the first sound information sent by the communication party may be received by a microphone, or the first sound information sent by either party of the communication party may be received by a telephone or in another way, where the first sound information may be in a plurality of audio formats such as MP3 format and wav format.
After receiving first sound information input by a target user, the voice editing equipment performs voice recognition on the first sound information to obtain first sound attribute information of the first sound information and first text information contained in the first sound attribute information.
In this step, speech recognition of the first sound information first acquires the first sound attribute information of the first sound information; since the sound attribute information is a physical characteristic of the sound information itself, the first sound attribute information can be acquired at the same time the first sound information is acquired. The first sound attribute information contains the volume of the first sound information, the pitch and the amplitude of the tone. The vocabulary content contained in the first sound information is then converted into text by speech recognition software, yielding the first text information of the first sound information; examples of such software include Baidu speech recognition, the Speech Master app, and other speech-to-text applications. For instance, after a customer utters the voice message "can I apply for a return", the vocabulary content contained in the voice information is "can I apply for a return", and the attribute information records that the current volume is twice the preset volume threshold, together with the amplitude of the tone and the frequency corresponding to the pitch; the preset volume threshold may be any value between 80 dB and 100 dB.
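For illustration, a minimal extraction sketch follows, assuming the librosa and SpeechRecognition Python packages are available; the pitch-search limits and the use of the Google web recognizer are illustrative choices rather than part of this disclosure.

    # Sketch: extract volume and pitch (sound attribute information) and the
    # spoken text (first text information) from a received wav file.
    import librosa
    import numpy as np
    import speech_recognition as sr_lib

    def extract_sound_attributes(wav_path):
        y, sample_rate = librosa.load(wav_path, sr=None)
        rms = librosa.feature.rms(y=y)[0]                        # frame-level energy
        volume_db = float(np.mean(librosa.amplitude_to_db(rms)))
        f0 = librosa.yin(y, fmin=65, fmax=2093, sr=sample_rate)  # pitch track in Hz
        return {"volume_db": volume_db, "pitch_hz": float(np.nanmean(f0))}

    def recognize_first_text(wav_path):
        recognizer = sr_lib.Recognizer()
        with sr_lib.AudioFile(wav_path) as source:
            audio = recognizer.record(source)
        return recognizer.recognize_google(audio, language="zh-CN")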
In addition, the speech recognition of the first sound information in this step may be implemented with a neural network model. Specifically, the first sound attribute information and the first text information may be obtained by performing speech recognition on the first sound information with a trained sound recognition model, the sound recognition model being trained from a preset first neural network model. In that case, the step of performing speech recognition on the first sound information with the sound recognition model includes:
first, a voice recognition model for performing voice recognition is trained based on a first neural network model.
In this step, a preset first neural network model is trained on input sound information and on verification sound information that carries sound attribute information labels and text information labels. The training method comprises the following steps:
firstly, a plurality of voice information training sets used for training the voice recognition model and voice information verification sets used for verifying the voice recognition model are collected.
The voice information training set contains a plurality of training voice sample information, the voice information verification set contains a plurality of verification voice sample information, and the verification voice sample information comprises: the voice training system comprises voice attribute information and text information, wherein the voice attribute information is the real value of the voice attribute information of each training voice information, and the text information is the real value of the text information converted from the voice contained in each training voice information. A set of sound attribute information and text information corresponds to a training sound sample information.
The training voice sample information is voice sample information for training. And the plurality of sound attribute information and the text information contained in the plurality of verification sound information are used for verifying the output result obtained by inputting the training sound sample information into the first neural network model, and optimizing and adjusting the parameters of the first neural network model according to the verification result.
Each piece of training sound sample information in the training set is input into the first neural network model to obtain the sound attribute information and text information output by the first neural network model for that sample. The output sound attribute information and text information are predicted values obtained by performing sound recognition on each training sound sample. The predicted values are checked against the true sound attribute information and text information of the training samples contained in the sound information verification set to obtain the error between the predicted values and the true values, and the parameters of the first neural network model are optimized according to this error.
Specifically, the training sound sample information contained in the sound information training set is input into the first neural network model, and the sound attribute information and text information that the first neural network model outputs for each training sound sample in the training set are obtained.
The true values of the sound attribute information and text information of each training sound sample contained in the sound information verification set are compared with the predicted values output by the first neural network model to obtain the error of the current round of training, the parameters of the first neural network model are adjusted according to the error, and the step of inputting each training sound sample contained in the sound information training set into the first neural network model is repeated until the verification error falls within a preset range, at which point training is complete and the trained sound recognition model is obtained.
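A schematic sketch of such a training loop follows, assuming PyTorch; the recognition model, the data loaders and the 0.05 stopping threshold are hypothetical placeholders rather than the actual training configuration of this embodiment.

    import torch
    import torch.nn as nn

    def train_sound_recognition_model(model, train_loader, val_loader,
                                      error_threshold=0.05, max_epochs=100):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        attr_loss = nn.MSELoss()           # error on predicted sound attributes
        text_loss = nn.CrossEntropyLoss()  # error on predicted text tokens
        for _ in range(max_epochs):
            model.train()
            for audio, true_attrs, true_tokens in train_loader:
                pred_attrs, pred_tokens = model(audio)
                loss = attr_loss(pred_attrs, true_attrs) + \
                       text_loss(pred_tokens.transpose(1, 2), true_tokens)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # verify against the true values and stop once the error is small
            model.eval()
            with torch.no_grad():
                val_error = sum(attr_loss(model(a)[0], t).item()
                                for a, t, _tok in val_loader) / max(len(val_loader), 1)
            if val_error < error_threshold:
                break
        return model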
Secondly, inputting the first voice information into the voice recognition model to obtain the first voice attribute information and the first text information;
the voice recognition model is trained based on correspondence between input voice information, voice attribute information corresponding to the voice information, and text information included in the voice information.
And after the first voice information is input into the trained voice recognition model, the voice recognition model outputs the recognized first voice attribute information and first text information.
Step S202, inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion, wherein the sound conversion model is trained on the basis of the corresponding relation between the input sound attribute information and the target conversion sound attribute information. And the target conversion sound attribute information is sound attribute information obtained by adjusting the sound channel parameters of the input sound attribute information according to a preset sound channel parameter range.
The second sound attribute information is obtained by adjusting the sound channel parameters of the first sound information to fall within a preset sound channel parameter range. The sound channel parameters are the parameters corresponding to the sound attribute information, and the preset sound channel parameter range is the preset parameter range for each item of sound attribute information, including preset ranges for parameters such as volume, pitch and timbre. Adjusting the volume, pitch, timbre and other channel parameters of the target converted sound attribute information into the preset parameter range creates a harmonious atmosphere for the two communicating parties. For example, the preset sound channel parameter range may require the volume to lie between 50 dB and 100 dB and the audio frequency corresponding to the pitch to lie between 200 Hz and 4000 Hz; the timbre of the sound information is controlled by adjusting the fundamental tone and its harmonics.
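A minimal sketch of this range adjustment follows; the dictionary layout of the channel parameters is an assumption for illustration, and the ranges mirror the example values given above.

    PRESET_CHANNEL_RANGES = {
        "volume_db": (50.0, 100.0),   # preset volume range
        "pitch_hz": (200.0, 4000.0),  # preset audio-frequency range for pitch
    }

    def adjust_channel_params(params):
        adjusted = dict(params)
        for name, (low, high) in PRESET_CHANNEL_RANGES.items():
            if name in adjusted:
                adjusted[name] = min(max(adjusted[name], low), high)
        return adjusted

    # e.g. a shout at 120 dB and 4500 Hz is pulled back into the preset range:
    # adjust_channel_params({"volume_db": 120.0, "pitch_hz": 4500.0})
    # -> {"volume_db": 100.0, "pitch_hz": 4000.0}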
The sound conversion model is used for adjusting sound attribute information in the first sound information and outputting adjusted second sound attribute information. Specifically, the acoustic conversion model is trained based on a correspondence between input acoustic attribute information and target conversion acoustic attribute information. And the target conversion sound attribute information is sound attribute information obtained by adjusting the input sound attribute information.
The sound conversion model is obtained by training the second neural network model through a plurality of pieces of training sound attribute information, the sound attribute information in the first sound information is adjusted through the sound conversion model, and the adjusted second sound attribute information is output.
Specifically, the training method of the voice conversion model includes the following steps:
firstly, a plurality of sound attribute information training sets and sound attribute information verification sets are collected, wherein the sound attribute information training sets are used for training the sound conversion model, the sound attribute information training sets contain a plurality of training sound attribute sample information, and the sound attribute information verification sets contain a plurality of target conversion sound attribute sample information. And the target converted sound attribute sample information is the real value of the sound attribute information obtained by adjusting the sound channel parameters of the input sound attribute sample information according to the preset sound channel parameter range.
The training sound attribute sample information is input into the second neural network model to obtain the target conversion sound attribute information that the second neural network model outputs for each training sound attribute sample. This output is the predicted value obtained after sound conversion of the training sound attribute sample; the verification sound attribute information is used to check the predicted value, the error between the predicted value and the true value is obtained, and the parameters of the second neural network model are optimized according to the error.
And inputting the training sound attribute sample information contained in the sound attribute information training set into a preset second neural network model respectively to obtain target conversion sound attribute information corresponding to the sound attribute sample information in the sound attribute information training set and output by the second neural network model.
The target conversion sound attribute information output by the second neural network model is compared with the true target conversion sound attribute sample information of each training sound attribute sample in the verification set to obtain the error of the current round of training, the parameters of the second neural network are adjusted according to the error, and the step of inputting each piece of training sound attribute information contained in the training set into the second neural network model is repeated until the training error falls within a preset range, at which point training is complete and the trained sound conversion model is obtained.
And inputting the first sound attribute information into a trained sound conversion model to obtain output second sound attribute information, wherein the second sound attribute information is sound attribute information obtained by adjusting the sound channel parameters of the first sound attribute information according to a preset sound channel parameter range.
Specifically, the sound conversion model includes: a speech analysis layer, a parameter prediction layer and an information conversion layer. The voice analysis layer is configured to acquire the channel parameters corresponding to the first sound attribute information; the parameter prediction layer is configured to predict, based on the channel parameters acquired by the voice analysis layer, the channel parameters corresponding to the target conversion sound attribute, yielding the adjusted channel parameters; and the information conversion layer is configured to generate the second sound attribute information from the adjusted channel parameters.
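A structural sketch of this three-layer arrangement follows, assuming PyTorch; the layer widths and the dimensionalities of the attribute and parameter vectors are illustrative assumptions, not values from this disclosure.

    import torch
    import torch.nn as nn

    class SoundConversionModel(nn.Module):
        def __init__(self, attr_dim=16, param_dim=8):
            super().__init__()
            # voice analysis layer: first sound attributes -> channel parameters
            self.analysis = nn.Sequential(nn.Linear(attr_dim, 64), nn.ReLU(),
                                          nn.Linear(64, param_dim))
            # parameter prediction layer: predicts the adjusted channel parameters
            self.prediction = nn.Sequential(nn.Linear(param_dim, 64), nn.ReLU(),
                                            nn.Linear(64, param_dim))
            # information conversion layer: adjusted parameters -> second attributes
            self.conversion = nn.Sequential(nn.Linear(param_dim, 64), nn.ReLU(),
                                            nn.Linear(64, attr_dim))

        def forward(self, first_attrs):
            channel_params = self.analysis(first_attrs)
            adjusted_params = self.prediction(channel_params)
            return self.conversion(adjusted_params)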
Further, the step of inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion includes:
step 2021, inputting the first sound attribute information into a speech analysis layer, so as to obtain a sound channel parameter corresponding to the first sound attribute information output by the speech analysis layer.
First, the first sound attribute information is input into the voice analysis layer, and the channel parameters of the first sound attribute information are analyzed with a preset voice signal analysis model.
The voice analysis layer uses linear predictive analysis of the speech signal to analyze the channel parameters of the first sound attribute information accurately. The channel parameters obtained by linear predictive analysis are audio parameters, including the frequency of the sound, the volume of the sound, the spectral amplitude of the sound, and so on. Meanwhile, the gender identifier contained in the vocal tract parameters can be obtained by voice analysis in R. The gender identifier marks whether the vocal tract parameters belong to a male voice or a female voice and is usually stored in the last field of the voice attributes; the R-based voice analysis extracts the voice attributes of the sound information and determines from them whether the speaker is male or female.
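A rough heuristic sketch of such a gender identifier follows; the 165 Hz threshold is an illustrative assumption, whereas the embodiment itself obtains the identifier from the trained analysis described here.

    def gender_identifier(mean_pitch_hz):
        # label the speaker from the average fundamental frequency of the voice
        return "female" if mean_pitch_hz >= 165.0 else "male"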
Step 2022, inputting the channel parameter corresponding to the first sound attribute information into a parameter prediction layer, and outputting the adjusted channel parameter after the parameter prediction layer adjusts the first sound attribute information according to a preset channel parameter range.
The parameter prediction layer predicts, from the input channel parameters, the channel parameters corresponding to the second sound attribute information.
In this embodiment, to make the communication more efficient, whether the sender's voice is male or female is identified during the adjustment of the vocal tract parameters. When the sender's voice is identified as female and the receiver of the sound information is also female, the female voice corresponding to the first sound attribute information may be adjusted to a male voice; if the receiver of the sound information is male, the gender of the voice need not be adjusted. Because communication between a female and a male tends to be easier, this improves communication efficiency. For example, when a male client is about to lose his temper, calm communication may be restored quickly because the voice he hears from the customer-service agent on the other side is a female voice.
Therefore, when adjusting the channel parameters, the parameter prediction layer can make two different kinds of adjustment to the channel parameters of the first sound attribute information: the first changes the gender and the gender identifier of the first sound attributes and also adjusts their other sound parameters; the second adjusts only the sound parameters other than the gender and the gender identifier, such as the audio parameters (volume, pitch and timbre) and/or the prosody parameters (energy curve, speaking rate and resonance frequency curve). The gender identifier is either male voice or female voice, and the gender of the first sound attributes may be male or female. The parameter prediction layer extracts the gender of the first sound attributes, identifies whether the voice is male or female, and adjusts the gender according to the identification result.
When the gender and gender identifier of the first sound attributes are changed, the adjustment comprises the following steps:
the parameter prediction layer receives the sound channel parameters and extracts gender identifiers contained in the sound channel parameters;
and comparing the gender identification with a pre-stored answering party gender identification, if the gender identifications are the same, adjusting the gender of the first sound attribute information within the preset sound channel parameter range, and outputting an adjusted sound channel parameter, wherein the adjusted sound channel parameter is the sound channel parameter of the first sound attribute information after the gender of the first sound attribute information is changed.
The pre-stored receiver gender identifier may be male or female and corresponds to the receiving party of the communication. For example, when the sender of the first voice message is a client, the receiver may be a customer-service agent. Since a customer-service post is generally fixed, the gender of the agent at that post, male or female, can be stored in the voice editor in advance. If the voice editor identifies the sender of the first voice message as male and the pre-stored gender of the customer-service agent is also male, the gender of the voice attributes corresponding to the first voice message can be changed to female, so the voice is converted into a female voice before being sent to the customer-service agent. Because communication between people of opposite sexes tends to be smoother, changing the gender of the voice attribute information can produce a better communication effect.
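A minimal sketch of this comparison and conversion decision follows; the parameter names and the pitch-shift factors are illustrative assumptions.

    def maybe_convert_gender(sender_gender, stored_receiver_gender, channel_params):
        # only convert when sender and receiver share the same gender identifier
        if sender_gender != stored_receiver_gender:
            return channel_params                    # keep the original voice
        adjusted = dict(channel_params)
        adjusted["gender"] = "female" if sender_gender == "male" else "male"
        # crude pitch shift toward the target gender's typical range
        factor = 1.6 if adjusted["gender"] == "female" else 0.6
        adjusted["pitch_hz"] = channel_params.get("pitch_hz", 200.0) * factor
        return adjusted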
The gender identifier of the listener of the first sound information can also be obtained by recognizing voice information previously sent by the listener. The specific method of recognizing the gender of voice information can be implemented with a neural network model, for example a simple neural network trained on a voice-gender data set processed with the R language: the data set extracts a number of voice attributes from each wav file, and its last field is labelled male voice or female voice, so whether the input voice attribute information is a male or a female voice can be read from that label.
When the gender in the vocal tract parameters of the first sound attributes is left unchanged and only the other sound parameters are adjusted, the method comprises the following step:
and the parameter prediction layer receives the sound channel parameters, extracts prosody parameters and audio parameters contained in the sound channel parameters, adjusts the prosody parameters and the audio parameters to be within the preset sound channel parameter range, and outputs the adjusted sound channel parameters.
The prosodic parameters include: an energy curve, a sound rate, and a resonant frequency curve, the audio parameters including: volume, tone, and timbre.
The parameter prediction layer may adjust the channel parameters with any algorithm such as codebook mapping, a discrete transfer function, a neural network, or a Gaussian mixture model.
Step 2023, inputting the adjusted channel parameters into the information conversion layer to obtain the second sound attribute information output by the information conversion layer.
The information conversion layer takes the adjusted channel parameters from the parameter prediction layer and synthesizes the second sound attribute information from them. The information conversion layer may use a sound synthesis algorithm to generate the second sound attribute information from the adjusted channel parameters.
Step S203, performing sensitive information filtering on the first text information to obtain second text information.
The first text information may contain sensitive information, namely uncivil words, words with angry, violent or pornographic tendencies, or politically sensitive words, which may make the receiver of the voice message uncomfortable and thereby disturb the communication between the two parties. Taking uncivil words as an example: when a customer's request for a return is refused, the customer may utter uncivil words with an angry tone; the receiver of the voice message senses the dissatisfaction in those words and may respond with resentment of his or her own, so that the two parties can no longer communicate normally about the next step.
In this step, the sensitive information in the first text information may be filtered by looking the text up in a pre-established sensitive information database through text matching, or by using sensitive-information filtering software or an equivalent algorithm.
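A minimal sketch of the matching approach follows, assuming the sensitive information database is a plain set of words; the example entries are illustrative only.

    SENSITIVE_WORDS = {"damn", "idiot"}   # illustrative entries only

    def filter_sensitive(text, lexicon=SENSITIVE_WORDS):
        # word-level matching; Chinese text would need word segmentation first
        kept = [word for word in text.split() if word.lower() not in lexicon]
        return " ".join(kept)

    # filter_sensitive("this damn return request") -> "this return request"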
Step S204, performing voice synthesis on the second sound attribute information and the second text information to obtain second sound information.
The second sound attribute information and the second text information obtained in step S202 and step S203 are synthesized into the edited second sound information. Because the second sound information is obtained by adjusting the sound attribute information of the first sound information and filtering out the sensitive information, its language is gentle and it contains no sensitive statements, so the above steps create a harmonious atmosphere for communication.
In this embodiment, in order to obtain a more accurate filtering result, the step of filtering the sensitive information of the first text information in step S203 includes:
and inputting the first text information into a trained information filtering model to obtain second text information after filtering, wherein the information filtering model is trained on the basis of the corresponding relation between the text information and the verification text information marked with sensitive information.
The information filtering model is used for filtering the text information input into the information filtering model and is trained on the basis of a preset third neural network model, and the training method comprises the following steps:
firstly, a plurality of text information training sets and text information verification sets for training the information filtering model are collected, wherein the text information training sets contain a plurality of training text information, and the text information verification sets contain a plurality of verification text information. The plurality of verification text messages are text sample templates obtained after sensitive information filtering is carried out on each training text message.
The training text information is input into the third neural network model to obtain the text information that the third neural network model outputs after filtering the sensitive information from each piece of training text information. The verification text information is used to check the filtered text information output by the third neural network model: the two are compared to obtain the error between the verification text and the model output, and the parameters of the third neural network model are optimized according to this error.
Specifically, each piece of training text information contained in the text information training set is input into the third neural network model, and the text information output by the third neural network model after each piece of training text information is filtered is obtained.
And comparing the filtered text information output by the third neural network model with the verification text information in the verification set to obtain an error of the training, adjusting parameters of the third neural network model according to the error, repeating the step of inputting each training text information contained in the text information training set into the third neural network model until the obtained training error is within a preset range, and finishing the training step to obtain the trained information filtering model.
And using the information filtering model obtained by training in the step to filter the sensitive information of the first text information to obtain filtered second text information output by the information filtering model.
In the step, the first text information is filtered by using an information filtering model based on a neural network model to obtain filtered second text information, wherein the information filtering model is trained based on the corresponding relation between the text information and the text information marked as sensitive information, so that more accurate results can be obtained.
In this embodiment, because the second text information is obtained by deleting the sensitive information from the first text information, the resulting sentence may not read smoothly. For example, for the client statement "I love XXX", if "XXX" is determined to be politically sensitive vocabulary, i.e. sensitive information, then after "XXX" is filtered out the statement becomes "I love", which is incomplete. Therefore, the step of inputting the first text information into the trained information filtering model in step S203 and obtaining the filtered second text information further includes:
step S2031, inputting the second text information into a deep semantic model to obtain integrated third text information; the depth semantic model is trained on the basis of the corresponding relation between the first sample text information and the matching information; the matching information is the text information of which the semantic matching degree with the first sample text information is greater than a preset threshold value.
The deep semantic model is used for integrating text information input into the deep semantic model and is trained on the basis of a preset fourth neural network model, and the training method comprises the following steps:
and collecting a plurality of semantic text training sets used for training the deep semantic model and a semantic text verification set used for verifying, wherein the semantic text training sets contain a plurality of training semantic texts, and the semantic text verification set contains a plurality of verification semantic texts. The verification semantic text is a semantic text integrated with the training semantic text, wherein the integration of the training semantic text is to match the training semantic text to a text with similar semantics based on semantic matching degrees between text information, and then integrate the training semantic text into a text with natural word order conforming to the grammatical structure and the word habit based on the grammatical structure and the word habit according to the matched text, and it can be understood that the verification semantic text is a text with natural word order conforming to the grammatical structure and the word habit obtained by adjusting the training semantic text according to a preset rule. Such as: one of the sentences of the training semantic text is: "are the garments worn? Poor quality ", based on semantic matching between text messages, matching the training semantic text to a text having a semantic matching degree exceeding a preset matching degree" the clothing is worn by a person, the quality is poor ", is the training semantic text" the clothing is worn? The clothes with poor quality and the matched text are worn by people, the poor quality integrates grammatical structures and expression habits of the contents of the two texts, and the text to be expressed by the two texts is obtained, namely, the clothes with poor quality, so that the verification semantic text corresponding to the training semantic text is obtained.
And the verification semantic text is used for verifying the integration result of the training semantic text output in the fourth neural network model, judging the error between the semantic text template and the integration result output by the model, and optimizing the parameters of the fourth neural network model according to the error.
And inputting the training semantic text into the fourth neural network model to obtain an integration result of the training semantic text output by the fourth neural network model. And verifying the integration result output by the fourth neural network model by using the verification semantic text, and optimizing and adjusting the parameters of the fourth neural network model according to the verification result.
And inputting a plurality of training semantic texts contained in the semantic text training set into the fourth neural network model to obtain third text information output by the fourth neural network model, wherein the third text information is obtained by integrating all semantic text information in the semantic text training set.
Comparing the integrated text information output by the fourth neural network model with the verification semantic texts in the verification set to obtain an error of the training, adjusting parameters of the fourth neural network model according to the error, repeating the step of inputting a plurality of training semantic texts and a plurality of verification semantic texts contained in the semantic text training set into the fourth neural network model until the obtained training error is within a preset range, and completing the training step to obtain the trained deep semantic model.
The deep semantic model obtained by the training in this step is used to integrate the second text information, giving the integrated third text information output by the deep semantic model.
This step re-integrates the filtered sentences on the basis of the semantic matching degree between text items, so the integrated sentences read more smoothly and the content is easier for the other party to understand. For example, suppose the customer's original statement contained uncivil words and is reduced by sensitive-information filtering to "Has this clothing been worn? The quality is poor"; after integration by the deep semantic model it becomes "The quality of this clothing is poor", whose semantics are clearer, so a better communication effect can be achieved.
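A minimal sketch of the semantic-matching idea follows, assuming the sentence-transformers Python package; the model name, the 0.7 threshold and the candidate corpus are illustrative assumptions standing in for the trained deep semantic model.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    candidates = ["这衣服质量差", "这衣服有人穿过"]   # illustrative matching corpus

    def integrate(filtered_text, threshold=0.7):
        query = model.encode(filtered_text, convert_to_tensor=True)
        corpus = model.encode(candidates, convert_to_tensor=True)
        scores = util.cos_sim(query, corpus)[0]
        best = int(scores.argmax())
        # fall back to the filtered text when no candidate is similar enough
        return candidates[best] if float(scores[best]) >= threshold else filtered_text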
Step S2032, replacing the second text information with the third text information and inputting the third text information into the speech synthesis model.
The second text information, which does not read smoothly after filtering, is replaced with the integrated, fluent third text information, and the third text information, together with the second sound attribute information, is input into the speech synthesis model, which synthesizes the second sound information.
In this embodiment, the step of performing speech synthesis in step S204 may be implemented by using speech synthesis software, or may be implemented by using a speech synthesis model based on a neural network.
When speech synthesis software is used, the second sound attribute information and the second text information simply need to be fed into the software together; the software synthesizes the two kinds of information and outputs the synthesized second sound information.
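A minimal sketch of this software-based synthesis step follows, using the pyttsx3 package; mapping the adjusted attribute values onto the engine's rate and volume properties is an illustrative assumption.

    import pyttsx3

    def synthesize(second_text, volume_db, out_path="edited_voice.wav"):
        engine = pyttsx3.init()
        # pyttsx3 expects volume in [0.0, 1.0]; scale the dB value coarsely
        engine.setProperty("volume", max(0.0, min(volume_db / 100.0, 1.0)))
        engine.setProperty("rate", 150)   # a calm, even speaking rate
        engine.save_to_file(second_text, out_path)
        engine.runAndWait()
        return out_path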
When a neural-network-based speech synthesis model is used, the step of performing speech synthesis on the second sound attribute information and the second text information includes:
and inputting the second sound attribute information and the second text information into a speech synthesis model to obtain output second sound information, wherein the speech synthesis model is trained based on the corresponding relation among the sample sound attribute information, the second text information and the sample synthesis audio, and the sample synthesis audio is audio generated according to the sample sound attribute information and the second text information.
The second sound attribute information is obtained by adjusting the sound channel parameters of the first sound attribute information, and the second text information is obtained by integrating the first text information after sensitive-information filtering. The second sound attribute information and the second text information, both obtained by editing the first sound information, are then synthesized to obtain the second sound information.
The second sound attribute information and the second text information are respectively input into the speech synthesis model, and the speech synthesis model outputs the synthesized second sound information. In this embodiment, the speech synthesis model is trained on the correspondence among sample sound attribute information, sample text information and sample synthesized audio: the model learns how sound attribute information and text information are combined in the synthesized audio, predicts and outputs the synthesis result for the input sound attribute information and text information, and after continued training produces better synthesis results. For example, after the sound information "Can the goods be returned?", uttered by a customer at more than twice the normal volume, is adjusted and filtered, "Can the goods be returned?" is reproduced in a voice with preset volume, pitch and timbre (such as the voice of Lin Shiling), giving a more pleasant listening effect.
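A minimal inference sketch of this step is given below; the model interface, the tensor shapes and the idea that the model returns the second sound information as acoustic features are assumptions for illustration (one possible model layout is sketched after the training description that follows).

```python
# Minimal inference sketch: the trained speech synthesis model is assumed to
# take a sound-attribute vector plus a token-id sequence and return the
# second sound information as acoustic features / audio.
import torch

def synthesize(model, second_attrs, second_text_ids):
    model.eval()
    with torch.no_grad():
        attrs = torch.as_tensor(second_attrs, dtype=torch.float32).unsqueeze(0)
        text = torch.as_tensor(second_text_ids, dtype=torch.long).unsqueeze(0)
        second_sound = model(attrs, text)       # (1, frames, features)
    return second_sound.squeeze(0)
```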
Specifically, the speech synthesis model is obtained by training a preset fifth neural network model, and the step of training the fifth neural network model to obtain the speech synthesis model includes:
A speech synthesis training set and a speech synthesis verification set are collected for training the speech synthesis model. The training set contains a plurality of sample sound attribute information, sample text information and sample synthesized audio for training; the verification set contains a plurality of sample sound attribute information, sample text information and sample synthesized audio for verification, and each verification sample carries identification information of the synthesized audio corresponding to its sample sound attribute information and sample text information.
The sample sound attribute information, sample text information and sample synthesized audio for training, together with those for verification, are respectively input into the fifth neural network model to obtain the synthesized audio output by the fifth neural network model, where the output synthesized audio is obtained by combining the sample sound attribute information and the sample text information.
The synthesized audio output by the fifth neural network model is compared with the sample synthesized audio in the verification set to obtain the error of this round of training, the parameters of the fifth neural network model are adjusted according to the error, and the input step is repeated until the training error falls within the preset range, at which point training is complete and the trained speech synthesis model is obtained.
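The sketch below shows one way the fifth neural network model could be set up and updated on (sample sound attribute information, sample text information, sample synthesized audio) triples; it reuses the same epoch/validation loop pattern as the deep semantic model sketch above, and the SynthesisModel architecture, the L1 loss on acoustic features and all tensor shapes are assumptions made purely for illustration.

```python
# One possible shape for the "fifth neural network model": it conditions the
# text sequence on the converted sound attributes and predicts acoustic
# features (here a mel spectrogram) matching the sample synthesized audio.
import torch
import torch.nn as nn

class SynthesisModel(nn.Module):
    def __init__(self, n_attrs=8, vocab_size=8000, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.attr_proj = nn.Linear(n_attrs, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, attrs, text_ids):
        # Add the projected sound attributes to every text frame.
        h = self.embed(text_ids) + self.attr_proj(attrs).unsqueeze(1)
        out, _ = self.rnn(h)
        return self.to_mel(out)                 # (batch, frames, n_mels)

def training_step(model, optimizer, attrs, text_ids, target_mel):
    """One update on a (sample attributes, sample text, sample audio) triple."""
    loss = nn.functional.l1_loss(model(attrs, text_ids), target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```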
In one possible implementation, the speech synthesis model of this embodiment is built on WaveNet, an autoregressive generative network that models recorded waveforms directly and can be used to convert one person's voice into the voice of another specified person.
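For reference, the block below sketches the core WaveNet idea, a stack of dilated causal 1-D convolutions applied directly to the waveform; it is a drastically simplified illustration, not the actual voice-conversion network of this embodiment.

```python
# Drastically simplified sketch of the WaveNet idea: dilated causal 1-D
# convolutions over the raw waveform with residual connections.
import torch
import torch.nn as nn

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, layers=6):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        self.dilated = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers))
        self.output = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, wave):                     # wave: (batch, 1, samples)
        x = self.input(wave)
        for conv in self.dilated:
            pad = conv.dilation[0]               # left-pad so the conv stays causal
            x = x + torch.tanh(conv(nn.functional.pad(x, (pad, 0))))
        return self.output(x)

out = TinyWaveNet()(torch.randn(1, 1, 16000))    # same length as the input waveform
```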
Furthermore, the voice editing device disclosed by the invention can also receive a question raised by a user and, after analyzing the question, return reply information, so the device can be applied to the field of voice interaction and is particularly suitable for information-processing scenarios that require a customer service robot.
As shown in fig. 3, the voice editing device is connected to a telephone and sits between the customer and the customer service agent. The voice information sent by the customer is transmitted through the telephone to the voice editing device provided in this embodiment, and the device recognizes the received voice information to obtain its sound attributes and text information. For example, when the customer says "please check the logistics information of order number xxx", the voice editing device obtains the sound attributes of the voice information and the words it contains. The device then adjusts the sound channel parameters of the acquired first sound attribute information so that they fall within the preset sound channel parameter range; if the volume of the voice information is too high, it is brought back into the preset volume range. The device also identifies whether the voice is male or female from the gender flag carried in the sound attributes; when the customer service agent is male and the customer is also male, the gender of the customer's voice can be adjusted to a female voice before it is sent to the agent, constructing a cross-gender communication scene so that the exchange between customer and agent runs more smoothly. At the same time, the voice editing device checks whether the words in the customer's voice information contain abusive content or politically sensitive terms, filters out uncivil statements and statements that violate laws and regulations, and passes civil and calm sentences on to the customer service agent.
Similarly, the voice information sent by the customer service agent can be transmitted to the voice editing device through the telephone; the device edits it in the same way and passes the edited sentences to the customer, so the customer likewise receives civil content delivered in a mild tone, achieving a harmonious communication atmosphere.
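The call-centre flow described above can be summarized as the following pipeline sketch; each function argument stands in for one of the models discussed earlier, and their names are illustrative assumptions.

```python
# End-to-end sketch of the flow of fig. 3. Each callable stands in for one of
# the models discussed above (recognition, attribute conversion, filtering,
# integration, synthesis); the names are assumptions, not patent terminology.
def edit_voice(first_sound_info, recognize, convert_attributes,
               filter_sensitive, integrate, synthesize):
    # Step S201: speech recognition -> first sound attributes + first text
    first_attrs, first_text = recognize(first_sound_info)
    # Step S202: bring the sound channel parameters (volume, pitch, gender
    # flag, ...) into the preset range -> second sound attribute information
    second_attrs = convert_attributes(first_attrs)
    # Step S203: drop abusive / sensitive words, then re-integrate the sentence
    second_text = filter_sensitive(first_text)
    third_text = integrate(second_text)
    # Step S204: synthesize the edited voice that the other party will hear
    return synthesize(second_attrs, third_text)
```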
Exemplary device
On the basis of the above method, the present invention also discloses a voice editing apparatus, as shown in fig. 4, including:
The voice recognition module 401 is configured to receive first sound information sent by a target user, perform voice recognition on the first sound information, and acquire the sound attribute information of the first sound information and the first text information contained in the first sound information; its function is as described in step S201.
The attribute information conversion module 402 is configured to input the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion, where the sound conversion model is trained based on the correspondence between input sound attribute information and target conversion sound attribute information; its function is as described in step S202.
The text filtering module 403 is configured to perform sensitive-information filtering on the first text information to obtain second text information; its function is as described in step S203.
The speech synthesis module 404 is configured to perform speech synthesis on the second sound attribute information and the second text information to obtain second sound information; its function is as described in step S204.
In an exemplary embodiment, the apparatus described above may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
On the basis of the above method, the invention also discloses a computer device comprising a memory and a processor, where the memory stores a computer program and the processor implements the steps of the above method when executing the computer program.
On the basis of the above method, the invention also discloses a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above method.
The invention provides a voice editing method and related equipment. First sound information sent by a target user is received and subjected to voice recognition, and the sound attribute information of the first sound information and the first text information contained in it are acquired; the first sound attribute information is input into a trained sound conversion model to obtain second sound attribute information after sound conversion, where the sound conversion model is trained based on the correspondence between input sound attribute information and target conversion sound attribute information; sensitive-information filtering is performed on the first text information to obtain second text information; and the second sound attribute information and the second text information are synthesized to obtain second sound information. By editing the voice information sent by the two parties to a communication, the method of the invention can adjust the emotional tone of the voice information and filter out the sensitive information it contains, so that both parties can complete their business smoothly without interference from emotional factors, improving work efficiency and reducing the waiting time of both parties.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for speech editing, comprising the steps of:
receiving first sound information, performing voice recognition on the first sound information, and acquiring first sound attribute information of the first sound information and first text information contained in the first sound information;
inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion, wherein the sound conversion model is trained on the basis of the corresponding relation between the input sound attribute information and the target conversion sound attribute information; the target conversion sound attribute information is sound attribute information obtained by adjusting the sound channel parameters of the input sound attribute information according to a preset sound channel parameter range;
performing sensitive information filtering on the first text information to obtain second text information;
and performing voice synthesis on the second sound attribute information and the second text information to obtain second sound information.
2. The voice editing method according to claim 1, wherein the step of performing voice recognition on the first sound information includes:
inputting the first voice information into a trained voice recognition model to obtain the first voice attribute information and the first text information; the voice recognition model is trained based on correspondence between input voice information, voice attribute information corresponding to the voice information, and text information included in the voice information.
3. The speech editing method according to claim 1 or 2, wherein the acoustic conversion model comprises: a voice analysis layer, a parameter prediction layer and an information conversion layer;
the step of inputting the first sound attribute information into a trained sound conversion model to obtain second sound attribute information after sound conversion comprises:
inputting the first sound attribute information into the voice analysis layer to obtain a sound channel parameter corresponding to the first sound attribute information output by the voice analysis layer;
inputting the sound channel parameters corresponding to the first sound attribute information into the parameter prediction layer, and outputting the adjusted sound channel parameters after the parameter prediction layer adjusts the first sound attribute information according to the preset sound channel parameter range;
and inputting the adjusted channel parameters into the information conversion layer to obtain the second sound attribute information output by the information conversion layer.
4. The speech editing method according to claim 3, wherein the step of inputting the channel parameter corresponding to the first sound attribute information into a parameter prediction layer, and obtaining the channel parameter after the parameter prediction layer adjusts the first sound attribute information according to a preset channel parameter range, and outputting the adjusted channel parameter comprises:
the parameter prediction layer receives the sound channel parameters and extracts gender identifiers contained in the sound channel parameters;
comparing the gender identification with a pre-stored receiver gender identification, if the gender identification is the same as the pre-stored receiver gender identification, adjusting the gender of the first sound attribute information within the range of the preset sound channel parameters, and outputting the adjusted sound channel parameters; the adjusted vocal tract parameters are the vocal tract parameters of the first sound attribute information after the gender change.
5. The speech editing method according to claim 3, wherein the step of inputting the channel parameter corresponding to the first sound attribute information into a parameter prediction layer, and obtaining the channel parameter after the parameter prediction layer adjusts the first sound attribute information according to a preset channel parameter range, and outputting the adjusted channel parameter comprises:
and the parameter prediction layer receives the sound channel parameters, extracts prosody parameters and audio parameters contained in the sound channel parameters, adjusts the prosody parameters and the audio parameters to be within the preset sound channel parameter range, and outputs the adjusted sound channel parameters.
6. The speech editing method of claim 1 or 2, wherein the step of sensitive information filtering the first text information comprises:
and inputting the first text information into a trained information filtering model to obtain second text information after filtering, wherein the information filtering model is trained on the basis of the corresponding relation between the text information and the marked sensitive information.
7. The voice editing method according to claim 6, wherein the step of inputting the first text information into a trained information filtering model to obtain filtered second text information further comprises:
inputting the second text information into a depth semantic model to obtain integrated third text information; the depth semantic model is trained on the basis of the corresponding relation between the first sample text information and the matching information; the matching information is text information of which the semantic matching degree with the first sample text information is greater than a preset threshold value;
inputting the third text information as the second text information into the speech synthesis model.
8. The voice editing method according to any one of claims 1 to 3, wherein the step of voice-synthesizing the second sound attribute information with the first text information includes:
and inputting the second sound attribute information and the second text information into a speech synthesis model to obtain output second sound information, wherein the speech synthesis model is trained based on the corresponding relation among the sample sound attribute information, the second text information and the sample synthesis audio, and the sample synthesis audio is audio generated according to the sample sound attribute information and the second text information.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN201910735271.4A 2019-08-09 2019-08-09 Voice editing method and related equipment Pending CN112349266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910735271.4A CN112349266A (en) 2019-08-09 2019-08-09 Voice editing method and related equipment

Publications (1)

Publication Number Publication Date
CN112349266A true CN112349266A (en) 2021-02-09

Family

ID=74366989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910735271.4A Pending CN112349266A (en) 2019-08-09 2019-08-09 Voice editing method and related equipment

Country Status (1)

Country Link
CN (1) CN112349266A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040249634A1 (en) * 2001-08-09 2004-12-09 Yoav Degani Method and apparatus for speech analysis
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN105763698A (en) * 2016-03-31 2016-07-13 深圳市金立通信设备有限公司 Voice processing method and terminal
CN106448665A (en) * 2016-10-28 2017-02-22 努比亚技术有限公司 Voice processing device and method
CN109065035A (en) * 2018-09-06 2018-12-21 珠海格力电器股份有限公司 Information interacting method and device
CN109256151A (en) * 2018-11-21 2019-01-22 努比亚技术有限公司 Call voice regulates and controls method, apparatus, mobile terminal and readable storage medium storing program for executing
CN109979473A (en) * 2019-03-29 2019-07-05 维沃移动通信有限公司 A kind of call sound processing method and device, terminal device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11488603B2 (en) * 2019-06-06 2022-11-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing speech
CN113033191A (en) * 2021-03-30 2021-06-25 上海思必驰信息科技有限公司 Voice data processing method, electronic device and computer readable storage medium
CN117409761A (en) * 2023-12-14 2024-01-16 深圳市声菲特科技技术有限公司 Method, device, equipment and storage medium for synthesizing voice based on frequency modulation
CN117409761B (en) * 2023-12-14 2024-03-15 深圳市声菲特科技技术有限公司 Method, device, equipment and storage medium for synthesizing voice based on frequency modulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination