CN114937457A - Audio information processing method and device and electronic equipment

Audio information processing method and device and electronic equipment

Info

Publication number
CN114937457A
Authority
CN
China
Prior art keywords
audio
voiceprint
audio information
information
extractor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210329461.8A
Other languages
Chinese (zh)
Inventor
郑渊中 (Zheng Yuanzhong)
叶峰 (Ye Feng)
朱小波 (Zhu Xiaobo)
疏北平 (Shu Beiping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiyue Information Technology Co Ltd
Original Assignee
Shanghai Qiyue Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiyue Information Technology Co Ltd filed Critical Shanghai Qiyue Information Technology Co Ltd
Priority to CN202210329461.8A
Publication of CN114937457A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiments of this specification provide an audio information processing method: sample audio information is obtained and used to jointly train a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer; audio information to be desensitized and target voiceprint audio information are then obtained, the audio content extractor extracts the audio content of the audio information to be desensitized, the voiceprint feature extractor extracts the target voiceprint features of the target voiceprint audio information, and the audio content and the target voiceprint features are input into the audio synthesizer, which processes them and outputs voiceprint-desensitized audio information. Through the joint training, the audio content extractor and the voiceprint feature extractor learn to extract audio content and voiceprint features accurately, and the audio synthesizer learns the logic of combining audio content and voiceprint features into synthesized audio; when the audio information to be desensitized and the target voiceprint audio information are processed, desensitized audio information carrying the target voiceprint is synthesized, the user's voiceprint information is hidden, and the user's information security is guaranteed.

Description

Audio information processing method and device and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular, to an audio information processing method and apparatus, and an electronic device.
Background
With the development of technology, voice, and voiceprint information in particular, has gradually come to be regarded as important user information and is widely used in identification technology, for example to verify a user's identity. However, the technology cuts both ways: it can invisibly expose the user information contained in speech, especially private user information, reducing the security of that information. At present, desensitizing a user's information means removing content such as identification numbers, names and addresses from call recordings, while the voiceprint information itself is often overlooked and left unprotected.
Disclosure of Invention
The embodiments of this specification provide an audio information processing method and apparatus and an electronic device, which are used to hide a user's voiceprint information and guarantee the user's information security.
An embodiment of the present specification provides an audio information processing method, including:
acquiring sample audio information, and performing joint training on a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer according to the sample audio information;
acquiring audio information to be desensitized and target voiceprint audio information, extracting audio content of the audio information to be desensitized by using the audio content extractor, and extracting target voiceprint characteristics of the target voiceprint audio information by using the voiceprint characteristic extractor;
and inputting the audio content and the target voiceprint characteristics into the audio synthesizer, and outputting voiceprint desensitization audio information after the processing of the audio synthesizer.
Optionally, the jointly training the pre-trained audio content extractor, the voiceprint feature extractor, and the audio synthesizer according to the sample audio information includes:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
Optionally, the calculating a deviation value according to the predicted audio frequency points and the sample audio information includes:
sampling the waveform of the sample audio information, calculating the difference between each sampling point and the corresponding predicted audio frequency point, and summing all the differences to obtain the deviation value.
Optionally, the method further comprises: pre-training an audio content extractor and a voiceprint feature extractor with the sample audio information, comprising:
setting an audio label for the sample audio information according to audio content in the sample audio information, and training an audio content extractor based on the audio label;
and setting different voiceprint labels according to different users to which different sample audio information belongs, and training a voiceprint feature extractor based on the voiceprint labels.
Optionally, the outputting the voiceprint desensitization audio information after the processing by the audio synthesizer includes:
and the audio synthesizer predicts audio frequency points of the audio content according to the frequency domain distribution of the target voiceprint characteristics and synthesizes voiceprint desensitization audio information according to the predicted audio frequency points.
Optionally, the method further comprises:
deploying the audio content extractor, the voiceprint feature extractor and the audio synthesizer obtained by joint training to a network access device;
the network access device acquires audio information to be desensitized sent by a terminal, desensitizes the audio information by using a deployed model, synthesizes voiceprint desensitized audio information and uploads the voiceprint desensitized audio information to a network.
An embodiment of the present specification further provides an audio information processing apparatus, including:
the joint training module is used for acquiring sample audio information and performing joint training on a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer according to the sample audio information;
the characteristic extraction module is used for acquiring audio information to be desensitized and target voiceprint audio information, extracting the audio content of the audio information to be desensitized by using the audio content extractor, and extracting target voiceprint characteristics in the target voiceprint audio information by using the voiceprint characteristic extractor;
and the audio synthesis module is used for inputting the audio content and the target voiceprint characteristics into the audio synthesizer, and outputting voiceprint desensitization audio information after the processing of the audio synthesizer.
Optionally, the jointly training the pre-trained audio content extractor, the voiceprint feature extractor, and the audio synthesizer according to the sample audio information includes:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
An embodiment of the present specification further provides an electronic device, where the electronic device includes:
a processor; and
a memory storing a computer executable program which, when executed, causes the processor to perform any of the methods described above.
The present specification also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement any of the above methods.
In the technical solutions provided in the embodiments of this specification, sample audio information is obtained and used to jointly train a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer; audio information to be desensitized and target voiceprint audio information are obtained, the audio content extractor extracts the audio content of the audio information to be desensitized, the voiceprint feature extractor extracts the target voiceprint features of the target voiceprint audio information, and the audio content and the target voiceprint features are input into the audio synthesizer, which processes them and outputs voiceprint-desensitized audio information. Through the joint training, the audio content extractor and the voiceprint feature extractor learn to extract audio content and voiceprint features accurately, and the audio synthesizer learns the logic of combining audio content and voiceprint features into synthesized audio; when the audio information to be desensitized and the target voiceprint audio information are processed, desensitized audio information carrying the target voiceprint is synthesized, the user's voiceprint information is hidden, and the user's information security is guaranteed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram illustrating an audio information processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the training principle of an audio information processing method provided in an embodiment of the present specification;
FIG. 3 is a schematic diagram of an audio information processing apparatus provided in an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification;
FIG. 5 is a schematic diagram of a computer-readable medium provided in an embodiment of the present specification.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals denote the same or similar elements, components, or parts in the drawings, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described for a particular embodiment may be combined in a suitable manner in one or more other embodiments, provided this remains consistent with the technical idea of the invention.
In describing particular embodiments, features, structures, characteristics or other details are set forth so that those skilled in the art can thoroughly understand the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of these specific features, structures, characteristics or other details.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The term "and/or" includes all combinations of any one or more of the associated listed items.
Fig. 1 is a schematic diagram of an audio information processing method provided in an embodiment of the present disclosure, where the method may include:
s101: and acquiring sample audio information, and performing combined training on the pre-trained audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the sample audio information.
In the embodiments of the present disclosure, audio content includes not only text information but also other audio information such as pauses, speaking volume, pitch and speaking tone; moreover, the voiceprint features in the embodiments of the present disclosure include not only timbre features but also accent features, such as the Beijing retroflex ending or the merging of "f" and "h" typical of Fujian accents. In the prior art, voice calls are generally translated directly into text; the propagation of information loss is basically not considered, the accuracy requirement on the intermediate result is extremely high, and if the environment is noisy or the speaker's Mandarin is non-standard, information loss arises and keeps amplifying as it propagates. In the embodiments of the present disclosure, in order to avoid this propagation of information loss, factors such as the volume of the audio content and the accent of the voice are taken into account, effectively reducing information loss.
In order to desensitize the audio, hide the real voiceprint information and change the timbre of the voice, a tool that can synthesize brand-new audio from the voiceprint features and the audio content in audio, namely an audio synthesizer, can be constructed.
Training the audio synthesizer requires the voiceprint features and the audio content, but in the sample audio information the two are fused together, so before training the audio synthesizer, a tool that can separately extract (or separate out) the voiceprint features and the audio content needs to be constructed.
Therefore, a tool for extracting audio content from sample audio, namely an audio content extractor, and a tool for extracting voiceprint features from sample audio, namely a voiceprint feature extractor, can be trained in advance. The audio content extractor and the voiceprint feature extractor can then be used as the input layers of the audio synthesizer for joint training. The composite model finally obtained in this way contains an audio content extractor, a voiceprint feature extractor and an audio synthesizer, as sketched below.
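To make this architecture concrete, the following minimal PyTorch sketch wires the two extractors into the input layer of a synthesizer as one composite module. All class names, layer types and dimensions are illustrative assumptions for exposition only; the patent does not specify any network structure.

```python
import torch
import torch.nn as nn

class ContentExtractor(nn.Module):               # Ec: audio features -> audio content
    def __init__(self, n_mels=80, content_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        content, _ = self.rnn(mel)
        return content                           # (batch, frames, content_dim)

class VoiceprintExtractor(nn.Module):            # Es: audio features -> voiceprint vector
    def __init__(self, n_mels=80, vp_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, vp_dim, batch_first=True)

    def forward(self, mel):
        out, _ = self.rnn(mel)
        return out.mean(dim=1)                   # time-averaged voiceprint (batch, vp_dim)

class Synthesizer(nn.Module):                    # D: (content, voiceprint) -> audio points
    def __init__(self, content_dim=256, vp_dim=128, hop=256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(content_dim + vp_dim, 512), nn.ReLU(),
            nn.Linear(512, hop))                 # predicts `hop` audio points per frame

    def forward(self, content, voiceprint):
        vp = voiceprint.unsqueeze(1).expand(-1, content.size(1), -1)
        frames = self.decoder(torch.cat([content, vp], dim=-1))
        return frames.flatten(1)                 # (batch, frames * hop) predicted waveform

class Desensitizer(nn.Module):                   # the composite model trained jointly
    def __init__(self):
        super().__init__()
        self.ec = ContentExtractor()
        self.es = VoiceprintExtractor()
        self.d = Synthesizer()

    def forward(self, content_mel, voiceprint_mel):
        return self.d(self.ec(content_mel), self.es(voiceprint_mel))
```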
Before training, audio information of a large number of users can be collected as sample audio information, and then the sample audio information is labeled for training.
The audio information may come from various sources: existing audio in a database, audio temporarily collected by a service platform, or audio sent by a user terminal to the point of presence.
Specifically, in an embodiment of the present specification, the method further includes: pre-training an audio content extractor and a voiceprint feature extractor with the sample audio information, comprising:
setting an audio label for the sample audio information according to audio content in the sample audio information, and training an audio content extractor based on the audio label;
and setting different voiceprint labels according to different users to which different sample audio information belongs, and training a voiceprint feature extractor based on the voiceprint labels.
The audio label may be set from manually recognized audio content, or the audio content may be obtained through programmatic semantic recognition; once the audio content is obtained, the label is set according to it.
In general, different audio from the same user has the same timbre, while different users have different voiceprints (including timbre). Therefore the same voiceprint label can be set for different audio of the same user and different voiceprint labels for the audio of different users; setting the same voiceprint label on several pieces of sample audio information from the same user that differ in semantics, speech rate and loudness improves sample diversity. A minimal sketch of this labeling scheme follows.
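In the sketch, the file names and the `transcribe` helper are hypothetical stand-ins; in practice the content label could come from manual annotation or from programmatic semantic recognition as described above.

```python
# Hypothetical sample table: (audio file, speaker). Several clips per speaker,
# differing in semantics, speech rate and loudness, share one voiceprint label.
samples = [("alice_01.wav", "alice"),
           ("alice_02.wav", "alice"),
           ("bob_01.wav", "bob")]

def make_labels(samples, transcribe):
    labeled = []
    for path, speaker in samples:
        audio_label = transcribe(path)    # content label: manual or ASR transcript
        voiceprint_label = speaker        # same user -> same voiceprint label
        labeled.append((path, audio_label, voiceprint_label))
    return labeled
```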
The audio content extractor and the voiceprint feature extractor can convert audio information from its time-domain form into a frequency-domain form. The frequency-domain audio information can be recorded as feature vectors, several feature vectors form a feature matrix, and feature extraction is performed on this matrix, so that the audio content extractor and the voiceprint feature extractor can accurately learn, from the frequency-domain representation, the content features and the timbre attributes of the audio respectively.
The process of converting the audio information into the feature matrix can be regarded as encoding, and the process of generating the audio by the feature matrix can be regarded as decoding.
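As an illustration of this encode/decode view, the sketch below uses a short-time Fourier transform to turn a time-domain waveform into a frequency-domain feature matrix (one feature vector per frame) and Griffin-Lim to invert it. These are stand-in transforms chosen for the example; the patent does not name the actual front end of its extractors.

```python
import torch
import torchaudio

def encode(waveform, n_fft=1024, hop=256):
    """Time-domain audio -> frequency-domain feature matrix (frames x bins)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs().T                   # rows are per-frame feature vectors

def decode(feature_matrix, n_fft=1024, hop=256):
    """Feature matrix -> audio; Griffin-Lim reconstructs the discarded phase."""
    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=n_fft, hop_length=hop, power=1.0)  # power=1.0: magnitude spectrogram
    return griffin_lim(feature_matrix.T)  # back to a time-domain waveform
```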
In order to enable the audio synthesizer to accurately learn the processing logic of the synthesized audio and further optimize the audio content extractor and the voiceprint feature extractor, the audio content extractor, the voiceprint feature extractor and the audio synthesizer are jointly trained in the embodiments of the present specification.
In this embodiment, the audio synthesizer may specifically be a distribution model; it may contain a decoder for converting matrix-form data into audio, in which the audio frequency points are predicted according to the required frequency-domain distribution.
Since the inputs of the audio synthesizer are audio content and voiceprint features, the audio content extractor, the voiceprint feature extractor and the audio synthesizer can be associated with each other, with the outputs of the audio content extractor and the voiceprint feature extractor serving as the inputs of the audio synthesizer.
During the joint training, iteration can be driven by the accuracy of the audio synthesized by the audio synthesizer, and this accuracy can be represented by a deviation value calculated from the predicted audio frequency points and the sample audio information.
Therefore, in this embodiment of the present specification, the jointly training the pre-trained audio content extractor, the voiceprint feature extractor, and the audio synthesizer according to the sample audio information includes:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
Through the joint training, the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer are continuously optimized; finally, the audio content extractor can accurately extract audio content, the voiceprint feature extractor can accurately extract voiceprint features, and the audio synthesizer can accurately synthesize audio from the extracted audio content and voiceprint features. One joint-training iteration might look like the sketch below.
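A joint-training loop over the composite `Desensitizer` sketched earlier could then be written as follows. The optimizer, learning rate and stopping threshold are illustrative assumptions; during joint training the same sample feeds both extractors so the synthesizer's output can be compared against that sample.

```python
import torch

model = Desensitizer()                      # Ec + Es + D from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
THRESHOLD = 0.05                            # assumed stopping threshold

def joint_training_step(mel, waveform):
    """One iteration: predict audio points, compute deviation, adjust all three models."""
    predicted = model(mel, mel)             # same sample into Ec and Es during training
    n = min(predicted.size(1), waveform.size(1))
    deviation = (predicted[:, :n] - waveform[:, :n]).abs().sum(dim=1).mean()
    optimizer.zero_grad()
    deviation.backward()                    # one deviation value updates Ec, Es and D
    optimizer.step()
    return deviation.item()

def joint_training(loader):
    for mel, waveform in loader:            # iterate until deviation under threshold
        if joint_training_step(mel, waveform) < THRESHOLD:
            break
```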
Calculating the deviation value essentially evaluates how similar the synthesized audio and the sample audio are in content and timbre.
The deviation value can be calculated in various ways, for example by subtracting the waveform of the synthesized audio from that of the sample audio, or by converting both to the frequency domain before subtracting, and then quantizing the difference into a numerical value.
Specifically, in this embodiment, calculating a deviation value from the predicted audio frequency points and the sample audio information may include:
sampling the waveform of the sample audio information, calculating the difference between each sampling point and the corresponding predicted audio frequency point, and summing all the differences to obtain the deviation value.
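A minimal numeric sketch of this deviation calculation, assuming the predicted audio points are aligned with the sample waveform and using absolute differences (the patent says only "difference value", so taking the absolute value is an assumption):

```python
import numpy as np

def deviation_value(sample_waveform, predicted_points, num_samples=1000):
    """Sample the reference waveform, diff each point against its prediction, sum."""
    idx = np.linspace(0, len(sample_waveform) - 1, num_samples).astype(int)
    diffs = np.abs(sample_waveform[idx] - predicted_points[idx])
    return diffs.sum()                      # a single scalar deviation value
```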
S102: obtaining audio information to be desensitized and target voiceprint audio information, extracting the audio content of the audio information to be desensitized by using the audio content extractor, and extracting the target voiceprint characteristics of the target voiceprint audio information by using the voiceprint characteristic extractor.
The joint training phase uses sample audio information; in the use phase, however, the final purpose is to give the synthesized audio a specific voiceprint feature, namely the target voiceprint feature, so the audio information to be desensitized and the target voiceprint audio information need to be acquired.
The target voiceprint audio information may specifically be audio in a human customer-service timbre or in a robot timbre, which is not specifically illustrated or limited here.
The audio information to be desensitized is input into the audio content extractor, which extracts its audio content.
The target voiceprint audio information is input into the voiceprint feature extractor, which extracts the voiceprint features in it, namely the target voiceprint features.
S103: and inputting the audio content and the target voiceprint characteristics into the audio synthesizer, and outputting voiceprint desensitization audio information after the processing of the audio synthesizer.
In the method, sample audio information is obtained and used to jointly train a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer; audio information to be desensitized and target voiceprint audio information are obtained, the audio content extractor extracts the audio content of the audio information to be desensitized, the voiceprint feature extractor extracts the target voiceprint features of the target voiceprint audio information, and the audio content and the target voiceprint features are input into the audio synthesizer, which processes them and outputs voiceprint-desensitized audio information. Through the joint training, the audio content extractor and the voiceprint feature extractor learn to extract audio content and voiceprint features accurately, and the audio synthesizer learns the logic of combining them into synthesized audio; when the audio information to be desensitized and the target voiceprint audio information are processed, desensitized audio information carrying the target voiceprint is synthesized, the user's voiceprint information is hidden, and the user's information security is guaranteed.
When the audio synthesizer synthesizes from the audio content of the audio information to be desensitized and the target voiceprint features, the synthesized audio carries both the content of the audio information to be desensitized and the target voiceprint. The original voiceprint of the audio information to be desensitized is thus well hidden and audio desensitization is achieved: the user's real voiceprint information in the audio information to be desensitized is concealed, and the user's information security is guaranteed. The audio information other than the text information in the desensitized audio is preserved to the greatest extent. Meanwhile, the technical solution disclosed in this application does not require a voiceprint/timbre library to be stored or built in advance, saving considerable cost.
The specific synthesis process can be regarded as the reverse of feature extraction: it essentially predicts audio frequency points according to a distribution model, so that the frequency-domain distribution of the predicted audio frequency points is consistent with that of the target voiceprint.
Specifically, in this embodiment of the present specification, the outputting the voiceprint desensitization audio information after being processed by the audio synthesizer includes:
and the audio synthesizer predicts audio frequency points of the audio content according to the frequency domain distribution of the target voiceprint characteristics and synthesizes voiceprint desensitization audio information according to the predicted audio frequency points.
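In code, the use phase differs from training only in where the two inputs come from: the content is taken from the audio to be desensitized and the voiceprint from the target voiceprint audio. Continuing the illustrative `Desensitizer` sketch from above:

```python
import torch

def desensitize(model, to_desensitize_mel, target_voiceprint_mel):
    """Synthesize audio that keeps the original content but the target voiceprint."""
    model.eval()
    with torch.no_grad():
        content = model.ec(to_desensitize_mel)       # content of the audio to desensitize
        target_vp = model.es(target_voiceprint_mel)  # target voiceprint features
        return model.d(content, target_vp)           # voiceprint-desensitized audio points
```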
After the joint training, the trained model can be deployed on a service platform, where desensitization is performed; it can also be deployed on a network access device that accesses the Internet, so that the client's audio is desensitized before being uploaded.
Therefore, in the embodiment of the present specification, the method may further include:
deploying the audio content extractor, the voiceprint feature extractor and the audio synthesizer obtained by joint training to a network access device;
the network access device acquires audio information to be desensitized sent by a terminal, desensitizes the audio information by using a deployed model, synthesizes voiceprint desensitized audio information and uploads the voiceprint desensitized audio information to a network.
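A deployment of this kind could look like the hypothetical access-device loop below; `receive_from_terminal`, `upload_to_network` and the fixed target-voiceprint features are stand-ins for whatever transport and configuration the device actually uses.

```python
def gateway_loop(model, receive_from_terminal, upload_to_network, target_vp_mel):
    """Runs on the network access device: desensitize every clip before upload."""
    while True:
        raw_mel = receive_from_terminal()     # audio to be desensitized, from a terminal
        if raw_mel is None:                   # hypothetical end-of-stream signal
            break
        safe_audio = desensitize(model, raw_mel, target_vp_mel)
        upload_to_network(safe_audio)         # only desensitized audio reaches the network
```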
FIG. 2 is a schematic diagram of an audio information processing method according to an embodiment of the present disclosure, showing the principle of training the voiceprint feature converter.
Before training, the structures of the audio content extractor (Ec), the voiceprint feature extractor (Es) and the audio synthesizer (D) are constructed. The audio content extractor and the voiceprint feature extractor may contain an encoder for converting audio into matrix form, and the audio synthesizer contains a decoder for finally converting the computed matrix back into audio form.
Then, sample audio information X₁, Z₁, U₁, ... is acquired; audio labels are set according to the sample audio information, voiceprint labels are set according to the timbre characteristics (including but not limited to timbre and breath) of the sample audio information, and the audio content extractor (Ec) and the voiceprint feature extractor (Es) are trained separately with the labeled sample audio information.
After pre-training, the output ends of the audio content extractor (Ec) and the voiceprint feature extractor (Es) are respectively connected with the input end of the audio synthesizer (D) for joint training.
Specifically, the audio content extractor (Ec) outputs audio content C1, the voiceprint feature extractor (Es) outputs voiceprint features S1, and the audio synthesizer (D) synthesizes audio from the audio content C1 and the voiceprint features S1; the synthesized audio is recorded as X̂₁. The synthesized audio X̂₁ and the sample audio information X₁ are evaluated and quantized (usually by taking a difference); the quantized deviation value reflects whether the synthesis capability of the audio synthesizer meets the preset requirement, and if not, the parameters of the audio content extractor (Ec), the voiceprint feature extractor (Es) and the audio synthesizer (D) are adjusted.
And then, the audio content extractor (Ec), the voiceprint feature extractor (Es) and the audio synthesizer (D) after the parameters are adjusted continue to carry out iterative training until the calculated deviation value is smaller than the threshold value.
Unlike the pre-training stage, in the joint training stage the parameters of the audio content extractor (Ec) and the voiceprint feature extractor (Es) are adjusted according to the content and timbre differences between the audio synthesized by the audio synthesizer (D) and the sample audio information, rather than directly according to the outputs of the two extractors. This lets the audio content extractor (Ec) and the voiceprint feature extractor (Es) adapt to the audio synthesizer (D), so that the jointly trained composite model is most conducive to accurate audio synthesis. Training the models as a whole by joint training presents them externally as a single desensitization model: speech is input into the desensitization model, and the desensitized speech is output directly. The text content of the audio no longer goes through speech recognition (transcription of the text content) and subsequent speech synthesis, which gives the audio content a certain fault tolerance.
When audio information processing is performed, sample audio is no longer input to the audio content extractor (Ec) and the voiceprint feature extractor (Es); instead, the audio information to be desensitized is input to the audio content extractor (Ec) and the target voiceprint audio information is input to the voiceprint feature extractor (Es). The two extractors extract the audio content C1 and the voiceprint features S1 respectively and pass them to the audio synthesizer (D), which synthesizes and outputs the audio X̂₁. That is, the audio content is expressed with the target voiceprint, desensitization of the audio information is achieved, and the voiceprint information of the user corresponding to the audio information to be desensitized is well hidden.
Further, the desensitized audio information may be provided to various speech processing software for processing and application, such as automatic speech recognition (ASR), automatic speaker verification (ASV), and the like.
Fig. 3 is a schematic structural diagram of an audio information processing apparatus provided in an embodiment of the present specification, where the apparatus may include:
the joint training module 301 is configured to acquire sample audio information, and perform joint training on a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer according to the sample audio information;
a feature extraction module 302, configured to obtain audio information to be desensitized and target voiceprint audio information, and to extract audio content of the audio information to be desensitized by using the audio content extractor, and extract a target voiceprint feature in the target voiceprint audio information by using the voiceprint feature extractor;
and the audio synthesis module 303 is configured to input the audio content and the target voiceprint feature into the audio synthesizer, and output voiceprint desensitization audio information after processing by the audio synthesizer.
Wherein, the joint training of the pre-trained audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the sample audio information comprises:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
The apparatus acquires sample audio information and uses it to jointly train a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer; it acquires audio information to be desensitized and target voiceprint audio information, extracts the audio content of the audio information to be desensitized with the audio content extractor, extracts the target voiceprint features of the target voiceprint audio information with the voiceprint feature extractor, inputs the audio content and the target voiceprint features into the audio synthesizer, and outputs voiceprint-desensitized audio information after processing by the audio synthesizer. Through the joint training, the audio content extractor and the voiceprint feature extractor learn to extract audio content and voiceprint features accurately, and the audio synthesizer learns the logic of combining them into synthesized audio; when the audio information to be desensitized and the target voiceprint audio information are processed, desensitized audio information carrying the target voiceprint is synthesized, the user's voiceprint information is hidden, and the user's information security is guaranteed.
Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification. An electronic device 400 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one memory unit 420, a bus 430 that couples various system components including the memory unit 420 and the processing unit 410, a display unit 440, and the like.
Wherein the storage unit stores program code executable by the processing unit 410 to cause the processing unit 410 to perform steps according to various exemplary embodiments of the present invention described in the above-mentioned processing method section of the present specification. For example, the processing unit 410 may perform the steps as shown in fig. 1.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203.
The memory unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices 500 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through the network adapter 460. The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention. The computer program, when executed by a data processing apparatus, enables the computer readable medium to implement the above-described method of the invention, namely: such as the method shown in fig. 1.
Fig. 5 is a schematic diagram of a computer-readable medium provided in an embodiment of the present specification.
A computer program implementing the method shown in fig. 1 may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in embodiments in accordance with the invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing embodiments describe the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be embraced.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An audio information processing method, characterized by comprising:
acquiring sample audio information, and performing joint training on a pre-trained audio content extractor, a voiceprint feature extractor and an audio synthesizer according to the sample audio information;
acquiring audio information to be desensitized and target voiceprint audio information, extracting audio content of the audio information to be desensitized by using the audio content extractor, and extracting target voiceprint characteristics of the target voiceprint audio information by using the voiceprint characteristic extractor;
and inputting the audio content and the target voiceprint characteristics into the audio synthesizer, and outputting voiceprint desensitization audio information after the processing of the audio synthesizer.
2. The method of claim 1, wherein jointly training a pre-trained audio content extractor, a voiceprint feature extractor, and an audio synthesizer based on the sample audio information comprises:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
3. The method of any of claims 1-2, wherein calculating a deviation value according to the predicted audio frequency points and the sample audio information comprises:
sampling the waveform of the sample audio information, calculating the difference between each sampling point and the corresponding predicted audio frequency point, and summing all the differences to obtain the deviation value.
4. The method according to any one of claims 1-3, further comprising:
setting an audio label for the sample audio information according to audio content in the sample audio information, and training an audio content extractor based on the audio label;
and setting different voiceprint labels according to different users to which different sample audio information belongs, and training a voiceprint feature extractor based on the voiceprint labels.
5. The method according to any of claims 1-4, wherein outputting voiceprint desensitization audio information after processing by the audio synthesizer comprises:
and the audio synthesizer predicts audio frequency points of the audio content according to the frequency domain distribution of the target voiceprint characteristics and synthesizes voiceprint desensitization audio information according to the audio frequency points.
6. The method according to any one of claims 1-5, further comprising:
deploying the pre-trained audio content extractor, the voiceprint feature extractor, and the audio synthesizer at a network access device;
the network access device acquires audio information to be desensitized sent by a terminal, desensitizes the audio information by using a deployed model, synthesizes voiceprint desensitized audio information and uploads the voiceprint desensitized audio information to a network.
7. An audio information processing apparatus characterized by comprising:
the joint training module is used for acquiring sample audio information and performing joint training on the pre-trained audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the sample audio information;
the characteristic extraction module is used for acquiring audio information to be desensitized and target voiceprint audio information, extracting the audio content of the audio information to be desensitized by using the audio content extractor, and extracting target voiceprint characteristics in the target voiceprint audio information by using the voiceprint characteristic extractor;
and the audio synthesis module is used for inputting the audio content and the target voiceprint characteristics into the audio synthesizer, and outputting voiceprint desensitization audio information after the processing of the audio synthesizer.
8. The apparatus according to claim 7, wherein the joint training module is specifically configured to:
extracting audio content from the sample audio information using the audio content extractor;
extracting voiceprint features from the sample audio information using the voiceprint feature extractor;
the audio synthesizer predicts audio frequency points according to the audio content and the voiceprint characteristics;
calculating a deviation value from the predicted audio frequency points and the sample audio information, adjusting the parameters of the audio content extractor, the voiceprint feature extractor and the audio synthesizer according to the deviation value, and iterating the training until the calculated deviation value is smaller than a threshold value.
9. An electronic device, wherein the electronic device comprises:
a processor; and
a memory storing a computer executable program that, when executed, causes the processor to perform the method of any of claims 1-6.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202210329461.8A 2022-03-31 2022-03-31 Audio information processing method and device and electronic equipment Pending CN114937457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210329461.8A CN114937457A (en) 2022-03-31 2022-03-31 Audio information processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114937457A true CN114937457A (en) 2022-08-23

Family

ID=82863058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210329461.8A Pending CN114937457A (en) 2022-03-31 2022-03-31 Audio information processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114937457A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination