CN113012706B - Data processing method and device and electronic equipment

Data processing method and device and electronic equipment

Info

Publication number
CN113012706B
Authority
CN
China
Prior art keywords
audio, image, audio data, data, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110189853.4A
Other languages
Chinese (zh)
Other versions
CN113012706A (en)
Inventor
师圣
杜杨洲
杨琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202110189853.4A priority patent/CN113012706B/en
Publication of CN113012706A publication patent/CN113012706A/en
Application granted
Publication of CN113012706B publication patent/CN113012706B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L21/14 Transforming into visible information by displaying frequency domain information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a data processing method, a data processing device and an electronic device. The method includes: acquiring first audio data and converting it to obtain a first audio image; generating a second audio image based on the first audio image; and processing the audio feature information corresponding to the second audio image to obtain second audio data. Through this audio-image conversion, audio data with the same semantics but different audio attribute characteristics is generated from existing audio data, reducing the time cost and difficulty of data collection.

Description

Data processing method and device and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, and an electronic device.
Background
With the rapid development of artificial intelligence, deep learning is widely applied in people's work and life. Deep learning performance grows with the amount of training data; however, data for deep learning is often difficult to collect, especially small-sample data. For example, when deep learning is applied to audio data, if dialect audio data needs to be learned, the regional specificity of dialects leads to long collection periods and high time cost, and the small number of samples makes models trained by deep learning inaccurate at recognizing dialect speech.
Disclosure of Invention
In view of this, the present application provides the following technical solutions:
a data processing method, comprising:
acquiring first audio data;
converting the first audio data to obtain a first audio image;
generating a second audio image based on the first audio image;
and processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics, and the audio attribute characteristics of the first audio data and the second audio data are different.
Optionally, the generating a second audio image based on the first audio image includes:
processing the first audio image based on an image conversion model to obtain a second audio image;
the image conversion model is used for extracting the audio characteristics to be converted in the first audio image and generating the second audio image based on the audio characteristics, wherein the second audio image and the first audio image have different image characteristics.
Optionally, the method further comprises:
obtaining a first sample set comprising a number of first images having image features of a first audio image and a number of second images having image features of a second audio image;
and performing unsupervised training on the initial neural network model by using the first sample set to obtain an image conversion model.
Optionally, the image conversion model comprises a cycle-generative adversarial network comprising a first generative adversarial network and a second generative adversarial network, wherein,
the first generative adversarial network is used for extracting the audio features to be converted in the first audio image and generating the second audio image based on the audio features;
the second generative adversarial network is used for detecting whether the restored image of the second audio image is consistent with the first audio image.
Optionally, the method further comprises:
obtaining a second sample set, wherein the second sample set comprises a plurality of groups of image samples, each group of image samples comprises a third image with image characteristics of a first audio image and a fourth image with image characteristics of a second audio image, and the fourth image corresponds to the third image;
and performing supervised training on the initial neural network model based on the second sample set to obtain an image conversion model.
Optionally, the method further comprises:
generating an audio data training sample based on the first audio data and the second audio data;
and training the model based on the audio data training sample to obtain an audio recognition model, wherein the audio recognition model is used for recognizing the audio data with different audio attribute characteristics.
Optionally, the method further comprises:
acquiring audio data to be identified;
and inputting the audio data to be identified into the audio identification model to obtain an audio identification result, wherein the audio identification result comprises audio identification content and audio attribute characteristics.
Optionally, the method further comprises:
and if the audio identification content is not unique, determining target audio identification content in the audio identification content based on the audio attribute characteristics.
A data processing apparatus comprising:
an acquisition unit configured to acquire first audio data;
the conversion unit is used for converting the first audio data to obtain a first audio image;
a generation unit configured to generate a second audio image based on the first audio image;
the processing unit is used for processing the audio characteristic information corresponding to the second audio image to obtain second audio data, the first audio data and the second audio data have the same semantics, and the audio attribute characteristics of the first audio data and the second audio data are different.
An electronic device, the electronic device comprising:
a memory for storing an application program and data generated by the operation of the application program;
a processor for executing the application program to realize:
acquiring first audio data;
converting the first audio data to obtain a first audio image;
generating a second audio image based on the first audio image;
and processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics, and the audio attribute characteristics of the first audio data and the second audio data are different.
As can be seen from the above technical solutions, the present application discloses a data processing method, a data processing device, and an electronic device: first audio data is acquired and converted to obtain a first audio image; a second audio image is generated based on the first audio image; and the audio feature information corresponding to the second audio image is processed to obtain second audio data. Through this audio-image conversion, audio data with the same semantics but different audio attribute characteristics is generated from existing audio data, reducing the time cost and difficulty of data collection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a cycle-generative adversarial network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of one unidirectional generative adversarial network within a cycle-generative adversarial network according to an embodiment of the present application;
fig. 4 is a schematic diagram of a corpus amplification application scenario provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The embodiment of the application provides a data processing method mainly used for processing audio data; it can process existing audio data or audio data acquired in real time. Correspondingly, the data processing method runs on an electronic device, which may be a terminal or a server, so that the electronic device can convert the audio data to be processed into the corresponding target audio data.
Referring to fig. 1, a flow chart of a data processing method provided in an embodiment of the present application is shown, where the method may include the following steps:
s101, acquiring first audio data.
The first audio data is the audio data to be processed. It can be a piece of audio stored locally or in the cloud, audio recorded by a user through a terminal with a recording function, or audio acquired in real time via network transmission. The first audio data may be a single piece of audio or an audio set to be processed; that is, when the audio set includes a plurality of audio segments, each audio segment may serve as the first audio data, or the whole set may serve as the first audio data.
S102, converting the first audio data to obtain a first audio image.
The first audio image may be a waveform diagram corresponding to the first audio data, or a spectrogram corresponding to the first audio data. Taking the spectrogram as an example, the first audio data in the time domain can be converted into a spectrogram in the frequency domain. That is, since the first audio data to be processed is a time-domain audio signal, Fourier transformation can be performed on the audio frames of the time-domain signal to obtain a frequency-domain audio signal, and a corresponding filter bank is then applied to obtain the spectrogram; for example, a mel filter bank can be used to filter the frequency-domain signal to obtain a mel spectrogram. The specific way of converting audio data into an audio image is not limited here, as long as the converted audio image can represent the corresponding audio data.
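As a concrete illustration of S102, the following is a minimal sketch of the audio-to-spectrogram conversion, assuming the librosa library and typical parameter values (sample rate, FFT size, mel-band count); none of these choices are fixed by this application:

```python
# Sketch of S102: first audio data (time domain) -> mel spectrogram ("first audio image").
# librosa and all parameter values are assumptions, not requirements of this application.
import librosa
import numpy as np

y, sr = librosa.load("first_audio.wav", sr=16000)       # time-domain audio signal
mel = librosa.feature.melspectrogram(                   # framing + Fourier transform
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)  # + mel filter bank
mel_db = librosa.power_to_db(mel, ref=np.max)           # log scale, usable as an image
```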
S103, generating a second audio image based on the first audio image.
The first audio image and the second audio image have different image features. The image features mainly indicate that the image styles of the audio images differ, that is, the audio attributes represented by the first audio image and the second audio image are different; for example, the audio data corresponding to the first audio image may be audio with a Mandarin attribute, and the audio data corresponding to the second audio image may be audio with a Cantonese attribute.
In one embodiment of the present application, image features in the first audio image may be extracted, the extracted image features converted into those of the target attribute, and the converted image features then processed to obtain the second audio image. In another approach, the conversion of the first audio image into the second audio image may be implemented based on a pre-trained neural network model. Specific implementations are described in subsequent embodiments of this application and are not detailed here.
S104, processing the audio features corresponding to the second audio image to obtain second audio data.
The second audio image may be a waveform diagram or a spectrogram. Audio features that can be processed into audio data may be extracted from the second audio image, and these features are then processed to obtain the audio data; for example, a digital audio editor may be employed to convert the spectrogram into audio data. The audio features differ for different kinds of audio image, such as energy points in a waveform diagram, or intensity information and phase information in a spectrogram.
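Conversely, for S104, a minimal sketch of recovering audio from the converted spectrogram follows. This application does not name a reconstruction algorithm, so Griffin-Lim phase estimation (as wrapped by librosa) is assumed here; mel_converted stands for the mel spectrogram of the second audio image, and the parameters must match the forward conversion:

```python
# Sketch of S104: second audio image (mel spectrogram) -> second audio data.
# Griffin-Lim phase reconstruction is an assumed choice, not specified by this application.
import librosa

second_audio = librosa.feature.inverse.mel_to_audio(
    mel_converted,                        # mel spectrogram of the second audio image
    sr=16000, n_fft=1024, hop_length=256  # must match the forward transform of S102
)
```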
In the embodiment of the application, the first audio data and the second audio data have the same semantics, while their audio attribute features differ. The same semantics means that the content of the first audio data and the second audio data is identical; different audio attribute features mean that the presentation styles of the audio differ, such as different tones, timbres, or language types. For example, the content corresponding to both the first audio data and the second audio data is "hello": the first audio data corresponds to Mandarin, with the pronunciation "nǐ hǎo", while the second audio data corresponds to Cantonese, with the pronunciation "lei hou".
Through the steps, the second audio data corresponding to the existing first audio data can be generated, so that the conversion of the data with different audio attribute characteristics can be realized on the basis of the original data, and the expansion and the processing of the data are facilitated.
According to the data processing method provided by the embodiment of the application, first audio data is acquired and converted to obtain a first audio image; a second audio image is generated based on the first audio image; and the audio feature information corresponding to the second audio image is processed to obtain second audio data. Through this audio-image conversion, audio data with the same semantics but different audio attribute characteristics is generated from existing audio data, reducing the time cost and difficulty of data collection.
In one implementation of the embodiments of the present application, the conversion of the first audio image into the second audio image may be implemented based on an image conversion model. The image conversion model is used for extracting audio features to be converted in the first audio image and generating a second audio image based on the audio features to be converted, wherein the second audio image has different image features from the first audio image.
The image conversion model is a neural network model, and in the embodiment of the application, the image conversion model can be created by performing supervised training on the neural network model or performing unsupervised training.
In one embodiment, generating the image conversion model includes:
obtaining a first sample set comprising a number of first images having image features of a first audio image and a number of second images having image features of a second audio image;
and performing unsupervised training on the initial neural network model by using the first sample set to obtain an image conversion model.
That is, this embodiment performs unsupervised training on the initial neural network model; the training samples used here only include a number of first images having the image features of the first audio image and a number of second images having the image features of the second audio image, i.e. the first images and the second images need not be in one-to-one correspondence. If the image features of the first audio image are those of a Mandarin Chinese spectrogram and the image features of the second audio image are those of a Shanghainese spectrogram, then the first images are a number of Mandarin spectrograms and the second images are a number of Shanghainese spectrograms.
Specifically, the training manner and the model used for unsupervised training can be freely selected. In this embodiment of the present application, taking a generative adversarial network as an example, the image conversion model may be a Generative Adversarial Network (GAN) model; that is, the image conversion model includes a cycle-generative adversarial network, which comprises a first generative adversarial network and a second generative adversarial network, where the first generative adversarial network is used to extract the audio features to be converted in the first audio image and generate the second audio image based on those features, and the second generative adversarial network is used to detect whether the restored image of the second audio image is consistent with the first audio image. The audio features to be converted are audio features capable of representing audio attributes, such as tone features and the image features corresponding to them.
Specifically, a generative adversarial network model is composed of two parts: a discriminator and a generator. The discriminator learns to distinguish real samples from fake ones, while the generator captures the potential distribution of the real samples and generates pseudo-samples that are indistinguishable from the real samples. A cycle-generative adversarial network (CycleGAN) preserves the key information of the input and converted images by using a cycle-consistency loss, thereby realizing image style transfer with unpaired data.
Taking a cycle-generative adversarial network as an example, referring to fig. 2, a schematic structural diagram of a cycle-generative adversarial network provided in an embodiment of the present application is shown. In fig. 2(a), X and Y correspond to two different image styles, i.e. X and Y have different image features, and the goal is to convert images in X into images in Y. G and F correspond to the generators of the forward and reverse generative adversarial networks (GANs) respectively; that is, G converts an image x in X into a picture G(x) in Y, and a discriminator D_Y then judges whether that picture belongs to Y, which forms the basic structure of a generative adversarial network, i.e. a GAN. Fig. 2(b) adds a further structure compared with (a): G(x) is passed through the generator F of the reverse GAN to produce F(G(x)), and F(G(x)) is required to be as close as possible to the original input x, i.e. the cycle-consistency loss is made as small as possible, thereby solving the problem that a plain GAN cannot output the corresponding picture in a targeted manner. The process in (b), x -> G(x) -> F(G(x)) ≈ x, is referred to as forward cycle consistency. To improve the training effect, the conversion from the Y domain to the X domain is trained in the same way; that process, y -> F(y) -> G(F(y)) ≈ y, is referred to as reverse cycle consistency.
Referring to fig. 3, a schematic diagram of one unidirectional generative adversarial network within a cycle-generative adversarial network is shown. A real picture of domain A is first converted by a generator into a fake picture of domain B, and the picture is then reconstructed through the reverse generator, so that the original picture information is preserved. At the same time, the fake picture and real pictures of domain B pass through a discriminator that judges their authenticity, finally giving the complete unidirectional GAN. In fig. 3, the goal is to convert a picture of domain A into a picture of domain B. Two generators, G_AB and G_BA, are therefore required to convert pictures of domain A and domain B respectively. A picture a of domain A is rendered by generator G_AB as a fake picture in domain B, denoted G_AB(a); G_AB(a) is then passed through generator G_BA to obtain a reconstructed picture of domain A, denoted G_BA(G_AB(a)). Finally, two loss functions are needed to train the unidirectional GAN: the reconstruction loss of the generators and the discrimination loss of the discriminator. The discriminator D_B judges whether an input picture is a real domain-B picture, while the generators aim to reconstruct picture a, i.e. to make the generated picture G_BA(G_AB(a)) as similar as possible to the original.
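A minimal PyTorch-style sketch of the two losses just described follows; the generator and discriminator modules (g_ab, g_ba, d_b) are assumed to exist, and the framework and loss formulations are illustrative choices, since this application prescribes neither:

```python
# Sketch of the unidirectional-GAN losses of fig. 3 (PyTorch is an assumption).
import torch
import torch.nn as nn

l1 = nn.L1Loss()               # reconstruction (cycle-consistency) loss
bce = nn.BCEWithLogitsLoss()   # discrimination loss on raw discriminator scores

def cycle_losses(g_ab, g_ba, d_b, real_a, real_b):
    """One A->B pass: returns cycle, generator-adversarial, and discriminator losses."""
    fake_b = g_ab(real_a)             # domain-A picture rendered as a domain-B fake
    rec_a = g_ba(fake_b)              # reconstruction back to domain A
    cycle_loss = l1(rec_a, real_a)    # forward cycle consistency: G_BA(G_AB(a)) ~ a

    score_fake = d_b(fake_b)          # generator tries to make D_B score the fake as real
    adv_loss = bce(score_fake, torch.ones_like(score_fake))

    score_real = d_b(real_b)          # discriminator separates real domain-B pictures
    score_det = d_b(fake_b.detach())  # ... from fakes (detached from the generator graph)
    d_loss = bce(score_real, torch.ones_like(score_real)) + \
             bce(score_det, torch.zeros_like(score_det))
    return cycle_loss, adv_loss, d_loss
```

In a full CycleGAN, the symmetric B->A pass contributes the reverse cycle-consistency term in the same way.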
The above describes how an image conversion model is obtained by training on the first sample set with a cycle-generative adversarial network, so that the resulting image conversion model can convert the first audio image into the second audio image.
In another possible implementation, the image conversion model may also be generated based on a supervised training approach. In this embodiment, the images in the training samples are required to have a correspondence relationship, that is, the images before conversion and the images after conversion are stored in a one-to-one matching manner. Specifically, the generation of the image conversion model by the supervised training mode may include the following steps:
obtaining a second sample set, wherein the second sample set comprises a plurality of groups of image samples, each group of image samples comprises a third image with image characteristics of a first audio image and a fourth image with image characteristics of a second audio image, and the fourth image corresponds to the third image;
and performing supervised training on the initial neural network model based on the second sample set to obtain an image conversion model.
In the training process, it is necessary to learn the relevant image features of each third image, which has the image features of the first audio image, and of the fourth image corresponding to it. By learning these features, a prediction model is constructed so that the output corresponding to an input can be predicted: the third image is taken as input, the corresponding output image is obtained and compared with the fourth image, and the parameters of the prediction model are adjusted repeatedly until the predicted output image is similar to the fourth image.
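The following hypothetical sketch shows one supervised training step on the second sample set; the prediction model, the optimizer, and the L1 objective are all assumptions, since this application does not fix the initial neural network model:

```python
# Sketch of supervised training on paired spectrograms (third image -> fourth image).
# The L1 objective and optimizer choice are assumptions.
import torch
import torch.nn as nn

def train_step(model, optimizer, third_image, fourth_image):
    optimizer.zero_grad()
    predicted = model(third_image)                         # output predicted from the third image
    loss = nn.functional.l1_loss(predicted, fourth_image)  # compare with the paired fourth image
    loss.backward()                                        # adjust prediction-model parameters
    optimizer.step()
    return loss.item()
```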
It should be noted that, in the embodiment of the present application, the initial neural network model is not limited, as long as the corresponding training process can be satisfied.
The data processing method of the embodiment of the application can convert existing data between different attributes, such as converting audio between different pronunciation forms, thereby amplifying audio data; it can likewise amplify other types of data. The method can alleviate the problems of high collection difficulty, long collection periods, and high time cost, such as the collection of dialect audio data or audio data in a particular language.
In the embodiment of the application, second audio data with the same semantics but different audio attribute characteristics can be obtained based on the first audio data, and this audio data can further be used as training samples for a corresponding audio application model, alleviating the problems of scarce sample data and high sample-collection difficulty when training neural network models. The application scenario of the data processing method is described below, taking an audio recognition model as the audio application model.
An audio data training sample is generated based on the first audio data and the second audio data, and model training is carried out on the audio data training sample to obtain an audio recognition model, where the audio recognition model is used to recognize audio data having different audio attribute characteristics.
In this embodiment, the training sample is supplemented with the second audio data generated from the image obtained by audio image conversion, and then the training sample is subjected to deep learning, and model parameters are continuously adjusted during the learning process, so as to obtain an audio recognition model. The audio recognition model can be a text recognition model, namely, the recognized audio data is converted into corresponding text, and can also be a translation model, namely, the recognized audio is translated into the corresponding audio or text to be output.
The audio data to be identified can be input into the audio identification model to obtain an audio identification result, where the audio identification result includes audio identification content and audio attribute features. The audio attribute features represent the attributes of the current audio, for example its language or dialect, which mainly refers to the different pronunciations that arise in different geographic regions. For example, when a piece of audio is identified by the audio identification model, the specific content of the audio is obtained, and the identification result is output according to the audio attribute features, such as the Shanghainese greeting "nong hao" versus the Mandarin "ni hao". In practical application, content is output according to the matching audio attribute features; if the audio to be identified carries the Shanghainese attribute, the output identification result is rendered in the Shanghainese form.
Correspondingly, if the audio identification content is not unique, the target audio identification content is determined among the candidates based on the audio attribute features. For example, if the audio is a pronunciation close to "nong hao", the candidate identification content may be either the Shanghainese greeting or the Mandarin "ni hao"; if the audio attribute feature is Mandarin, the Mandarin form is output, and if the audio attribute feature is Shanghainese, the Shanghainese form may be output, or it may be translated into "hello".
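A small sketch of this disambiguation step follows; the attribute labels and candidate texts are illustrative, not drawn from this application:

```python
# Hypothetical sketch: select target audio identification content by audio
# attribute feature when the recognized content is not unique.
def pick_target_content(candidates, attribute):
    """candidates maps an attribute label (e.g. 'mandarin', 'shanghainese',
    both hypothetical) to a recognized text; attribute is the detected label."""
    if len(set(candidates.values())) == 1:  # content is unique: nothing to resolve
        return next(iter(candidates.values()))
    return candidates.get(attribute)        # otherwise resolve by attribute
```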
The following takes as an example the amplification of Cantonese audio from the Chinese Mandarin audio in a small-sample audio library, and the related audio recognition application; that is, the corresponding first audio data is Mandarin audio and the second audio data is Cantonese audio. Referring to fig. 4, a schematic diagram of a corpus amplification application scenario provided in an embodiment of the present application is shown.
Firstly, spectrogram data sets are constructed from the spectrograms corresponding to the Chinese Mandarin audio corpus and the spectrograms corresponding to the Cantonese audio corpus, and an image style transfer model is trained on the Mandarin spectrogram data set (Mandarin training dataset X) and the Cantonese spectrogram data set (Cantonese training dataset Y) to obtain an audio spectrogram conversion model. The trained model is then used to generate the Cantonese spectrogram corresponding to a given Mandarin spectrogram, and a language translation tool generates the Cantonese audio corresponding to the given Mandarin, so that for a given Mandarin utterance the above processing yields a Cantonese utterance with the same meaning, completing the Cantonese data expansion task. The expanded data can be used for end-to-end speech recognition training with related algorithms (such as the Transformer and other parallel algorithms), which effectively expands the corpus and saves labor and time cost.
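Putting the pieces together, the following sketches the amplification pipeline of fig. 4 under the same assumptions as the earlier snippets (librosa front end, a trained spectrogram converter g_mandarin_to_cantonese, a 16 kHz sample rate); all of these names and values are illustrative, not mandated by this application:

```python
# Sketch of the fig. 4 corpus-amplification pipeline (all names are assumptions).
import librosa

def amplify(mandarin_wav_paths, g_mandarin_to_cantonese, sr=16000):
    """Generate Cantonese-style audio from a list of Mandarin recordings."""
    augmented = []
    for path in mandarin_wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr)  # audio -> first audio image
        mel_yue = g_mandarin_to_cantonese(mel)            # image style transfer
        y_yue = librosa.feature.inverse.mel_to_audio(     # second image -> audio
            mel_yue, sr=sr)
        augmented.append(y_yue)
    return augmented
```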
Based on the data processing method provided by the embodiment of the application, for the training-corpus problem of speech recognition, the data expansion of speech is converted into the expansion of image data; that is, an image style transfer approach is applied to the spectrogram of the audio to achieve corpus expansion. The method is applicable to small-sample data with high collection difficulty, long collection periods, and high time cost, such as dialects and speech in particular languages, saving a great deal of manpower, material resources, and time.
Referring to fig. 5, which shows a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application, the technical solution of this embodiment mainly serves to reduce the time cost and difficulty of data collection. The data processing apparatus includes:
an acquisition unit 10 for acquiring first audio data;
a conversion unit 20, configured to convert the first audio data to obtain a first audio image;
a generating unit 30 for generating a second audio image based on the first audio image;
and a processing unit 40, configured to process the audio feature information corresponding to the second audio image to obtain second audio data, where the first audio data and the second audio data have the same semantics, and the audio attribute features of the first audio data and the second audio data are different.
Optionally, the generating unit includes:
the first processing subunit is used for processing the first audio image based on the image conversion model to obtain a second audio image;
the image conversion model is used for extracting the audio characteristics to be converted in the first audio image and generating the second audio image based on the audio characteristics, wherein the second audio image and the first audio image have different image characteristics.
Optionally, the apparatus further comprises: a first training subunit, the first training subunit to:
obtaining a first sample set comprising a number of first images having image features of a first audio image and a number of second images having image features of a second audio image;
and performing unsupervised training on the initial neural network model by using the first sample set to obtain an image conversion model.
Further, the image conversion model includes a cycle-generative adversarial network comprising a first generative adversarial network and a second generative adversarial network, wherein,
the first generative adversarial network is used for extracting the audio features to be converted in the first audio image and generating the second audio image based on the audio features;
the second generative adversarial network is used for detecting whether the restored image of the second audio image is consistent with the first audio image.
Optionally, the apparatus further comprises: a second training subunit, the second training subunit to:
obtaining a second sample set, wherein the second sample set comprises a plurality of groups of image samples, each group of image samples comprises a third image with image characteristics of a first audio image and a fourth image with image characteristics of a second audio image, and the fourth image corresponds to the third image;
and performing supervised training on the initial neural network model based on the second sample set to obtain an image conversion model.
Optionally, the apparatus further comprises:
a sample generation unit for generating an audio data training sample based on the first audio data and the second audio data;
and the model training unit is used for carrying out model training based on the audio data training samples to obtain an audio recognition model, and the audio recognition model is used for recognizing the audio data with different audio attribute characteristics.
Further, the apparatus further comprises:
the audio acquisition unit is used for acquiring audio data to be identified;
the identification unit is used for inputting the audio data to be identified into the audio identification model to obtain an audio identification result, wherein the audio identification result comprises audio identification content and audio attribute characteristics.
Optionally, the apparatus further comprises:
and the determining unit is used for determining target audio identification content in the audio identification content based on the audio attribute characteristics if the audio identification content is not unique.
It should be noted that, the specific implementation of each unit in this embodiment may refer to the corresponding content in the foregoing, which is not described in detail herein.
Referring to fig. 6, a schematic structural diagram of an electronic device according to an embodiment of the present application is provided, and the technical solution of this embodiment is mainly used to reduce time cost and difficulty of data collection. Specifically, the electronic device in this embodiment may include the following structure:
a memory 601 for storing an application program and data generated by the running of the application program;
a processor 602, configured to execute the application program to implement:
acquiring first audio data;
converting the first audio data to obtain a first audio image;
generating a second audio image based on the first audio image;
and processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics, and the audio attribute characteristics of the first audio data and the second audio data are different.
According to the technical scheme, in the electronic equipment provided by the embodiment of the application, the first audio data are acquired, and the first audio data are converted to obtain the first audio image; generating a second audio image based on the first audio image; and processing the audio characteristic information corresponding to the second audio image to obtain second audio data. The purpose of generating the audio data with the same semantics and different audio attribute characteristics by the existing audio data is realized by the conversion mode of the audio image, and the time cost and difficulty of data collection are reduced.
It should be noted that, the specific implementation of the processor in this embodiment may refer to the corresponding content in the foregoing, which is not described in detail herein.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. For the device disclosed in an embodiment, since it corresponds to the method disclosed in an embodiment, the description is relatively simple; for relevant points, refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A data processing method, comprising:
acquiring first audio data;
converting the first audio data to obtain a first audio image;
generating a second audio image based on the first audio image;
processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics and have different audio attribute characteristics;
generating an audio data training sample based on the first audio data and the second audio data;
and training the model based on the audio data training sample to obtain an audio recognition model, wherein the audio recognition model is used for recognizing the audio data with different audio attribute characteristics.
2. The method of claim 1, the generating a second audio image based on the first audio image, comprising:
processing the first audio image based on an image conversion model to obtain a second audio image;
the image conversion model is used for extracting audio features to be converted in the first audio image and generating the second audio image based on the audio features, wherein the second audio image and the first audio image have different image features.
3. The method of claim 2, the method further comprising:
obtaining a first sample set comprising a number of first images having image features of a first audio image and a number of second images having image features of a second audio image;
and performing unsupervised training on the initial neural network model by using the first sample set to obtain an image conversion model.
4. The method of claim 3, wherein the image conversion model comprises a cycle-generative adversarial network comprising a first generative adversarial network and a second generative adversarial network, wherein,
the first generative adversarial network is used for extracting audio features to be converted in the first audio image and generating a second audio image based on the audio features;
the second generative adversarial network is used for detecting whether the restored image of the second audio image is consistent with the first audio image.
5. The method of claim 2, the method further comprising:
obtaining a second sample set, wherein the second sample set comprises a plurality of groups of image samples, each group of image samples comprises a third image with image characteristics of a first audio image and a fourth image with image characteristics of a second audio image, and the fourth image corresponds to the third image;
and performing supervised training on the initial neural network model based on the second sample set to obtain an image conversion model.
6. The method of claim 1, the method further comprising:
acquiring audio data to be identified;
and inputting the audio data to be identified into the audio identification model to obtain an audio identification result, wherein the audio identification result comprises audio identification content and audio attribute characteristics.
7. The method of claim 6, the method further comprising:
and if the audio identification content is not unique, determining target audio identification content in the audio identification content based on the audio attribute characteristics.
8. A data processing apparatus comprising:
an acquisition unit configured to acquire first audio data;
the conversion unit is used for converting the first audio data to obtain a first audio image;
a generation unit configured to generate a second audio image based on the first audio image;
the processing unit is used for processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics, and the audio attribute characteristics of the first audio data and the second audio data are different;
a sample generation unit for generating an audio data training sample based on the first audio data and the second audio data;
and the model training unit is used for carrying out model training based on the audio data training samples to obtain an audio recognition model, and the audio recognition model is used for recognizing the audio data with different audio attribute characteristics.
9. An electronic device, the electronic device comprising:
a memory for storing an application program and data generated by the operation of the application program;
a processor for executing the application program to realize:
acquiring first audio data;
converting the first audio data to obtain a first audio image;
generating a second audio image based on the first audio image;
processing the audio characteristic information corresponding to the second audio image to obtain second audio data, wherein the first audio data and the second audio data have the same semantics and have different audio attribute characteristics;
generating an audio data training sample based on the first audio data and the second audio data;
and training the model based on the audio data training sample to obtain an audio recognition model, wherein the audio recognition model is used for recognizing the audio data with different audio attribute characteristics.
CN202110189853.4A 2021-02-18 2021-02-18 Data processing method and device and electronic equipment Active CN113012706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110189853.4A CN113012706B (en) 2021-02-18 2021-02-18 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110189853.4A CN113012706B (en) 2021-02-18 2021-02-18 Data processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113012706A CN113012706A (en) 2021-06-22
CN113012706B (en) 2023-06-27

Family

ID=76403324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110189853.4A Active CN113012706B (en) 2021-02-18 2021-02-18 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113012706B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106847294B (en) * 2017-01-17 2018-11-30 百度在线网络技术(北京)有限公司 Audio-frequency processing method and device based on artificial intelligence
KR102473447B1 (en) * 2018-03-22 2022-12-05 삼성전자주식회사 Electronic device and Method for controlling the electronic device thereof
CN108922518B (en) * 2018-07-18 2020-10-23 苏州思必驰信息科技有限公司 Voice data amplification method and system
CN110751944B (en) * 2019-09-19 2024-09-24 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing voice recognition model
CN110853617B (en) * 2019-11-19 2022-03-01 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN110992938A (en) * 2019-12-10 2020-04-10 同盾控股有限公司 Voice data processing method and device, electronic equipment and computer readable medium
CN111261144B (en) * 2019-12-31 2023-03-03 华为技术有限公司 Voice recognition method, device, terminal and storage medium
CN111816160A (en) * 2020-07-28 2020-10-23 苏州思必驰信息科技有限公司 Mandarin and cantonese mixed speech recognition model training method and system

Also Published As

Publication number Publication date
CN113012706A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN111785261B (en) Cross-language voice conversion method and system based on entanglement and explanatory characterization
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN111276119B (en) Speech generation method, system and computer equipment
CN114333865B (en) Model training and tone conversion method, device, equipment and medium
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN111916054A (en) Lip-based voice generation method, device and system and storage medium
CN112289343B (en) Audio repair method and device, electronic equipment and computer readable storage medium
JP6452061B1 (en) Learning data generation method, learning method, and evaluation apparatus
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN117854545A (en) Multi-instrument identification method and system based on time convolution network
CN113012706B (en) Data processing method and device and electronic equipment
CN112767912A (en) Cross-language voice conversion method and device, computer equipment and storage medium
CN112580669A (en) Training method and device for voice information
CN117037820A (en) Voice conversion method based on diffusion content and style decoupling
CN116453501A (en) Speech synthesis method based on neural network and related equipment
CN116206592A (en) Voice cloning method, device, equipment and storage medium
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
CN115995225A (en) Model training method and device, speech synthesis method and device and storage medium
CN111862931B (en) Voice generation method and device
CN113990295A (en) Video generation method and device
CN114282046A (en) Method for acquiring style corpus and related method and equipment
CN113486208A (en) Voice search equipment based on artificial intelligence and search method thereof
CN118656483B (en) Man-machine interaction dialogue system based on artificial intelligence
CN116612746B (en) Speech coding recognition method in acoustic library based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant