CN116844574A - Voice tag sample generation method, device, equipment and storage medium - Google Patents

Voice tag sample generation method, device, equipment and storage medium Download PDF

Info

Publication number
CN116844574A
Authority
CN
China
Prior art keywords
audio
label
voice
model
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310632820.1A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
孙一夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310632820.1A priority Critical patent/CN116844574A/en
Publication of CN116844574A publication Critical patent/CN116844574A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a voice tag sample generation method, device, equipment and storage medium, relating to the technical fields of artificial intelligence and digital healthcare. The method comprises the following steps: segmenting an audio sample to obtain a plurality of segmented audios, obtaining the label similarity between each segmented audio and the reference audios by using a label correction sub-model, selecting a correction label for each segmented audio from the reference labels of the reference audios based on the label similarity, and adjusting the model by a loss value that combines the correction label with the predicted label of the segmented audio, until a voice label labeling model with suitable parameters is obtained. The voice label labeling model obtained by the embodiment of the application generates a suitable reference label for each segmented audio sample, which effectively reduces the difficulty and cost of audio sample labeling; the label accuracy of the segmented audio samples is higher, and the quantity and quality of the generated voice samples containing labels are effectively improved.

Description

Voice tag sample generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence and digital medical technology, and in particular, to a method, an apparatus, a device, and a storage medium for generating a voice tag sample.
Background
Speech is an important medium for human-machine interaction. In recent years, emotion recognition on speech has been needed to improve the quality of human-machine interaction; for example, an intelligent voice customer service system can track the customer's emotion at any time, which facilitates better promotion and communication, and a home robot can provide emotional support in real time according to the emotional state of its owner. In the medical field, automatic emotion recognition on patient speech can support functions such as auxiliary diagnosis, health management and remote consultation.
In the related art, the speech samples used for training a speech synthesis model contain emotion information, and a large amount of speech data annotated with emotion information is required as training samples. In some technologies, emotion information is annotated on voice data manually, so the collection cost is high, and the number of training samples actually obtained is insufficient, causing the speech emotion recognition model to over-fit. Therefore, in some technologies, the voice data is divided into a plurality of small segments, each small segment is used as a training sample, and the emotion information of each small segment inherits the emotion information of the whole sentence. However, since the emotion information within a sentence is not constant, the emotion information of the voice samples obtained in this way is not accurate. Therefore, how to accurately generate reference labels for voice samples, expand the voice samples, and improve the accuracy of speech emotion recognition has become an urgent technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a voice tag sample generation method, device, equipment and storage medium, which accurately generate reference labels for voice samples and expand the voice samples, so as to improve the accuracy of speech emotion recognition.
To achieve the above object, a first aspect of the embodiments of the present application provides a voice tag sample generation method applied to a voice tag labeling model, where the voice tag labeling model includes: a label prediction sub-model and a label correction sub-model, and the method includes:
segmenting the acquired audio samples to obtain a plurality of segmented audios;
performing label prediction on the segmented audio by using the label predictor model to obtain a predicted label;
selecting the reference audio of each voice category according to a preset category set and a preset feature quantity sequence by using the label correction sub-model, wherein the preset category set comprises a plurality of voice categories, each voice category comprises a plurality of reference audios, and each reference audio comprises a reference label;
calculating the label similarity of the segmented audio and each reference audio by using the label correction sub-model;
performing similarity selection on the segmented audio according to the label similarity to obtain a corrected label of the segmented audio;
adjusting model parameters of the label prediction sub-model and the label correction sub-model according to the label loss values of the predicted label and the corrected label;
and obtaining target audio, inputting the target audio into the voice tag labeling model after parameter adjustment, and obtaining a plurality of target segmented audio samples of the target audio.
In an embodiment, the selecting the reference audio of each voice class by using the label correction sub-model includes:
acquiring a preset category set; the preset class set comprises an audio subset of a plurality of voice classes, each voice class comprising a plurality of reference audio;
obtaining the feature quantity of each voice category based on a preset feature quantity sequence;
and selecting the reference audio of the characteristic quantity from the audio subsets of each voice category to form an audio set.
In an embodiment, the label correction sub-model includes a first feature extractor, and before the label prediction sub-model is used to perform label prediction on the segmented audio, the method further includes:
acquiring the audio set;
and extracting the characteristics of each reference audio in the audio set by using the first characteristic extractor to obtain a first characteristic vector of the reference audio in the audio set.
In an embodiment, the label predictor model includes a second feature extractor, and the label predicting the segmented audio by using the label predictor model to obtain a predicted label, including:
extracting features of the segmented audio by using the second feature extractor to obtain a second feature vector of the segmented audio;
and carrying out category identification on the second feature vector to obtain the predictive label.
In an embodiment, the calculating the tag similarity of the segmented audio and each of the reference audio using the tag correction sub-model includes:
acquiring the second feature vector of the segmented audio;
the tag similarity of the second feature vector and each of the first feature vectors is calculated.
In an embodiment, the selecting the similarity of the segmented audio according to the similarity of the label by using the label correction sub-model to obtain a corrected label of the segmented audio includes:
selecting preset similarity according to the label similarity based on a preset selection principle;
selecting the first feature vector of the preset similarity as a target feature vector;
selecting the reference audio of the target feature vector as similar audio;
And taking the reference label of the similar audio as a correction label of the segmented audio.
In an embodiment, the segmenting the acquired audio samples to obtain a plurality of segmented audio includes:
acquiring an audio sample;
segmenting the audio sample according to the preset segmentation number and the preset segmentation length to obtain segmented audio corresponding to the preset segmentation number.
To achieve the above object, a second aspect of the embodiments of the present application provides a voice tag sample generation device applied to a voice tag labeling model, where the voice tag labeling model includes: a label prediction sub-model and a label correction sub-model, and the device includes:
the audio segmentation unit is used for segmenting the acquired audio samples to obtain a plurality of segmented audios;
the label prediction unit is used for carrying out label prediction on the segmented audio by utilizing the label prediction sub-model to obtain a prediction label;
the reference audio selecting unit is used for selecting the reference audio of each voice category according to a preset category set and a preset feature quantity sequence by using the tag correction submodel; the preset category set comprises a plurality of voice categories, each voice category comprises a plurality of reference audios, and each reference audio comprises a reference label;
A tag similarity calculation unit for calculating tag similarity of the segmented audio and each of the reference audio using the tag correction sub-model;
the label correction unit is used for selecting the similarity of the segmented audio according to the label similarity by utilizing the label correction sub-model to obtain a corrected label of the segmented audio;
the parameter adjustment unit is used for adjusting model parameters of the label prediction sub-model and the label correction sub-model according to the label loss values of the prediction label and the correction label;
the voice sample generation unit is used for acquiring target audio, inputting the target audio into the voice tag annotation model after parameter adjustment, and obtaining a plurality of target segmented audio samples of the target audio.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect when the processor executes the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method according to the first aspect.
According to the voice tag sample generation method, device, equipment and storage medium, the acquired audio sample is segmented to obtain a plurality of segmented audios, the segmented audios are subjected to label prediction by the label prediction sub-model to obtain predicted labels, the label correction sub-model is used to select the reference audios of each voice category, the label similarity between each segmented audio and the reference audios is then calculated, the corrected label of each segmented audio is obtained according to the label similarity, the parameters of the voice tag labeling model are adjusted according to the predicted labels and the corrected labels, and the trained voice tag labeling model is used to acquire a plurality of target segmented audio samples of the target audio. In the embodiments of the present application, the voice tag labeling model obtains a plurality of segmented audio samples from each audio sample, and each segmented audio sample carries a suitable reference label, which effectively reduces the difficulty and cost of labeling audio samples. Because a plurality of reference audios are selected for each voice category, the label accuracy of the segmented audio samples is higher, the quantity and quality of the generated voice samples containing labels are effectively improved, the prediction accuracy of the speech emotion recognition model is further improved, and over-fitting of the speech emotion recognition model is avoided.
Drawings
Fig. 1 is a flowchart of a voice tag sample generation method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a voice tag label model according to the voice tag sample generation method provided by the embodiment of the present invention.
Fig. 3 is a flowchart of step S110 in fig. 1.
Fig. 4 is a schematic diagram of a label predictor model structure of a voice label sample generating method according to an embodiment of the present invention.
Fig. 5 is a flowchart of step S120 in fig. 1.
Fig. 6 is a flowchart of step S130 in fig. 1.
Fig. 7 is a schematic diagram of a label correction sub-model structure of a voice label sample generation method according to an embodiment of the present invention.
Fig. 8 is a flowchart of a voice tag sample generation method according to another embodiment of the present invention.
Fig. 9 is a flowchart of step S140 in fig. 1.
Fig. 10 is a flowchart of step S150 in fig. 1.
Fig. 11 is a schematic diagram of a voice tag label model according to a voice tag sample generation method according to another embodiment of the present invention.
Fig. 12 is a block diagram of a voice tag sample generating apparatus according to still another embodiment of the present invention.
Fig. 13 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several terms involved in the present invention are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Convolutional neural network (CNN): a feedforward neural network that involves convolution computation and has a deep structure, and is one of the representative algorithms of deep learning. A convolutional neural network has representation-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure. Convolutional neural networks imitate the visual perception mechanism of living beings and can perform both supervised and unsupervised learning; the parameter sharing of convolution kernels and the sparsity of inter-layer connections in the hidden layers allow a convolutional neural network to extract features with a small amount of computation. A common convolutional neural network structure is input layer, convolution layer, pooling layer, fully connected layer, output layer.
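By way of a non-limiting illustration of the input layer, convolution layer, pooling layer, fully connected layer, output layer structure mentioned above, the following Python sketch (assuming the PyTorch library, which is not prescribed by the embodiments) builds a minimal convolutional classifier; all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MinimalCNN(nn.Module):
    """Input -> convolution -> pooling -> fully connected -> output."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # convolution layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer
        self.fc = nn.Linear(16 * 32 * 32, num_classes)           # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(torch.relu(self.conv(x)))  # (batch, 16, 32, 32) for 64x64 inputs
        x = x.flatten(1)
        return self.fc(x)                        # output layer (class scores)

# Example: a batch of 4 single-channel 64x64 "spectrogram patches" (illustrative).
logits = MinimalCNN()(torch.randn(4, 1, 64, 64))
```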
Deep learning: is the inherent law and presentation hierarchy of the learning sample data, and the information obtained in these learning processes is greatly helpful for interpretation of data such as text, images and sounds. Its final goal is to have the machine have analytical learning capabilities like a person, and to recognize text, image, and sound data. Deep learning is a complex machine learning algorithm that achieves far greater results in terms of speech and image recognition than prior art. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization techniques, and other related fields. The deep learning makes the machine imitate the activities of human beings such as audio-visual and thinking, solves a plurality of complex pattern recognition problems, and makes the related technology of artificial intelligence greatly advanced.
Mel spectrogram (mel spectrogram): a spectrum obtained by applying a Fourier transform to the acoustic signal and then mapping the result onto the mel scale. A spectrogram is often large, and in order to obtain sound features of a suitable size, it can be passed through a mel-scale filter bank. On the mel scale, mel frequency is approximately linear with respect to human perception of pitch; the mel spectrogram is obtained by combining the mel-frequency scale with the spectrogram.
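As a non-limiting illustration of how such a mel spectrogram may be computed, the following Python sketch assumes the librosa library, an illustrative 16 kHz audio file and illustrative filter-bank parameters; none of these choices is prescribed by the embodiments:

```python
import librosa

# Load an audio file (the path and sampling rate are illustrative assumptions).
waveform, sample_rate = librosa.load("example.wav", sr=16000)

# Short-time Fourier transform followed by mapping onto a mel-scale filter bank.
mel_spectrogram = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel_spectrogram)  # log-compressed mel spectrogram
print(log_mel.shape)  # (n_mels, n_frames)
```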
Speech is an important medium for realizing man-machine interaction, in recent years, emotion recognition is needed for speech to improve man-machine interaction quality, for example, intelligent speech customer service can master emotion of a customer at any time, and better promotion and communication are facilitated; the home robot can provide emotion value and the like in real time according to emotion movements of the host.
The applicant has found that in the related art, the speech samples used for training a speech synthesis model contain emotion information, and a large amount of speech data annotated with emotion information is required as training samples. At present, only a few public data sets are available for training a robust speech emotion recognition model, and these data sets are annotated at the utterance level. In some technologies, emotion information is annotated on voice data manually, so the collection cost is high, and the number of training samples actually obtained is insufficient, causing the speech emotion recognition model to over-fit.
Therefore, in some technologies, in view of the insufficient training data set, a random-cropping method from the field of computer vision is adopted: the voice data is divided into a plurality of small segments, each small segment is used as a training sample, and with small segments as the training unit the original amount of data can be expanded tens of times, the emotion information of each small segment inheriting the emotion information of the whole sentence. However, within a sentence the emotion information is not constant, and different time frames contribute differently to the emotion of the sentence, so the emotion information of voice samples obtained in this way is inaccurate; that is, in the expanded data set, the actual emotion classification of part of the segmented audio is inconsistent with the whole-sentence label obtained by inheritance. For example, a happy song may have a flat passage in the middle, and does not express a happy emotion at every moment. Therefore, how to accurately generate reference labels for voice samples, expand the voice samples, and improve the accuracy of speech emotion recognition has become an urgent technical problem to be solved.
Based on the above, the embodiment of the invention provides a voice tag sample generation method, a device, equipment and a storage medium, which are used for obtaining a plurality of segmented audios by segmenting an obtained audio sample, then carrying out tag prediction on the segmented audios by using a tag prediction sub-model to obtain a predicted tag, then selecting a reference audio of each voice category by using a tag correction sub-model based on a preset category set and a preset feature quantity sequence, then calculating the tag similarity of each segmented audio and the reference audio, obtaining a correction tag of the segmented audio according to the tag similarity, adjusting parameters of a voice tag marking model according to the predicted tag and the correction tag, and obtaining a plurality of target segmented audio samples of a target audio by using a trained voice tag marking model.
According to the embodiment of the application, the voice label labeling model is utilized to obtain a plurality of segmented audio samples according to each audio sample, and each segmented audio sample is provided with the proper reference label, so that the difficulty and cost of labeling the audio samples can be effectively reduced, and as a plurality of reference audios are selected for each voice category, the label labeling accuracy of the segmented audio samples is higher, the generation quantity and quality of the voice samples containing the labels are effectively improved, the prediction accuracy of the voice emotion recognition model is further improved, and the phenomenon that the voice emotion recognition model is over-fitted is avoided.
The embodiment of the application provides a voice tag sample generation method, a device, equipment and a storage medium, and specifically, the following embodiment is used for explaining, and first describes the voice tag sample generation method in the embodiment of the application.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the invention provides a voice tag sample generation method, relates to the technical field of artificial intelligence, and particularly relates to the technical field of data mining. The voice tag sample generation method provided by the embodiment of the invention can be applied to a terminal, a server and a computer program running in the terminal or the server. For example, the computer program may be a native program or a software module in an operating system; the Application may be a local (Native) Application (APP), i.e. a program that needs to be installed in an operating system to run, such as a client that supports voice tag sample generation, or an applet, i.e. a program that only needs to be downloaded to a browser environment to run; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in. Wherein the terminal communicates with the server through a network. The voice tag sample generation method may be performed by a terminal or a server, or performed in conjunction with the terminal and the server.
In some embodiments, the terminal may be a smart phone, tablet, notebook computer, desktop computer, smart watch, or the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms; it may also be a service node in a blockchain system, where a Peer-to-Peer (P2P) network is formed between the service nodes, the P2P protocol being an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server may be deployed with the service of the voice tag sample generation system, through which the server can interact with the terminal; for example, the server may be deployed with corresponding software, which may be an application implementing the voice tag sample generation method, but is not limited to the above forms. The terminal and the server may be connected through a communication connection such as Bluetooth, USB (Universal Serial Bus) or a network, which is not limited here.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In order to facilitate understanding of the embodiments of the present application, the concepts of sample labeling and speech emotion recognition will be briefly described below with reference to examples of specific application scenarios.
The speech data contains emotion information; for example, when chatting about a certain event, a person may express emotions related to happiness (happy, neutral, sad), and when receiving another person's apology, emotions related to forgiveness (forgiving, reluctant to forgive, unforgiving), and so on. Speech emotion recognition extracts the emotion information in the speech data and divides it into different speech emotion classification results according to preset classification criteria. The preset classification criteria may be happy, sad, distressed or angry, etc., and different classification criteria may be set according to the actual usage scenario.
In a customer service scenario: in order to ensure the service quality, special customer service quality inspectors are required to perform spot check monitoring and scoring on the service records, and form quality reports to be fed back to business personnel and customer service personnel. The voice data in the customer service call process is identified, and the emotion states of the customer service and the user are identified, so that the service quality condition can be effectively monitored.
Input: voice data from the customer service call;
Output: emotion information corresponding to the voice data.
Sample labeling means labeling the voice samples used in training the speech emotion recognition model, such as [voice sample 1, anger], [voice sample 2, calm], and so on. Training with a large number of labeled samples can improve the accuracy of the speech emotion recognition model in recognizing the emotion of the input speech.
The following describes a voice tag sample generation method in an embodiment of the present invention.
Fig. 1 is an optional flowchart of a voice tag sample generating method according to an embodiment of the present invention, where the method in fig. 1 may include, but is not limited to, steps S110 to S170. It should be understood that the order of steps S110 to S170 in fig. 1 is not particularly limited, and the order of steps may be adjusted, or some steps may be reduced or increased according to actual requirements.
Step S110: and segmenting the acquired audio samples to obtain a plurality of segmented audios.
In one embodiment, the audio samples include voice content and voice tags, where the voice tags may be voice emotion tags. The voice emotion labels are obtained by classifying emotion information according to preset classification criteria. For example, when chatting, a person expresses emotions related to happiness (happy, neutral, sad); when hearing another person's apology, emotions related to forgiveness (forgiving, reluctant to forgive, unforgiving); and when excited, emotions related to agitation. Some speech therefore carries emotion information. In one embodiment, the preset classification criteria may be happy, sad, distressed or angry, etc. The preset classification criteria are not specifically limited here, and different classification criteria can be set according to the actual usage scenario to obtain different voice emotion labels. For example, in a customer service evaluation system, it is necessary to determine whether the customer service agent showed anger during the service, and the emotion type to be recognized may be set to anger.
In one embodiment, the voice content is a wave file representing the voice signal as a waveform (wave), a spectrogram representing the wave file in the frequency domain, or a file of Mel-frequency cepstral coefficients (MFCC); this embodiment does not limit the representation form of the voice content. In this embodiment, the voice content may be collected from a voice client, for example by capturing a piece of audio input by a user through an audio input device (e.g., a microphone) on a terminal, or obtained from a database that aggregates a plurality of voice contents. For example, in a customer service system, the voice content is obtained by acquiring the call content between the customer service agent and the user; or, in a security service scenario, the voice content is obtained by acquiring the call content between the security agent and the user. This embodiment does not specifically limit the method of obtaining the voice content.
In one embodiment, referring to FIG. 2, the phonetic label annotation model 10 comprises: a label predictor sub-model 100 and a label corrector sub-model 200.
In an embodiment, referring to fig. 3, a flowchart of a specific implementation of step S110 is shown in an embodiment, where step S110 of segmenting an acquired audio sample to obtain a plurality of segmented audio includes:
Step S111: an audio sample is acquired.
In one embodiment, the voice content in the audio sample may be a whole sentence, and the voice tag is a tag related to emotion information of the whole sentence.
Step S112: segmenting the audio sample according to the preset segmentation number and the preset segmentation length to obtain segmented audio corresponding to the preset segmentation number.
In an embodiment, in order to obtain a complete segmented audio, the number of preset segments of different audio samples may be different, and the preset segment lengths of different segmented audio may be different.
In an embodiment, speech recognition is first performed on the voice content to obtain the recognized text corresponding to the voice content; segmentation features of the recognized text are then extracted, a segmentation-related model is built in advance, and segmentation detection is performed on the recognized text using the extracted segmentation features and the pre-built model to determine the positions of the required segments, thereby obtaining the preset segment number for the voice content and the preset segment length corresponding to each segmented audio. The voice frames of the voice content are then segmented according to the preset segment number and the preset segment length corresponding to each segmented audio, which improves the integrity of the segmented speech. It can be understood that the preset segment number and the preset segment length corresponding to each segmented audio may also be prior values chosen according to the actual application. The preset segment number and the preset segment length corresponding to each segmented audio are not specifically limited here and can be set according to actual requirements.
In this way, the audio sample is segmented into a plurality of segmented audios, which greatly increases the number of samples and avoids over-fitting during sample training.
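A minimal sketch of such fixed-length segmentation is given below, assuming Python and NumPy; the preset segment number and segment length shown are illustrative values only, not values prescribed by the embodiments:

```python
import numpy as np

def segment_audio(waveform: np.ndarray, sample_rate: int,
                  num_segments: int, segment_seconds: float) -> list:
    """Cut a waveform into a preset number of segments of a preset length.

    In practice the preset segment number and length may differ per audio
    sample, as described above; these parameters are illustrative.
    """
    segment_samples = int(segment_seconds * sample_rate)
    segments = []
    for i in range(num_segments):
        start = i * segment_samples
        end = start + segment_samples
        if start >= len(waveform):
            break
        segments.append(waveform[start:end])
    return segments

# Example: a 10-second waveform at 16 kHz cut into 5 segments of 2 seconds.
segments = segment_audio(np.random.randn(16000 * 10), 16000,
                         num_segments=5, segment_seconds=2.0)
```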
Step S120: and carrying out label prediction on the segmented audio by using a label prediction sub-model to obtain a predicted label.
In one embodiment, referring to FIG. 4, the structure of the label predictor model 100 includes: a second feature extractor 110, a full connection layer 120, and a class prediction layer 130. Wherein the second feature extractor 110 receives the input segmented audio, the second feature extractor 110 is connected to the full connection layer 120, the full connection layer 120 is connected to the class prediction layer 130, and the class prediction layer 130 outputs a prediction tag of the segmented audio.
In an embodiment, referring to fig. 5, a flowchart of a specific implementation of step S120 is shown in an embodiment, where step S120 of segmenting an acquired audio sample to obtain a plurality of segmented audio includes:
step S121: and extracting the characteristics of the segmented audio by using a second characteristic extractor to obtain a second characteristic vector of the segmented audio.
In one embodiment, the feature extraction process extracts high-dimensional frequency-domain features, which are typically used to find periodic characteristics in the segmented audio. Frequency-domain analysis mainly uses the Fourier transform to convert the original signal into a frequency-domain sequence, where the values in the sequence correspond to the energy of each frequency over the time region. Frequency-domain feature extraction methods include Mel-frequency cepstral coefficients, chroma features, short-time average zero-crossing rate, spectral root mean square value, spectral central moment, spectral single-tone value, spectral bandwidth, spectral polynomial coefficients, and the like.
In an embodiment, the second feature extractor first extracts first audio features from the segmented audio, where the first audio features are low-level audio features. In an embodiment, the first audio features may include spectrum-related feature values such as the short-time average zero-crossing rate, Mel-frequency cepstral coefficients, spectral root mean square value, spectral central moment, spectral single-tone value, spectral bandwidth and spectral polynomial coefficients, which are not specifically limited here.
In one embodiment, the second feature extractor is a ResNet network with the last fully connected layer removed; that is, the second feature extractor is a convolutional neural network based on deep learning. The first audio features are convolved with the convolution kernels, the waveform sequence is abstracted to learn the spatial and temporal relationships of the segmented audio, and the first audio features of the segmented audio are also reduced in dimensionality. A second feature vector of the segmented audio is constructed using the learned spatial and temporal relationships. It can be understood that the second feature extractor comprises a plurality of convolution kernels, and the parameters of the convolution kernels are learned automatically by back-propagation gradient descent.
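The following sketch illustrates one possible second feature extractor of this kind, assuming Python with a recent torchvision; the choice of resnet18 and of a 3-channel spectrogram-image input are illustrative assumptions, not requirements of the embodiments:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SegmentFeatureExtractor(nn.Module):
    """A ResNet backbone with the final fully connected layer removed."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep every layer except the last fully connected layer.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)   # (batch, 512, 1, 1) after global average pooling
        return feats.flatten(1)    # second feature vector, shape (batch, 512)

# Example: 4 segmented audios represented as 3-channel 128x128 spectrogram images.
vectors = SegmentFeatureExtractor()(torch.randn(4, 3, 128, 128))
```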
As can be seen from the above description, in connection with fig. 4, the second feature extractor 110 is capable of extracting features of the segmented audio to obtain a second feature vector of the segmented audio.
Step S122: and carrying out category identification on the second feature vector to obtain a predictive label.
In an embodiment, referring to fig. 4, the second feature vector is input to the full connection layer 120 to perform a full connection operation, where the full connection operation maps the learned second feature vector representation to the classification space of the segmented audio, and the class prediction layer 130 obtains the voice class corresponding to each segmented audio, that is, the voice emotion class, and uses the obtained voice class as the prediction label.
In an embodiment, the fully connected layer 120 maps the extracted second feature vector to values in the range 0 to 1 through a Softmax function to obtain mapped values. The output values are non-negative and normalized, i.e., they form a probability distribution, so each mapped value can be understood as a probability, giving the predicted probability of each speech category; the category with the highest value is the speech category finally predicted for the segmented audio. The mapped values are then input into the category prediction layer 130 to obtain the prediction label.
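For illustration, a minimal sketch of the fully connected operation and Softmax-based category prediction is given below, assuming PyTorch, a 512-dimensional second feature vector and two assumed categories (normal, anger); all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative head mapping a 512-dimensional second feature vector to 2 categories.
classifier = nn.Linear(512, 2)

second_feature_vector = torch.randn(4, 512)       # output of the feature extractor
logits = classifier(second_feature_vector)        # fully connected operation
probabilities = torch.softmax(logits, dim=-1)     # non-negative, sums to 1 per sample
predicted_labels = probabilities.argmax(dim=-1)   # index of the highest probability
```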
In an application example, suppose the voice categories include two categories, normal and anger, and the audio sample is segmented into 3 segmented audios, namely segmented audio 1, segmented audio 2 and segmented audio 3, whose mapped values are as follows:
As can be seen from the above table, the normal probability value of segmented audio 1 is greater than its anger probability value, so the prediction label of segmented audio 1 is: normal; the anger probability value of segmented audio 2 is greater than its normal probability value, so the prediction label of segmented audio 2 is: anger; and the anger probability value of segmented audio 3 is greater than its normal probability value, so the prediction label of segmented audio 3 is: anger.
From the above, in the embodiment of the present application, label prediction is performed on the segmented audio by using the label predictor model to obtain a predicted label.
Step S130: and selecting the reference audio of each voice category by using the label correction sub-model.
In an embodiment, the set of preset categories comprises a plurality of speech categories, each speech category comprising a plurality of reference tones, each reference tone comprising a reference tag, i.e. the speech category is characterized by the reference tag.
In one embodiment, a preset category set is first obtained, where the preset category set is a set of audio data samples containing a plurality of reference audios for each of a plurality of voice categories. The types of reference audio may include: songs, rap, chatting, hypnosis audio, melodies, dialogue, and the like. In order to ensure that the voice tag labeling model can account for both the correlations and the differences between audios, the reference audios may have different lengths, and the content of each reference audio should differ as much as possible; the collected reference audios form the preset category set.
In an embodiment, a plurality of reference tones are included in each speech category, each reference tone including a reference tag, i.e. the speech category is characterized by the reference tag.
In one embodiment, in order to obtain more reference audios, the reference audio may be used as original audio and augmented. Specifically, the augmentation includes one or a combination of the following operations: audio sequence cropping, audio sequence repetition, audio sequence rotation, audio pitch raising, audio pitch lowering, adding Gaussian noise to the audio, audio data compression, audio data expansion, and the like. It can be understood that these augmentation operations do not change the emotion information of the reference audio. In this embodiment, the original audio is augmented in these ways to obtain augmented samples corresponding to the original audio, and the label of each augmented sample inherits the label of the reference audio.
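A non-limiting sketch of a few of the listed expansion operations is given below, assuming Python with NumPy and librosa; all parameter values are illustrative assumptions, and none of these operations is intended to change the inherited emotion label:

```python
import numpy as np
import librosa

def augment_reference_audio(waveform: np.ndarray, sample_rate: int) -> dict:
    """A few illustrative expansion operations applied to one reference audio."""
    return {
        "cropped":   waveform[: len(waveform) // 2],                    # sequence cropping
        "repeated":  np.concatenate([waveform, waveform]),              # sequence repetition
        "rotated":   np.roll(waveform, sample_rate // 10),              # sequence rotation
        "pitch_up":  librosa.effects.pitch_shift(y=waveform, sr=sample_rate, n_steps=2),
        "noisy":     waveform + 0.005 * np.random.randn(len(waveform)), # Gaussian noise
        "stretched": librosa.effects.time_stretch(y=waveform, rate=1.1),
    }

# Example with a 2-second random waveform standing in for a reference audio.
augmented = augment_reference_audio(np.random.randn(32000), 16000)
```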
In an embodiment, referring to fig. 6, a flowchart showing a specific implementation of step S130 is shown, in this embodiment, step S130 of selecting reference audio of each voice class by using the label correction sub-model includes:
step S131: and acquiring a preset category set.
In an embodiment, the set of preset categories includes a plurality of audio subsets of speech categories, each audio subset including a plurality of reference audio.
Step S132: and obtaining the feature quantity of each voice category based on the preset feature quantity sequence.
In an embodiment, different numbers of reference audios may need to be selected for different voice categories, and the set of feature quantities of all the voice categories is called the preset feature quantity sequence. In an embodiment, more than one reference audio is selected at random from the audio subset of each voice category, that is, the feature quantity of each voice category is greater than 1; this avoids the potentially larger error that arises when only one reference audio is selected, and selecting a plurality of reference audios further improves the rationality of the selection, thereby improving the accuracy of voice tag labeling.
Step S133: and selecting reference audio with characteristic quantity from the audio subset of each voice category to form an audio set.
In an embodiment, as can be seen from the above, an audio subset of each voice category is constructed according to the reference audio in each voice category, where the audio subset includes all the reference voices of the corresponding voice category, and then a corresponding number of reference audio is selected from each audio subset according to the feature number corresponding to each voice category to form an audio set. It is understood that the audio set contains multiple reference audio for each speech category.
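The following sketch illustrates, under assumed data structures, how an audio set may be formed by randomly selecting the feature quantity of reference audios from each audio subset; the function and variable names are hypothetical, not part of the embodiments:

```python
import random

def build_audio_set(preset_category_set: dict, feature_quantities: dict) -> dict:
    """Randomly pick the preset feature quantity of reference audios per category.

    `preset_category_set` maps each voice category to its audio subset, given
    here as a list of (reference_audio, reference_label) pairs;
    `feature_quantities` plays the role of the preset feature quantity sequence.
    """
    audio_set = {}
    for category, audio_subset in preset_category_set.items():
        k = feature_quantities[category]          # more than 1 per category
        audio_set[category] = random.sample(audio_subset, k)
    return audio_set

# Example with two assumed voice categories and dummy reference audios.
audio_set = build_audio_set(
    {"normal": [("ref_n1", "normal"), ("ref_n2", "normal"), ("ref_n3", "normal")],
     "anger":  [("ref_a1", "anger"), ("ref_a2", "anger"), ("ref_a3", "anger")]},
    {"normal": 2, "anger": 2},
)
```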
In one embodiment, referring to FIG. 7, the structure of the tag correction sub-model 200 includes: a first feature extractor 210 and a similarity calculator 220, the output of the first feature extractor 210 being connected to the input of the similarity calculator 220.
In an embodiment, step S130 is followed by steps for processing the reference audio. Referring to fig. 8, which shows a flowchart of a specific implementation of these steps, in an embodiment the steps for processing the reference audio include:
step S810: an audio collection is obtained.
Step S820: and extracting the characteristics of each reference audio in the audio set by using a first characteristic extractor to obtain a first characteristic vector of the reference audio in the audio set.
In an embodiment, referring to fig. 7, a first feature extractor 210 is used to perform feature extraction on each reference audio in the audio set, so as to obtain a first feature vector of each reference audio.
In an embodiment, referring to fig. 4 and 7, the model architecture of the first feature extractor 210 and the second feature extractor 110 are the same, and the model parameters are the same, and both may be the same feature extractor. The feature extraction process is therefore described above with respect to the feature extraction of the second feature extractor, with the object that the extracted features of the segmented audio and the reference audio are correlated, enabling a subsequent comparison. It can be appreciated that the parameters of the two may be fine-tuned according to actual requirements, which is not particularly limited in this embodiment.
From the above, the first feature extractor can obtain the first feature vector of each reference audio.
Step S140: and calculating the label similarity of the segmented audio and each reference audio by using the label correction sub-model.
In one embodiment, referring to FIG. 7, the tag correction sub-model 200 uses the similarity calculator 220 to calculate a tag similarity between each of the segmented audio and the reference audio.
In an embodiment, referring to fig. 9, a flowchart of a specific implementation of step S140 is shown, in this embodiment, step S140 of calculating a tag similarity between a segmented audio and each reference audio by using a tag correction sub-model includes:
step S141: a second feature vector of the segmented audio is obtained.
In an embodiment, referring to fig. 7, in conjunction with step S121, the first feature extractor 210 may be used to perform feature extraction on the segmented audio to obtain a second feature vector of the segmented audio.
Step S142: and calculating the label similarity of the second characteristic vector and each first characteristic vector.
In an embodiment, the tag similarity is a vector similarity value between the second feature vector and the first feature vector. In an embodiment, the vector similarity value is a cosine similarity value, specifically, a cosine value of an included angle of two vectors in the vector space, that is, the vector similarity value is calculated by using a cosine similarity calculation method, and the difference between the two vectors is measured.
The tag similarity is expressed as: cos⟨a, b⟩ = (a · b) / (‖a‖ ‖b‖), where a and b are the two feature vectors, a · b denotes their inner product, and ‖·‖ denotes the modulus (norm) of a vector.
It can be understood that the tag similarity can be obtained by comparing other methods for representing the similarity of the two vectors, and the specific calculation mode of the tag similarity is not limited in the embodiment of the present application.
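A minimal sketch of this cosine-similarity computation is given below, assuming Python with NumPy; the 512-dimensional feature vectors are an illustrative assumption:

```python
import numpy as np

def tag_similarity(second_feature_vector: np.ndarray,
                   first_feature_vector: np.ndarray) -> float:
    """Cosine similarity between a segment vector and one reference vector."""
    inner_product = np.dot(second_feature_vector, first_feature_vector)
    norms = (np.linalg.norm(second_feature_vector)
             * np.linalg.norm(first_feature_vector))
    return float(inner_product / norms)

# Example with random 512-dimensional feature vectors (dimension is illustrative).
a, b = np.random.randn(512), np.random.randn(512)
print(tag_similarity(a, b))  # value in [-1, 1]
```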
Step S150: and selecting the similarity of the segmented audio according to the similarity of the labels to obtain the corrected labels of the segmented audio.
In an embodiment, the reference tag of the reference voice corresponding to the preset similarity is a correction tag of the segmented audio.
In an embodiment, referring to fig. 10, which is a flowchart showing a specific implementation of step S150, in this embodiment, a step S150 of selecting similarity of segmented audio according to similarity of labels by using a label correction sub-model to obtain a corrected label of the segmented audio includes:
step S151: and selecting the preset similarity according to the label similarity based on a preset selection principle.
In one embodiment, the preset selection principle is the maximum-value principle. It can be understood that the number of label similarities obtained between the segmented audio and the reference audios corresponds to the number of reference audios. Therefore, the maximum-value principle of this embodiment means selecting the largest similarity value among all the label similarities as the preset similarity.
Step S152: and selecting a first feature vector with preset similarity as a target feature vector.
In an embodiment, a first feature vector corresponding to a preset similarity is used as a target feature vector, and the target feature vector is related to a label of the segmented audio.
Step S153: and selecting the reference audio of the target feature vector as the similar audio.
In an embodiment, the reference audio corresponding to the target feature vector is used as the similar audio.
Step S154: and taking the reference label of the similar audio as a correction label of the segmented audio.
In one embodiment, the reference tag of the similar audio is used as a correction tag to which the label correction is performed on the segmented audio.
From the above, the label correction sub-model obtains, from a plurality of reference audios, a correction label that can represent the voice category of the segmented audio.
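For illustration, the following sketch (Python with NumPy, with hypothetical names and an assumed 512-dimensional feature size) applies the maximum-value principle described above to select the correction label:

```python
import numpy as np

def select_correction_label(second_feature_vector: np.ndarray,
                            first_feature_vectors: np.ndarray,
                            reference_labels: list) -> str:
    """Return the reference label whose first feature vector is most similar.

    `first_feature_vectors` has one row per reference audio in the audio set;
    the maximum similarity plays the role of the preset similarity.
    """
    dots = first_feature_vectors @ second_feature_vector
    norms = (np.linalg.norm(first_feature_vectors, axis=1)
             * np.linalg.norm(second_feature_vector))
    similarities = dots / norms                  # label similarities
    target_index = int(np.argmax(similarities))  # maximum-value principle
    return reference_labels[target_index]        # correction label

correction = select_correction_label(
    np.random.randn(512), np.random.randn(4, 512),
    ["normal", "normal", "anger", "anger"])
```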
Step S160: and adjusting model parameters of the label prediction sub-model and the label correction sub-model according to the label loss values of the prediction label and the correction label.
In one embodiment, when training the voice tag labeling model, the prediction label and the correction label are compared to calculate a label loss value, which may be a cross-entropy loss value S: the correction label y′ and the prediction label y are each used to compute a cross-entropy term against the network output, and the two parts are weighted to obtain the total loss, expressed as:
S = (1 - α)·y + α·y′
where S represents the total cross-entropy loss value, α represents a weighting hyper-parameter, y represents the cross-entropy term computed with the prediction label, and y′ represents the cross-entropy term computed with the correction label.
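A minimal sketch of this weighted cross-entropy loss is given below, assuming PyTorch; the value of the hyper-parameter α and the example batch of labels are illustrative assumptions following the description above, not a prescribed implementation:

```python
import torch
import torch.nn.functional as F

def label_loss(logits: torch.Tensor, predicted_label: torch.Tensor,
               correction_label: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of two cross-entropy terms, S = (1 - alpha)*y + alpha*y'.

    y is the cross entropy of the network output against the prediction label,
    y' against the correction label; alpha = 0.5 is an assumed value.
    """
    y = F.cross_entropy(logits, predicted_label)         # term with prediction label
    y_prime = F.cross_entropy(logits, correction_label)  # term with correction label
    return (1 - alpha) * y + alpha * y_prime

# Example: a batch of 4 segments over 2 assumed speech categories.
logits = torch.randn(4, 2, requires_grad=True)
loss = label_loss(logits, torch.tensor([0, 1, 1, 0]), torch.tensor([0, 1, 0, 0]))
loss.backward()  # gradients used to adjust the model parameters
```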
In an embodiment, the parameters of the voice tag labeling model are adjusted according to the obtained cross-entropy loss value, that is, the model parameters of the label prediction sub-model and the label correction sub-model are adjusted, and if a preset convergence condition is met, the generation of the voice tag samples is completed. It can be understood that the preset convergence condition may be that the cross-entropy loss value is smaller than a threshold or that a preset number of iterations is reached; the preset convergence condition is not specifically limited in this embodiment.
Step S170: and obtaining target audio, inputting the target audio into the voice tag labeling model after parameter adjustment, and obtaining a plurality of target segmented audio samples of the target audio.
In one embodiment, the target segment audio samples each include a corresponding tag.
In an embodiment, for the trained voice tag labeling model, the similarity between the predicted label and the corrected label is high, so the predicted label can also represent the emotion information of the segmented audio well. It can be understood that if the predicted label and the corrected label are compared during application and there is a large difference between the two, the voice tag labeling model needs to be retrained.
As can be seen from the foregoing, in the embodiment of the present application, the acquired audio samples are segmented to obtain a plurality of segmented audio, then the segmented audio is subjected to label prediction by using a label predictor model to obtain a predicted label, then the reference audio of each voice class is selected by using a label modifier model based on a preset class set and a preset feature quantity sequence, then the label similarity of each segmented audio and the reference audio is calculated, a modified label of the segmented audio is obtained according to the label similarity, parameters of a voice label model are adjusted according to the predicted label and the modified label, and a plurality of target segmented audio samples of the target audio are acquired by using the trained voice label model.
In one embodiment, referring to fig. 11, a schematic diagram of a voice tag labeling model according to an embodiment of the present application is shown. The voice tag labeling model 10 includes a label predictor sub-model 100 and a label corrector sub-model 200. The label predictor sub-model 100 includes a second feature extractor 110, and the label corrector sub-model 200 includes a first feature extractor 210 and a similarity calculator 220. The similarity calculator 220 includes an audio subset 221 of each voice class, and the parameters of the second feature extractor 110 and the first feature extractor 210 are shared.
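To make the parameter sharing concrete, the following toy sketch keeps one weight array that both feature extractors reference; the class names, dimensions, and the linear extractor are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

class FeatureExtractor:
    """Toy linear extractor; `weights` stands in for the shared network parameters."""
    def __init__(self, weights):
        self.weights = weights

    def extract(self, frames):
        return frames @ self.weights          # (d_in,) -> (d_feat,)

class VoiceTagLabelingModel:
    """Predictor (100) and corrector (200) whose two extractors share one parameter array."""
    def __init__(self, d_in=40, d_feat=16, class_subsets=None):
        shared = 0.01 * np.random.randn(d_in, d_feat)
        self.second_extractor = FeatureExtractor(shared)   # 110, inside the label predictor
        self.first_extractor = FeatureExtractor(shared)    # 210, inside the label corrector
        self.class_subsets = class_subsets or {}           # audio subsets 221, one per voice class
```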
The overall flow of the voice tag sample generation method in the embodiment of the present application is described below with reference to fig. 11.
In this embodiment, the audio sample is denoted D0 = {X0, Y0}, where X0 is the speech content and Y0 is the whole-sentence (utterance-level) label carrying emotion information. A plurality of segmented audios are obtained by segmentation, expressed as D = {S, F} = {(s1, f1), …, (sn, fn)}, where n is the number of segmented audios, (s1, f1) denotes the first segmented audio, s1 denotes its speech content, f1 denotes its predicted label, and so on.
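A minimal sketch of how one audio sample might be expanded into the segment set D; here `predict_label` stands in for the label predictor sub-model, and the equal-length slicing is an assumption (the patent segments according to a preset segmentation number and length).

```python
def segment_audio(waveform, num_segments):
    """Split one audio sample X0 into num_segments equal-length segmented audios s1..sn."""
    seg_len = len(waveform) // num_segments
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def build_segment_set(waveform, num_segments, predict_label):
    """Build D = {(s1, f1), ..., (sn, fn)}, where fi is the label predicted for si."""
    return [(s, predict_label(s)) for s in segment_audio(waveform, num_segments)]
```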
A segmented audio x is input, and its second feature vector is obtained with the first feature extractor 210 of the label corrector sub-model 200 (whose parameters are shared with the second feature extractor 110). The first feature vectors of the reference audios in the audio set form a first feature set, and the similarity calculator 220 calculates the label similarity between the second feature vector and the first feature vector of each reference audio; these label similarities form a similarity matrix. The label similarity with the highest score in the similarity matrix is taken as the preset similarity, the first feature vector having the preset similarity is taken as the target feature vector, the reference audio of the target feature vector is then selected as the similar audio, the reference label of the similar audio is used as the correction label of the segmented audio, and the label loss value is obtained from the correction label and the predicted label. It can be understood that, when training the voice tag labeling model, both the predicted labels of the segmented audios and the network parameters of the whole voice tag labeling model need to be updated during the iterations. Unlike the traditional deep-learning paradigm, in which only the model parameters are updated, the training process of this embodiment also obtains updated and refined predicted labels for the segmented audios, so a voice tag labeling model with higher prediction accuracy can be obtained.
Further, after the trained voice tag labeling model is obtained, segment-level prediction is performed on each segmented audio, and the final utterance-level predicted label is obtained by mode voting. In this embodiment, mode voting means predicting all the segmented audios of one audio to obtain a plurality of segment-level predicted labels and taking the most frequent predicted label among them as the predicted voice class of the utterance-level audio. Performing label self-optimization on the predicted labels of the segmented audios improves the prediction accuracy of the segment-level labels and, at the same time, the prediction accuracy at the utterance level of the whole sentence.
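A minimal sketch of the mode voting step; the function name and the example labels are illustrative.

```python
from collections import Counter

def mode_vote(segment_predictions):
    """Return the most frequent segment-level predicted label as the utterance-level label."""
    return Counter(segment_predictions).most_common(1)[0][0]

# Example: mode_vote(["happy", "happy", "neutral"]) returns "happy".
```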
In the embodiment of the application, against the background of scarce labeled speech-emotion samples, the existing utterance-level audio labeling information is exploited for data enhancement. In addition, unlike approaches that compute the mean of each voice class to obtain a single reference audio, the feature quantity of each voice class in this embodiment is greater than 1: selecting only one reference audio per voice class may introduce a large error, whereas selecting a plurality of reference audios makes the selection more reasonable and thus improves the accuracy of the voice label annotation. According to the technical scheme provided by the embodiment of the application, the acquired audio samples are segmented to obtain a plurality of segmented audios; the segmented audios are subjected to label prediction by the label predictor sub-model to obtain predicted labels; the reference audios of each voice class are selected by the label corrector sub-model based on a preset class set and a preset feature quantity sequence; the label similarity between each segmented audio and the reference audios is calculated, and a corrected label of the segmented audio is obtained according to the label similarity; the parameters of the voice tag labeling model are adjusted according to the predicted labels and the corrected labels; and a plurality of target segmented audio samples of the target audio are acquired with the trained voice tag labeling model. In the embodiment of the application, the voice tag labeling model yields a plurality of segmented audio samples from each audio sample, each carrying an appropriate reference label, which effectively reduces the difficulty and cost of labeling audio samples; because a plurality of reference audios are selected for each voice class, the label annotation accuracy of the segmented audio samples is higher, which effectively improves the quantity and quality of the generated labeled voice samples, improves the prediction accuracy of the speech emotion recognition model trained on these voice samples, and helps avoid overfitting of the speech emotion recognition model.
The voice tag sample generation method provided by the embodiment of the application can be used in the medical field. The audio samples are patient speech collected with the patients' informed consent; training samples are then generated with the voice tag sample generation method provided by the embodiment of the application, and an emotion classification model suitable for medical scenarios is trained on them. The application of the emotion classification model in the digital medical field is described below.
In a medical scenario, whether a patient's psychological state is stable can be judged by analyzing emotional changes in the patient's voice signal, and more personalized treatment services can be provided. For example, some patients with mental disorders may experience fluctuations in their condition or mood swings; the emotion classification model can monitor such fluctuations and prompt the physician to make timely adjustments. This is very useful for treating psychological problems such as depression and anxiety, and can help medical workers make better treatment decisions. Meanwhile, classifying the emotion in the patient's voice can provide doctors with additional predictive and recommendation information; with this information, doctors can better understand the patient's condition and provide necessary auxiliary treatment, thereby formulating a more effective treatment scheme.
The embodiment of the invention also provides a voice tag sample generation device, which can implement the voice tag sample generation method described above, where the voice tag labeling model comprises: a label predictor sub-model and a label corrector sub-model. Referring to fig. 12, the device comprises:
the audio segmentation unit 1210 is configured to segment the acquired audio samples to obtain a plurality of segmented audio.
The label prediction unit 1220 is configured to perform label prediction on the segmented audio by using the label prediction sub-model, so as to obtain a predicted label.
A reference audio selecting unit 1230, configured to select the reference audio of each voice class according to a preset class set and a preset feature quantity sequence by using the tag correction sub-model; the preset class set includes a plurality of voice classes, each voice class including a plurality of reference audios, each reference audio including a reference label.
A tag similarity calculating unit 1240 for calculating the tag similarity of the segmented audio and each reference audio using the tag correction sub-model.
A label correction unit 1250, configured to perform similarity selection on the segmented audio according to the label similarity by using the label correction sub-model, to obtain a corrected label of the segmented audio.
A parameter adjustment unit 1260 for adjusting model parameters of the label predictor model and the label modifier model according to the label loss values of the predicted label and the modifier label.
A speech sample generating unit 1270, configured to obtain target audio, input the target audio into the voice tag labeling model after parameter adjustment, and obtain a plurality of target segmented audio samples of the target audio.
The specific implementation manner of the voice tag sample generating device in this embodiment is substantially identical to the specific implementation manner of the voice tag sample generating method described above, and will not be described in detail herein.
The embodiment of the invention also provides electronic equipment, which comprises:
at least one memory; at least one processor; at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the voice tag sample generating method of the present invention. The electronic equipment can be any intelligent terminal including a mobile phone, a tablet personal computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a vehicle-mounted computer and the like.
Referring to fig. 13, fig. 13 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1301 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solution provided by the embodiments of the present invention; the memory 1302 may be implemented in the form of a ROM (Read Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory), and may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program code is stored in the memory 1302 and invoked by the processor 1301 to execute the voice tag sample generation method of the embodiments of the present disclosure; an input/output interface 1303 is used to implement information input and output; the communication interface 1304 is used to implement communication interaction between this device and other devices, either in a wired manner (e.g. USB, network cable) or in a wireless manner (e.g. mobile network, WIFI, Bluetooth); and a bus 1305 transfers information between the various components of the device (e.g., the processor 1301, the memory 1302, the input/output interface 1303, and the communication interface 1304); the processor 1301, the memory 1302, the input/output interface 1303, and the communication interface 1304 are communicatively connected to each other inside the device via the bus 1305.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the voice tag sample generation method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the voice tag sample generation method and device, the electronic device, and the storage medium, the acquired audio samples are segmented to obtain a plurality of segmented audios; the segmented audios are subjected to label prediction by the label prediction sub-model to obtain predicted labels; the reference audios of each voice class are selected by the label correction sub-model based on a preset class set and a preset feature quantity sequence; the label similarity between each segmented audio and the reference audios is calculated, and a corrected label of the segmented audio is obtained according to the label similarity; the parameters of the voice tag labeling model are adjusted according to the predicted labels and the corrected labels; and a plurality of target segmented audio samples of the target audio are acquired with the trained voice tag labeling model. In the embodiment of the application, the voice tag labeling model yields a plurality of segmented audio samples from each audio sample, each carrying an appropriate reference label, which effectively reduces the difficulty and cost of labeling audio samples; because a plurality of reference audios are selected for each voice class, the label annotation accuracy of the segmented audio samples is higher, which effectively improves the quantity and quality of the generated labeled voice samples, improves the prediction accuracy of the speech emotion recognition model trained on these voice samples, and helps avoid overfitting of the speech emotion recognition model.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A voice tag sample generation method, characterized in that a voice tag labeling model comprises: a label predictor model and a label corrector model, the method comprising:
segmenting the acquired audio samples to obtain a plurality of segmented audios;
performing label prediction on the segmented audio by using the label predictor model to obtain a predicted label;
selecting reference audio of each voice category by using the label correction sub-model, wherein each reference audio comprises a reference label;
calculating the label similarity of the segmented audio and each reference audio by using the label correction sub-model;
performing similarity selection on the segmented audio according to the label similarity to obtain a corrected label of the segmented audio;
adjusting model parameters of the label predictor model and the label modifier model according to the label loss values of the predicted label and the modifier label;
and obtaining target audio, inputting the target audio into the voice tag labeling model after parameter adjustment, and obtaining a plurality of target segmented audio samples of the target audio.
2. The method for generating a voice tag sample according to claim 1, wherein the selecting the reference audio for each voice category using the tag correction sub-model comprises:
Acquiring a preset category set; the preset class set comprises a plurality of audio subsets of voice classes, each audio subset comprising a plurality of reference audio;
obtaining the feature quantity of each voice category based on a preset feature quantity sequence;
and selecting the reference audio of the characteristic quantity from the audio subsets of each voice category to form an audio set.
3. The method according to claim 2, wherein the label correction sub-model includes a first feature extractor, and before the label prediction is performed on the segmented audio by using the label prediction sub-model to obtain a predicted label, the method comprises:
acquiring the audio set;
and extracting the characteristics of each reference audio in the audio set by using the first characteristic extractor to obtain a first characteristic vector of the reference audio in the audio set.
4. The method of claim 3, wherein the label predictor model includes a second feature extractor, and wherein the label prediction performed on the segmented audio by using the label predictor model to obtain a predicted label comprises:
Extracting features of the segmented audio by using the second feature extractor to obtain a second feature vector of the segmented audio;
and carrying out category identification on the second feature vector to obtain the predictive label.
5. The method of claim 4, wherein said calculating the tag similarity of the segmented audio and each of the reference audio using the tag correction sub-model comprises:
acquiring the second feature vector of the segmented audio;
the tag similarity of the second feature vector and each of the first feature vectors is calculated.
6. The method for generating a voice tag sample according to claim 5, wherein performing similarity selection on the segmented audio according to the tag similarity by using the tag correction sub-model to obtain a corrected tag of the segmented audio, comprises:
selecting preset similarity according to the label similarity based on a preset selection principle;
selecting the first feature vector of the preset similarity as a target feature vector;
selecting the reference audio of the target feature vector as similar audio;
and taking the reference label of the similar audio as a correction label of the segmented audio.
7. The method for generating voice tag samples according to any one of claims 1 to 6, wherein the segmenting the acquired audio samples to obtain a plurality of segmented audio comprises:
acquiring an audio sample;
segmenting the audio sample according to the preset segmentation number and the preset segmentation length to obtain segmented audio corresponding to the preset segmentation number.
8. A voice tag sample generation apparatus, wherein the voice tag labeling model comprises: a label predictor model and a label corrector model, the apparatus comprising:
the audio segmentation unit is used for segmenting the acquired audio samples to obtain a plurality of segmented audios;
the label prediction unit is used for carrying out label prediction on the segmented audio by utilizing the label prediction sub-model to obtain a prediction label;
a reference audio selecting unit, configured to select a reference audio of each voice class by using the tag correction sub-model, where each reference audio includes a reference tag;
a tag similarity calculation unit for calculating tag similarity of the segmented audio and each of the reference audio using the tag correction sub-model;
The label correction unit is used for selecting the similarity of the segmented audio according to the label similarity to obtain a correction label of the segmented audio;
the parameter adjustment unit is used for adjusting model parameters of the label prediction sub-model and the label correction sub-model according to the label loss values of the prediction label and the correction label;
the voice sample generation unit is used for acquiring target audio, inputting the target audio into the voice tag annotation model after parameter adjustment, and obtaining a plurality of target segmented audio samples of the target audio.
9. An electronic device comprising a memory storing a computer program and a processor that when executing the computer program implements the voice tag sample generation method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the voice tag sample generation method of any one of claims 1 to 7.
CN202310632820.1A 2023-05-31 2023-05-31 Voice tag sample generation method, device, equipment and storage medium Pending CN116844574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310632820.1A CN116844574A (en) 2023-05-31 2023-05-31 Voice tag sample generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310632820.1A CN116844574A (en) 2023-05-31 2023-05-31 Voice tag sample generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116844574A true CN116844574A (en) 2023-10-03

Family

ID=88164292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310632820.1A Pending CN116844574A (en) 2023-05-31 2023-05-31 Voice tag sample generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116844574A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination