CN111554277B - Voice data recognition method, device, equipment and medium - Google Patents

Voice data recognition method, device, equipment and medium

Info

Publication number
CN111554277B
CN111554277B CN202010417958.6A
Authority
CN
China
Prior art keywords
data
voice data
preset
tag
simulated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010417958.6A
Other languages
Chinese (zh)
Other versions
CN111554277A (en)
Inventor
宋元峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010417958.6A priority Critical patent/CN111554277B/en
Publication of CN111554277A publication Critical patent/CN111554277A/en
Application granted granted Critical
Publication of CN111554277B publication Critical patent/CN111554277B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a voice data recognition method, device, equipment and medium. The method comprises: acquiring voice data to be recognized and inputting it into a preset recognition model optimized with non-manually annotated voice data, wherein the non-manually annotated voice data are obtained from a preset pre-training model optimized with simulated tag data, and the simulated tag data are obtained by converting preset unlabeled original voice data; and performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result. The application addresses the technical problem of low speaker voice separation efficiency in the prior art, which is caused by the need to manually label large amounts of training data.

Description

Voice data recognition method, device, equipment and medium
Technical Field
The application relates to the technical field of artificial intelligence in financial technology (Fintech), and in particular to a voice data recognition method, device, equipment and medium.
Background
With the continuous development of financial technology, especially internet-based finance, more and more technologies (such as distributed computing, blockchain and artificial intelligence) are applied in the financial field. The financial industry in turn places higher requirements on these technologies, including higher requirements on the recognition of speech data.
With the spread of mobile devices, speech has become a daily communication channel. Speaker voice separation, which first segments the speech of different speakers out of a speech stream and then decides which speaker each segment belongs to, is therefore increasingly important. At present, however, accurate speaker voice separation requires a large amount of labeled training data; that is, the prior art relies on manually labeling large amounts of training data, which wastes resources and reduces the efficiency of speaker voice separation.
Disclosure of Invention
The application mainly aims to provide a voice data recognition method, device, equipment and medium, so as to solve the technical problem of low speaker voice separation efficiency caused by the need to manually label a large amount of training data in the prior art.
In order to achieve the above object, the present application provides a voice data recognition method, including:
acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual annotation voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated tag data, and the simulated tag data are obtained based on conversion of preset non-tag original voice data;
and carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
Optionally, the simulated tag data is obtained by partially replacing preset unlabeled original voice data with generated unlabeled voice data, and the simulated tag data at least comprises data of a true simulated tag and a false simulated tag.
Optionally, before the step of performing feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain the classification result, the method further includes:
acquiring original voice data without a preset tag;
generating unlabeled voice data, and partially replacing the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
Optionally, the step of training a preset training model based on the simulated tag data to obtain a target model meeting a preset condition, and setting the target model as the preset pre-training model includes:
Determining simulated false tag data and simulated true tag data in the simulated tag data;
inputting the simulated false tag data and the simulated true tag data into a preset training model to obtain a recognition result;
and adjusting model parameters of the preset training model based on the identification result and the true and false simulation labels in the simulation label data until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model.
Optionally, the step of acquiring the preset unlabeled original voice data includes:
and acquiring an original voice file without labels, and carrying out vector coding processing on the original voice file based on a preset attention mechanism to obtain original voice data without labels.
Optionally, the step of generating the unlabeled voice data, and replacing the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data includes:
generating non-tag voice data, wherein the non-tag voice data comprises non-tag random voice frame data or non-tag random voice fragment data;
and partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data.
Optionally, after the step of performing feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result, the method includes:
and updating the preset recognition model by the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The application also provides a voice data recognition device, which comprises:
the first acquisition module is used for acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual annotation voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated tag data, and the simulated tag data are obtained based on conversion of preset non-tag original voice data;
and the classification module is used for carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
Optionally, the simulated tag data is obtained by partially replacing preset unlabeled original voice data with generated unlabeled voice data, and the simulated tag data at least comprises data of a true simulated tag and a false simulated tag.
Optionally, the voice data recognition device further includes:
the second acquisition module is used for acquiring preset original voice data without labels;
the generation module is used for generating non-tag voice data, and partially replacing the preset non-tag original voice data with the non-tag voice data to obtain simulated tag data;
and the third acquisition module is used for training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
Optionally, the third obtaining module includes:
a determining unit, configured to determine simulated false tag data and simulated true tag data in the simulated tag data;
the first acquisition unit is used for inputting the simulated false tag data and the simulated true tag data into a preset training model to obtain a recognition result;
and the adjusting unit is used for adjusting the model parameters of the preset training model based on the identification result and the true and false simulation labels in the simulation label data until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model.
Optionally, the second obtaining module includes:
the second acquisition unit is used for acquiring an original voice file without labels, and carrying out vector coding processing on the original voice file based on a preset attention mechanism to obtain preset original voice data without labels.
Optionally, the generating module includes:
a generation unit configured to generate unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice fragment data;
and the replacing unit is used for partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data.
Optionally, the voice data recognition device further includes:
and the updating module is used for updating the preset recognition model by the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The application also provides a voice data recognition apparatus, which is a physical device comprising: a memory, a processor, and a program of the voice data recognition method stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the voice data recognition method described above.
The present application also provides a medium on which a program for implementing the above-mentioned voice data recognition method is stored, which when executed by a processor implements the steps of the above-mentioned voice data recognition method.
According to the application, voice data to be recognized are obtained and input into a preset recognition model optimized with non-manually annotated voice data; the non-manually annotated voice data are obtained from a preset pre-training model optimized with simulated tag data, and the simulated tag data are obtained by converting preset unlabeled original voice data; feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result. Because the non-manually annotated voice data are produced by the preset pre-training model optimized with simulated tag data rather than by manual annotation, and because the simulated tag data are labeled data obtained by converting the preset unlabeled original voice data, the preset pre-training model can be optimized to label data accurately. This improves the overall training efficiency of the preset recognition model, and performing feature extraction on the voice data to be recognized with this model to obtain the classification result improves the efficiency of speaker voice separation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a voice data recognition method according to a first embodiment of the present application;
FIG. 2 is a detailed flowchart of the steps preceding the step of performing feature extraction on the speech data to be recognized based on the preset recognition model to obtain a classification result, in the first embodiment of the speech data recognition method of the present application;
FIG. 3 is a schematic diagram of a device architecture of a hardware operating environment according to an embodiment of the present application;
fig. 4 is a schematic diagram of a first scenario in the voice data recognition method of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
An embodiment of the present application provides a voice data recognition method, in a first embodiment of the voice data recognition method of the present application, referring to fig. 1, the voice data recognition method includes:
step S10, obtaining voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual labeling voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated tag data, and the simulated tag data are obtained based on conversion of preset non-tag original voice data;
and step S20, carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
The method comprises the following specific steps:
step S10, obtaining voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual labeling voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated tag data, and the simulated tag data are obtained based on conversion of preset non-tag original voice data;
In this embodiment, a speaker voice separation technology is provided that relies on an accurately trained preset recognition model. Accurately training such a model requires a large amount of labeled training data; in the prior art this training data has to be labeled manually, which is time-consuming, labor-intensive and scales poorly.
It should be noted that the preset pre-training model has already been trained. Specifically, it is obtained by training on simulated tag data, and the simulated tag data are obtained by converting preset unlabeled original voice data. Because the data used to train the model are derived from the unlabeled original voice data itself, a sufficiently trained model learns a hidden representation of the original voice data, and this hidden representation implicitly contains information such as the identity of the speaker; a sufficiently trained preset pre-training model can therefore label data accurately. Moreover, since the model is trained on unsupervised (unlabeled) data, it generalizes well. To learn the hidden representation, each piece of original voice data is encoded into a plurality of feature codes, for example (1,0,1,0) or (1,0,1,0,1,0). Although each piece of original voice data is encoded with a plurality of features, in order to obtain simulated tag data each piece may include at least a first-bit coding feature of 1 indicating that the voice data are authentic rather than synthesized; alternatively, the first-bit coding feature of 1 may indicate that the number of frames contained in the original voice data equals a preset number.
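For illustration only (this sketch is not part of the original disclosure), the first-bit coding feature described above could be realized as follows in Python; the function name and the example code vector are assumptions:

    import numpy as np

    def with_authenticity_flag(feature_codes, is_real):
        # Prefix a feature-code vector with a flag bit:
        # 1 = authentic (collected) speech, 0 = synthesized/generated speech.
        flag = 1 if is_real else 0
        return np.concatenate(([flag], np.asarray(feature_codes)))

    # Example: an authentic piece encoded as (0, 1, 0, 1, 0) becomes
    # (1, 0, 1, 0, 1, 0), matching the codes mentioned above.
    print(with_authenticity_flag([0, 1, 0, 1, 0], True))   # [1 0 1 0 1 0]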
It should be noted that a labeling layer formed from the preset pre-training model may be embedded in the preset recognition model, so that non-manually annotated voice data can be produced inside the preset recognition model for its training and further fine-tuning.
It should be noted that the simulated tag data may be obtained by partially replacing the preset unlabeled original voice data with generated unlabeled voice data; in that case the simulated tag data comprise at least data carrying a true simulated tag and data carrying a false simulated tag.
Alternatively, the simulated tag data may be obtained by deleting frame data from part of the preset unlabeled original voice data; in that case the simulated tag data comprise at least data carrying a complete simulated tag and data carrying an incomplete simulated tag.
In this embodiment, the coding features of each piece of original voice data may further include at least coding features of the speaker's voice, so that voice data annotated with the speaker's voice characteristics can be obtained.
This embodiment is described below taking, as an example, a labeling layer formed by embedding the preset pre-training model in the preset recognition model.
And step S20, carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
In this embodiment, after the preset recognition model has been accurately obtained from the annotated voice data, and because it has been trained sufficiently, feature extraction processing can be performed on the voice data to be recognized based on the preset recognition model to obtain a classification result, where the classification result includes the speech segments of the different speakers and the speaker to which each segment belongs.
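As an illustrative sketch only (not part of the original disclosure), the classification step could look as follows in PyTorch; the model architecture, segment length and feature dimensions are assumptions:

    import torch

    def classify_segments(model, features, segment_len=100):
        # features: (num_frames, feat_dim) acoustic features of the speech
        # to be recognized; model: the already trained recognition model.
        # Returns one predicted speaker id per fixed-length segment.
        model.eval()
        speaker_ids = []
        with torch.no_grad():
            for start in range(0, features.size(0), segment_len):
                segment = features[start:start + segment_len].unsqueeze(0)  # (1, T, D)
                logits = model(segment)                                     # (1, num_speakers)
                speaker_ids.append(int(logits.argmax(dim=-1)))
        return speaker_ids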
Before the step of extracting features of the voice data to be recognized based on the preset recognition model to obtain a classification result, the method further comprises the following steps:
step S11, obtaining preset original voice data without labels;
in this embodiment, the preset unlabeled original voice data are obtained first. To ensure the training effect, the amount of original voice data should exceed a preset number, and the original voice data are in speech form. The original voice data may be real, i.e. neither synthesized nor generated, so that simulated tag data can be produced from them; non-synthesized voice data here means speech actually uttered by a person and collected, rather than speech fitted by a machine. It should be noted that each piece of voice data comprises a plurality of voice files, such as z1, z2, z3 and z4 in fig. 4, and each voice file comprises a plurality of frames of voice data, such as X1, X2, X3 and X4 in fig. 4.
Alternatively, each piece of voice data includes a fixed number of voice files (z1, z2, z3, z4, etc. in fig. 4), and each voice file includes a fixed number of frames of voice data (X1, X2, X3, X4, etc. in fig. 4); fixing the number of frames per voice file makes it possible to generate simulated tag data based on the frame count.
In this embodiment, the specific content of the original voice data is not limited; what matters is that the original data are unlabeled, so that the preset pre-training model is obtained by unsupervised training.
Step S12, generating unlabeled voice data, and partially replacing the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
After the original voice data are obtained, unlabeled voice data are generated. Specifically, the unlabeled voice data are produced by a preset generator (the generator in fig. 4), i.e. they are obtained by machine fitting, so they constitute generated or synthesized data carrying a simulated false tag. The generated unlabeled voice data may have the length of a single frame or of a speech segment; this is not specifically limited, and to ensure that replacement can be performed at any position the generator produces unlabeled voice data of various lengths. The preset unlabeled original voice data are then partially replaced with this unlabeled voice data to obtain the simulated tag data.
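A minimal sketch (not part of the original disclosure) of how such replacement could be carried out, assuming frame-level acoustic features stored as NumPy arrays and a stand-in random generator instead of a trained neural generator:

    import numpy as np

    def random_generator(num_frames, feat_dim):
        # Stand-in for the preset generator: machine-fitted (synthetic) frames.
        return np.random.randn(num_frames, feat_dim).astype(np.float32)

    def make_simulated_tag_data(real_frames, replace_ratio=0.15):
        # real_frames: (num_frames, feat_dim) features of one piece of
        # preset unlabeled original voice data.
        # Returns the partially replaced sequence plus per-frame tags:
        # 1 = original frame (simulated true tag),
        # 0 = replaced frame (simulated false tag).
        num_frames, feat_dim = real_frames.shape
        corrupted = real_frames.copy()
        tags = np.ones(num_frames, dtype=np.int64)
        num_replace = max(1, int(replace_ratio * num_frames))
        idx = np.random.choice(num_frames, size=num_replace, replace=False)
        corrupted[idx] = random_generator(num_replace, feat_dim)
        tags[idx] = 0
        return corrupted, tags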
In this embodiment, another way of obtaining the simulated tag data is also provided: frame data in the preset unlabeled original voice data are deleted at random, or frame data are added at random, to obtain the simulated tag data.
Step S13, training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
A preset training model is trained based on the simulated tag data until a target model meeting preset conditions is obtained, and the target model is set as the preset pre-training model. That is, the preset training model is trained on the replaced preset unlabeled original voice data together with the unreplaced preset unlabeled original voice data; alternatively, it may be trained on the preset unlabeled original voice data whose frames have been deleted or added together with the untouched preset unlabeled original voice data.
Training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model, wherein the step of training the preset training model comprises the following steps:
Step A1, determining simulated false tag data and simulated true tag data in the simulated tag data;
in this embodiment, the replaced original voice data are known simulated false-tag data, and the remaining unreplaced original voice data are known simulated true-tag data; together, these constitute the simulated tag data.
Alternatively, in this embodiment, the original voice data whose frames have been deleted or added are known simulated false-tag data, and the remaining unprocessed original voice data are known simulated true-tag data; together, these constitute the simulated tag data.
Step A2, inputting the simulated false label data and the simulated true label data into a preset training model to obtain a recognition result;
and A3, adjusting model parameters of the preset training model based on the identification result and the true and false simulation labels in the simulation label data until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model.
After the known simulated false-tag data and the known simulated true-tag data are obtained, they are input into the preset training model to train it. Specifically, the preset training model predicts the simulated false-tag data and the simulated true-tag data and outputs a recognition result, i.e. which pieces of the original voice data it believes are original (unreplaced and unprocessed) and which are not original (replaced, or with frames deleted), as shown in fig. 4. Because the simulated false-tag data and simulated true-tag data are known, i.e. it is known which pieces are actually original and which have been replaced, the prediction error can be determined. The model parameters of the preset training model are adjusted based on this error until the preset training condition is met, and the resulting target model is set as the preset pre-training model.
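The following PyTorch sketch (not part of the original disclosure) shows one way such a true/false recognition step could be implemented; the GRU-based architecture and hyper-parameters are assumptions:

    import torch
    import torch.nn as nn

    class FrameDiscriminator(nn.Module):
        # Predicts, for every frame, whether it is original (1) or replaced (0).
        def __init__(self, feat_dim=40, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)

        def forward(self, x):                    # x: (batch, frames, feat_dim)
            h, _ = self.rnn(x)
            return self.head(h).squeeze(-1)      # (batch, frames) logits

    def pretrain_step(model, optimizer, frames, tags):
        # frames: (batch, frames, feat_dim) simulated tag data;
        # tags:   (batch, frames) with 1 = true simulated tag, 0 = false.
        logits = model(frames)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, tags.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()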
It should be noted that, because the preset unlabeled original voice data are encoded with a plurality of features, other implicit representations of the data are learned as a by-product of this true/false recognition task.
After the preset pre-training model is obtained, non-manually annotated voice data are produced with it, and the preset recognition model is then obtained from these data. Specifically, a basic model is trained on the non-manually annotated voice data until it meets a certain preset condition, and the resulting model is set as the preset recognition model. The preset condition may be that the preset loss function of the basic model converges, or that the training of the basic model has reached a preset number of iterations.
According to the application, voice data to be recognized are obtained and input into a preset recognition model optimized with non-manually annotated voice data; the non-manually annotated voice data are obtained from a preset pre-training model optimized with simulated tag data, and the simulated tag data are obtained by converting preset unlabeled original voice data; feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result. Because the non-manually annotated voice data are produced by the preset pre-training model optimized with simulated tag data rather than by manual annotation, and because the simulated tag data are labeled data obtained by converting the preset unlabeled original voice data, the preset pre-training model can be optimized to label data accurately. This improves the overall training efficiency of the preset recognition model, and performing feature extraction on the voice data to be recognized with this model to obtain the classification result improves the efficiency of speaker voice separation.
Further, according to the first embodiment of the present application, in another embodiment of the present application, the step of adjusting the model parameters of the preset training model based on the recognition result until a target model satisfying a preset condition is obtained, and setting the target model as the preset pre-training model includes:
adjusting model parameters of the preset training model based on the identification result until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model;
the preset conditions comprise that the number of training iterations reaches a preset number or that the preset loss function converges.
In this embodiment, the specific content of the preset condition is defined: either the number of training iterations reaches a preset number, or the preset loss function converges. The preset number may be, for example, 10,000 iterations. Accurately determining the target model that meets the preset condition lays the foundation for accurately obtaining the preset recognition model.
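A small sketch (not part of the original disclosure) of how the preset condition could be checked; the window size and convergence tolerance are assumptions:

    def training_finished(losses, max_steps=10000, tol=1e-4, window=100):
        # Stop when the number of training iterations reaches the preset
        # number, or when the loss has effectively stopped decreasing
        # (i.e. the preset loss function has converged).
        if len(losses) >= max_steps:
            return True
        if len(losses) >= 2 * window:
            recent = sum(losses[-window:]) / window
            previous = sum(losses[-2 * window:-window]) / window
            return abs(previous - recent) < tol
        return False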
Further, in another embodiment of the present application, the step of generating unlabeled voice data, replacing the preset unlabeled original voice data with part of the unlabeled voice data, and obtaining simulated label data includes:
Step B1, generating non-tag voice data, wherein the non-tag voice data comprises non-tag random voice frame data or non-tag random voice fragment data;
and B2, partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data.
In this embodiment, the unlabeled voice data include unlabeled random voice frame data or unlabeled random voice segment data. After the unlabeled voice data are generated, the preset unlabeled original voice data are partially replaced with them as follows. If unlabeled random voice frame data are generated, several pieces of preset unlabeled original voice data are selected and at least one frame of each selected piece is replaced with the unlabeled random voice frame data, yielding the simulated tag data. If unlabeled random voice segment data are generated, several pieces of preset unlabeled original voice data are selected and at least one segment of each selected piece is replaced with the unlabeled random voice segment data, yielding the simulated tag data. It should be noted that each piece of preset unlabeled original voice data may comprise a plurality of voice segments, and each voice segment may comprise multiple frames of data.
In addition, in this embodiment, after the unlabeled random voice frame data are generated, several pieces of preset unlabeled original voice data may be selected and one frame of unlabeled random voice frame data added to each of them, or at least two frames of unlabeled random voice frame data added to each of them, to obtain the simulated tag data.
Alternatively, in this embodiment, several pieces of preset unlabeled original voice data may be selected and one frame, or at least two frames, of voice frame data removed from each of them to obtain the simulated tag data.
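An illustrative sketch (not part of the original disclosure) of the frame-deletion/frame-addition variant, again assuming NumPy frame features; the utterance-level labelling convention is an assumption:

    import numpy as np

    def corrupt_frame_count(real_frames, mode="delete", n=2):
        # Randomly delete or add n frames so the piece no longer contains
        # the preset number of frames. Returns the corrupted sequence and
        # an utterance-level tag: 0 = incomplete simulated tag (processed);
        # untouched pieces would keep tag 1 (complete simulated tag).
        if mode == "delete":
            keep = np.random.choice(len(real_frames), len(real_frames) - n, replace=False)
            corrupted = real_frames[np.sort(keep)]
        else:
            extra = np.random.randn(n, real_frames.shape[1]).astype(real_frames.dtype)
            positions = np.sort(np.random.randint(0, len(real_frames) + 1, size=n))
            corrupted = np.insert(real_frames, positions, extra, axis=0)
        return corrupted, 0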
In this embodiment, the unlabeled voice data is generated, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data; and partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data. Because the simulated tag data are accurately obtained, a foundation is laid for training a preset recognition model.
In another embodiment of the voice data recognition method, the step of obtaining the original voice data without the preset tag includes:
and step C1, acquiring an original voice file without a tag, and performing vector coding processing on the original voice file based on a preset attention mechanism to obtain original voice data without the tag.
In this embodiment, it is explained how the unlabeled original voice data are obtained. Specifically, an unlabeled original voice file is obtained first, and vector encoding processing based on a preset attention mechanism is then performed on it (rather than encoding all features, which improves efficiency). An attention mechanism mimics the way a person selectively focuses on part of the available information while ignoring the rest: to make reasonable use of limited processing resources, one selects a specific portion of the input and concentrates on it, just as a reader focuses on only a small number of the words on a page at a time. The attention mechanism therefore has two main aspects: deciding which part of the input to attend to, and allocating the limited processing resources to the important part. In this embodiment, by introducing the attention mechanism, the original voice file is processed so that attention is directed to the information features of the different speakers, yielding the unlabeled original voice data. For example, whereas an existing acoustic feature extraction process over the original voice data would yield an M-row matrix of voice feature data, e.g. 12 rows (assuming 12-dimensional acoustic features) by N columns, where N is the total number of frames and each dimension vector differs, with the attention mechanism a smaller matrix, e.g. 10 rows by N columns, may be obtained instead. In other words, the unlabeled original voice data contain at least the vectors or feature codes of the information features of the different speakers.
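As a sketch only (not part of the original disclosure), an attention-based encoder reducing, say, 12-dimensional frame features to 10 dimensions could be written as follows in PyTorch; the concrete layer choices are assumptions:

    import torch
    import torch.nn as nn

    class AttentiveEncoder(nn.Module):
        # Encodes the frames of one voice file, attending more strongly to
        # frames that carry speaker information, and projects the features
        # to a smaller dimension (e.g. 12 -> 10 rows in the matrix above).
        def __init__(self, in_dim=12, out_dim=10):
            super().__init__()
            self.score = nn.Linear(in_dim, 1)        # per-frame attention score
            self.proj = nn.Linear(in_dim, out_dim)   # dimension reduction

        def forward(self, x):                         # x: (num_frames, in_dim)
            weights = torch.softmax(self.score(x), dim=0)   # (num_frames, 1)
            attended = weights * x                    # emphasise informative frames
            return self.proj(attended)                # (num_frames, out_dim)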
In this embodiment, an unlabeled original voice file is obtained, vector encoding processing based on a preset attention mechanism is performed on the original voice file, so as to obtain unlabeled original voice data, and a foundation is laid for accurate prediction.
In another embodiment of the voice data recognition method, the method includes the steps of:
and D1, updating the preset recognition model by the voice data to be recognized and the classification result to obtain an updated preset recognition model.
In this embodiment, after the voice data to be recognized and the classification result are obtained, they are used as already-labeled first training data to retrain, and thus update, the preset recognition model. To avoid updating the preset recognition model too frequently, the model is updated at preset intervals, for example once a month or once a week: the voice data to be recognized processed by the preset recognition model over the past month or week, together with the corresponding classification results, are collected and fed into the preset recognition model as annotated data. Alternatively, once the number of voice data to be recognized and their corresponding classification results reaches a preset number, these data are fed into the preset recognition model as annotated data to update it.
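A minimal sketch (not part of the original disclosure) of such an update policy; the buffer structure, retrain function and thresholds are assumptions:

    import time

    class UpdateBuffer:
        def __init__(self):
            self.samples = []          # (voice features, classification result) pairs
            self.last_update = time.time()

    def maybe_update_model(model, buffer, retrain_fn,
                           min_samples=1000, min_interval_s=7 * 24 * 3600):
        # Retrain the preset recognition model only when enough newly
        # classified data have accumulated, or when the preset interval
        # (e.g. one week) has elapsed, to avoid overly frequent updates.
        due = (len(buffer.samples) >= min_samples
               or time.time() - buffer.last_update >= min_interval_s)
        if due:
            retrain_fn(model, buffer.samples)   # samples are used as labeled data
            buffer.samples.clear()
            buffer.last_update = time.time()
        return due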
In this embodiment, after the voice data to be recognized and the classification result are obtained, the voice data to be recognized and the corresponding classification result are used as the first training data already marked to retrain the preset recognition model so as to update the preset recognition model.
Referring to fig. 3, fig. 3 is a schematic device structure diagram of a hardware running environment according to an embodiment of the present application.
As shown in fig. 3, the voice data recognition apparatus may include: a processor 1001 (e.g. a CPU), a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM or a non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Optionally, the voice data recognition device may further include a user interface, a network interface, a camera, RF (radio frequency) circuitry, a sensor, an audio circuit, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g. a WI-FI interface).
It will be appreciated by those skilled in the art that the speech data recognition device structure shown in fig. 3 is not limiting of the speech data recognition device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 3, an operating system, a network communication module, and a voice data recognition program may be included in the memory 1005, which is one type of computer medium. The operating system is a program that manages and controls the hardware and software resources of the speech data recognition device, supporting the operation of the speech data recognition program and other software and/or programs. The network communication module is used to enable communication between components within the memory 1005 and other hardware and software in the voice data recognition system.
In the voice data recognition apparatus shown in fig. 3, a processor 1001 is configured to execute a voice data recognition program stored in a memory 1005, and to implement the steps of the voice data recognition method described in any one of the above.
The specific implementation manner of the voice data recognition device of the present application is basically the same as that of each embodiment of the voice data recognition method, and will not be repeated here.
The application also provides a voice data recognition device, which comprises:
The first acquisition module is used for acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual annotation voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated tag data, and the simulated tag data are obtained based on conversion of preset non-tag original voice data;
and the classification module is used for carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
Optionally, the simulated tag data is obtained by partially replacing preset unlabeled original voice data with generated unlabeled voice data, and the simulated tag data at least comprises data of a true simulated tag and a false simulated tag.
Optionally, the voice data recognition device further includes:
the second acquisition module is used for acquiring preset original voice data without labels;
the generation module is used for generating non-tag voice data, and partially replacing the preset non-tag original voice data with the non-tag voice data to obtain simulated tag data;
and the third acquisition module is used for training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
Optionally, the third obtaining module includes:
a determining unit, configured to determine simulated false tag data and simulated true tag data in the simulated tag data;
the first acquisition unit is used for inputting the simulated false tag data and the simulated true tag data into a preset training model to obtain a recognition result;
and the adjusting unit is used for adjusting the model parameters of the preset training model based on the identification result and the true and false simulation labels in the simulation label data until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model.
Optionally, the second obtaining module includes:
the second acquisition unit is used for acquiring an original voice file without labels, and carrying out vector coding processing on the original voice file based on a preset attention mechanism to obtain preset original voice data without labels.
Optionally, the generating module includes:
a generation unit configured to generate unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice fragment data;
and the replacing unit is used for partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data.
Optionally, the voice data recognition device further includes:
and the updating module is used for updating the preset recognition model by the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The specific implementation manner of the voice data recognition device of the present application is basically the same as the above embodiments of the voice data recognition method, and will not be described herein again.
Embodiments of the present application provide a medium, and the medium stores one or more programs, which may be further executed by one or more processors to implement the steps of the method for recognizing speech data described in any of the above.
The specific implementation manner of the medium of the present application is basically the same as that of each embodiment of the voice data recognition method, and will not be repeated here.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, within the scope of the application.

Claims (8)

1. A voice data recognition method, characterized in that the voice data recognition method comprises:
acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual annotation voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated label data, and are not manually labeled, and the simulated label data are data with labels obtained by converting the preset non-label original voice data; the simulated tag data is obtained by partially replacing preset untagged original voice data with generated untagged voice data, and at least comprises true and false simulated tag data; or the simulated tag data is obtained by partially deleting frame data from preset non-tag original voice data, and the simulated tag data at least comprises data of a complete simulated tag and non-complete simulated tag;
the step of obtaining the original voice data without the preset label comprises the following steps:
acquiring an original voice file without a tag, and performing vector coding processing on the original voice file based on a preset attention mechanism to obtain original voice data without the tag;
the method for obtaining the simulated tag data comprises the following steps: acquiring original voice data without a preset tag; generating unlabeled voice data, and partially replacing the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data; or randomly deleting frame data in the preset untagged original voice data, or randomly adding frame data in the preset untagged original voice data to obtain simulated tag data;
and carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
2. The method for recognizing voice data according to claim 1, wherein before the step of extracting features from the voice data to be recognized based on the preset recognition model to obtain a classification result, the method further comprises:
training a preset training model based on the simulated tag data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
3. The method for recognizing voice data according to claim 2, wherein the step of training a preset training model based on the simulated tag data to obtain a target model satisfying a preset condition, and setting the target model as the preset pre-training model comprises:
determining simulated false tag data and simulated true tag data in the simulated tag data;
inputting the simulated false tag data and the simulated true tag data into a preset training model to obtain a recognition result;
and adjusting model parameters of the preset training model based on the identification result and the true and false simulation labels in the simulation label data until a target model meeting preset conditions is obtained, and setting the target model as the preset pre-training model.
4. The method for recognizing voice data according to claim 1, wherein the step of generating unlabeled voice data, partially replacing the preset unlabeled original voice data with the unlabeled voice data to obtain analog label data, comprises:
generating non-tag voice data, wherein the non-tag voice data comprises non-tag random voice frame data or non-tag random voice fragment data;
and partially replacing the preset untagged original voice data with the untagged voice data to obtain simulated tag data.
5. The method for recognizing voice data according to any one of claims 1 to 4, wherein after the step of performing feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result, the method comprises:
and updating the preset recognition model by the voice data to be recognized and the classification result to obtain an updated preset recognition model.
6. A voice data recognition apparatus, characterized in that the voice data recognition apparatus comprises:
the first acquisition module is used for acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized based on non-manual annotation voice data;
the non-manual labeling voice data are obtained based on a preset pre-training model optimized by simulated label data, and are not manually labeled, and the simulated label data are data with labels obtained by converting the preset non-label original voice data; the simulated tag data is obtained by partially replacing preset untagged original voice data with generated untagged voice data, and at least comprises true and false simulated tag data; or the simulated tag data is obtained by partially deleting frame data from preset non-tag original voice data, and the simulated tag data at least comprises data of a complete simulated tag and non-complete simulated tag;
the voice data recognition device is used for realizing:
acquiring an original voice file without a tag, and performing vector coding processing on the original voice file based on a preset attention mechanism to obtain original voice data without the tag;
the method for obtaining the simulated tag data comprises the following steps: acquiring original voice data without a preset tag; generating unlabeled voice data, and partially replacing the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data; or randomly deleting frame data in the preset untagged original voice data, or randomly adding frame data in the preset untagged original voice data to obtain simulated tag data;
and the classification module is used for carrying out feature extraction processing on the voice data to be recognized based on the preset recognition model to obtain a classification result.
7. A voice data recognition apparatus, characterized in that the voice data recognition apparatus comprises: a memory, a processor and a program stored on the memory for implementing the voice data recognition method,
the memory is used for storing a program for realizing the voice data recognition method;
the processor is configured to execute a program for implementing the voice data recognition method to implement the steps of the voice data recognition method according to any one of claims 1 to 5.
8. A medium, characterized in that a program for realizing a voice data recognition method is stored on the medium, the program for realizing the voice data recognition method being executed by a processor to realize the steps of the voice data recognition method according to any one of claims 1 to 5.
CN202010417958.6A 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium Active CN111554277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010417958.6A CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010417958.6A CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111554277A CN111554277A (en) 2020-08-18
CN111554277B true CN111554277B (en) 2023-11-03

Family

ID=72001925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010417958.6A Active CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111554277B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309375B (en) * 2020-10-28 2024-02-23 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium for voice recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110287283A (en) * 2019-05-22 2019-09-27 中国平安财产保险股份有限公司 Intent model training method, intension recognizing method, device, equipment and medium
US10467792B1 (en) * 2017-08-24 2019-11-05 Amazon Technologies, Inc. Simulating communication expressions using virtual objects

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101581816B1 (en) * 2014-10-14 2016-01-05 서강대학교산학협력단 Voice recognition method using machine learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
US10467792B1 (en) * 2017-08-24 2019-11-05 Amazon Technologies, Inc. Simulating communication expressions using virtual objects
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110287283A (en) * 2019-05-22 2019-09-27 中国平安财产保险股份有限公司 Intent model training method, intension recognizing method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
面向多标签文本分类的深度主题特征提取 (Deep topic feature extraction for multi-label text classification); 陈文实 et al.; 模式识别与人工智能 (Pattern Recognition and Artificial Intelligence); full text *

Also Published As

Publication number Publication date
CN111554277A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN108090400B (en) Image text recognition method and device
CN109635805B (en) Image text positioning method and device and image text identification method and device
CN114596566B (en) Text recognition method and related device
CN112686243A (en) Method and device for intelligently identifying picture characters, computer equipment and storage medium
CN114495102A (en) Text recognition method, and training method and device of text recognition network
CN110399760A (en) A kind of batch two dimensional code localization method, device, electronic equipment and storage medium
CN116070632A (en) Informal text entity tag identification method and device
CN111554277B (en) Voice data recognition method, device, equipment and medium
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN113469148B (en) Text erasing method, model training method, device and storage medium
Tymoshenko et al. Real-Time Ukrainian Text Recognition and Voicing.
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN111613219B (en) Voice data recognition method, equipment and medium
CN114022891A (en) Method, device and equipment for extracting key information of scanned text and storage medium
CN105095889B (en) Feature extraction, character recognition, engine generates, information determines method and device
KR102394314B1 (en) Appratus and method for optical character recognition of traditional documents
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115700519A (en) Text-to-image generation method and device, storage medium and terminal
CN112396111A (en) Text intention classification method and device, computer equipment and storage medium
CN117786705B (en) Statement-level vulnerability detection method and system based on heterogeneous graph transformation network
CN113469878B (en) Text erasing method and training method and device of model thereof, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant