CN111554277A - Voice data recognition method, device, equipment and medium

Voice data recognition method, device, equipment and medium

Info

Publication number
CN111554277A
CN111554277A (application CN202010417958.6A)
Authority
CN
China
Prior art keywords
data
preset
voice data
tag
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010417958.6A
Other languages
Chinese (zh)
Other versions
CN111554277B (en)
Inventor
宋元峰 (Song Yuanfeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010417958.6A, granted as CN111554277B
Publication of CN111554277A
Application granted
Publication of CN111554277B
Legal status: Active (current)
Anticipated expiration legal status


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Abstract

The application discloses a voice data recognition method, device, equipment and medium. The method includes: acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data, where the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data; and performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result. The method and device solve the technical problem in the prior art that speaker voice separation is inefficient because a large amount of training data must be labeled manually.

Description

Voice data recognition method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence in financial technology (Fintech), and in particular to a voice data recognition method, device, equipment, and medium.
Background
With the continuous development of financial technology, especially internet finance, more and more technologies (such as distributed computing, blockchain, and artificial intelligence) are being applied in the financial field. The financial industry, in turn, places higher requirements on these technologies; for example, it places higher requirements on voice data recognition.
With the spread of mobile devices, voice has become an everyday communication medium. Speaker voice separation, which cuts the voice segments of different speakers out of a voice stream and then determines which speaker each segment belongs to, is therefore increasingly important. At present, however, accurately separating speakers' voices requires training on a large amount of labeled data; that is, in the prior art a large amount of training data must be labeled manually, which wastes resources and reduces the efficiency of speaker voice separation.
Disclosure of Invention
The present application mainly aims to provide a voice data recognition method, apparatus, device, and medium, so as to solve the technical problem that speaker voice separation in the prior art is inefficient because a large amount of training data must be labeled manually.
In order to achieve the above object, the present application provides a voice data recognition method, including:
acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data;
the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data;
and performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
Optionally, the simulated label data is obtained by replacing part of the preset unlabeled original voice data with generated unlabeled voice data, and the simulated label data includes at least data carrying a true simulated label and data carrying a false simulated label.
Optionally, before the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method further includes:
acquiring preset unlabeled original voice data;
generating unlabeled voice data, and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
and training a preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
Optionally, training the preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model, includes:
determining simulated false label data and simulated true label data in the simulated label data;
inputting the simulated false label data and the simulated true label data into the preset pre-training model to obtain a recognition result;
and adjusting model parameters of the preset pre-training model based on the recognition result and the known true and false simulated labels in the simulated label data until a target model meeting the preset conditions is obtained, and setting the target model as the preset pre-training model.
Optionally, the step of acquiring the preset unlabeled original voice data includes:
acquiring an unlabeled original voice file, and performing vector encoding based on a preset attention mechanism on the original voice file to obtain the preset unlabeled original voice data.
Optionally, generating the unlabeled voice data and replacing part of the preset unlabeled original voice data with it to obtain the simulated label data includes:
generating unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data;
and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data.
Optionally, after the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method includes:
updating the preset recognition model with the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The present application also provides a voice data recognition apparatus, including:
a first acquisition module, configured to acquire voice data to be recognized and input the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data,
where the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data;
and a classification module, configured to perform feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
Optionally, the simulated label data is obtained by replacing part of the preset unlabeled original voice data with generated unlabeled voice data, and the simulated label data includes at least data carrying a true simulated label and data carrying a false simulated label.
Optionally, the voice data recognition apparatus further includes:
a second acquisition module, configured to acquire preset unlabeled original voice data;
a generating module, configured to generate unlabeled voice data and replace part of the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
and a third acquisition module, configured to train a preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and set the target model as the preset pre-training model.
Optionally, the third acquisition module includes:
a determining unit, configured to determine simulated false label data and simulated true label data in the simulated label data;
a first acquisition unit, configured to input the simulated false label data and the simulated true label data into the preset pre-training model to obtain a recognition result;
and an adjusting unit, configured to adjust model parameters of the preset pre-training model based on the recognition result and the known true and false simulated labels in the simulated label data until a target model meeting the preset conditions is obtained, and set the target model as the preset pre-training model.
Optionally, the second acquisition module includes:
a second acquisition unit, configured to acquire an unlabeled original voice file and perform vector encoding based on a preset attention mechanism on the original voice file to obtain the preset unlabeled original voice data.
Optionally, the generating module includes:
a generating unit, configured to generate unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data;
and a replacing unit, configured to replace part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data.
Optionally, the voice data recognition apparatus further includes:
an updating module, configured to update the preset recognition model with the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The present application further provides a voice data recognition device. The voice data recognition device is a physical device and includes a memory, a processor, and a program of the voice data recognition method stored on the memory and executable on the processor; when executed by the processor, the program implements the steps of the voice data recognition method described above.
The present application also provides a medium storing a program for implementing the above voice data recognition method; when executed by a processor, the program implements the steps of the voice data recognition method described above.
In the present application, voice data to be recognized is acquired and input into a preset recognition model optimized with non-artificially labeled voice data; the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data; feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result. Because the non-artificially labeled voice data is produced by a preset pre-training model optimized with simulated label data rather than by manual labeling, and because the simulated label data is labeled data converted from preset unlabeled original voice data, the preset pre-training model can be optimized accurately to label data, which improves the training efficiency of the preset recognition model as a whole. Feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result, improving the efficiency of speaker voice separation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating a first embodiment of a speech data recognition method according to the present application;
fig. 2 is a detailed flowchart of the steps performed, in the first embodiment of the voice data recognition method of the present application, before the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result;
FIG. 3 is a schematic diagram of an apparatus configuration of a hardware operating environment according to an embodiment of the present application;
fig. 4 is a schematic diagram of a first scenario in the speech data recognition method of the present application.
The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a first embodiment of the speech data recognition method of the present application, referring to fig. 1, the speech data recognition method includes:
step S10, acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data;
the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data;
and step S20, performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
The specific steps are as follows:
step S10, acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data;
the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data;
in this embodiment, a speaker voice separation technique is provided, which is based on accurately obtaining a preset recognition model, and needs a large amount of labeled training data in order to accurately obtain the preset recognition model, in the prior art, the training data needs to be labeled manually, and the training data is time-consuming and labor-consuming and has poor expansibility, in this embodiment, in order to avoid the time-consuming and labor-consuming process caused by the manual labeling of the training data and the poor expansibility, which causes the technical problem that the preset recognition model is difficult to obtain quickly and accurately, or in order to achieve the technical effect of improving the training efficiency of the preset recognition model, and further, improving the separation efficiency of speaker voice separation, the voice data is labeled by using the preset training model, because the voice data is labeled by using the preset training model and not labeled manually, therefore, the expansibility of the labeling process is improved, and time and labor consumption caused by labeling is avoided.
It should be noted that the preset pre-training model is trained; specifically, it is obtained from simulated label data, and the simulated label data is converted from preset unlabeled original voice data. When the preset pre-training model has been trained sufficiently on such data, it has learned a hidden representation of the original voice data, and this hidden representation internally contains information such as the speaker's identity; that is, after sufficient training the preset pre-training model can label data accurately. Note that, in order to learn this hidden representation, the original voice data is encoded with multiple features; for example, each piece of original voice data is encoded as (1, 0, 1, 0) or (1, 0, 1, 0, 1, 0). Although each piece of original voice data has multiple encoded features, in order to obtain the simulated label data each piece may include at least a first encoding feature 1 indicating that the voice data is real rather than synthesized, or a first encoding feature 1 indicating that the number of frames contained in the voice data equals a preset number.
It should be noted that a labeling layer formed by the preset pre-training model may be embedded in the preset recognition model, so that the non-artificially labeled voice data used for training is obtained inside the preset recognition model and training then proceeds on it.
It should be noted that the simulated label data is obtained by replacing part of the preset unlabeled original voice data with generated unlabeled voice data, and the simulated label data includes at least data carrying a true simulated label and data carrying a false simulated label.
Alternatively, the simulated label data is obtained by deleting frame data from the preset unlabeled original voice data, in which case the simulated label data includes at least data carrying a complete simulated label and data carrying an incomplete simulated label.
In this embodiment, the encoding features of each piece of original voice data may further include at least an encoding feature of the speaker's voice, so that voice data labeled with the speaker's voice characteristics can be obtained.
In this embodiment, the case where a labeling layer formed by the preset pre-training model is embedded in the preset recognition model is used for the specific description.
Step S20, performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
In this embodiment, after the preset recognition model has been accurately obtained from the labeled voice data and trained sufficiently, feature extraction can be performed on the voice data to be recognized with the preset recognition model to obtain a classification result, where the classification result includes the voice segments of the different speakers and the speaker to which each segment belongs.
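The application does not spell out how the classification result is computed from the extracted features; one common way to turn per-segment features into a speaker assignment is to cluster the segment embeddings. The following is a minimal sketch under that assumption; the function name, the use of k-means, and the known speaker count are illustrative choices, not taken from the application:

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_speakers(segment_embeddings, num_speakers):
    """Cluster the per-segment embeddings produced by the preset
    recognition model and return a speaker index for each segment."""
    km = KMeans(n_clusters=num_speakers, n_init=10, random_state=0)
    return km.fit_predict(np.asarray(segment_embeddings))

# Toy usage: 4 segments, 2 speakers
emb = np.array([[0.9, 0.1], [0.88, 0.12], [0.1, 0.95], [0.12, 0.9]])
print(separate_speakers(emb, num_speakers=2))  # e.g. [0 0 1 1]
```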
Before the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method further includes:
step S11, acquiring preset unlabeled original voice data;
In this embodiment, the preset unlabeled original voice data is acquired first. To ensure the training effect, the amount of original voice data is greater than a preset number, and the original voice data is in voice form. The original voice data may be real, non-synthesized, non-generated voice data so that simulated label data can be generated; non-synthesized, non-generated voice data refers to collected human speech rather than voice data fitted by a machine. Note that each piece of voice data includes a plurality of voice files, such as z1, z2, z3, and z4 in fig. 4, and each voice file includes multiple frames of voice data, such as X1, X2, X3, and X4 in fig. 4.
Alternatively, each piece of voice data includes a determined number of voice files, such as z1, z2, z3, and z4 in fig. 4, and each voice file includes a determined number of frames, such as X1, X2, X3, and X4 in fig. 4, so that simulated label data can be generated.
In this embodiment, the specific content of the original voice data is not limited, and the original data carries no labels; that is, the preset pre-training model is obtained through unsupervised training.
step S12, generating unlabeled voice data, and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
After the original voice data is acquired, unlabeled voice data is generated, specifically through a preset generator (the generator in fig. 4); that is, the unlabeled voice data is obtained by machine fitting, and this generated or synthesized data is what will carry the false simulated labels. Note that the unlabeled voice data may have the length of one frame or of one segment; this is not specifically limited.
This embodiment further provides another way of obtaining the simulated label data: frame data in the preset unlabeled original voice data is randomly deleted, or frame data is randomly added to it, to obtain the simulated label data.
step S13, training the preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
The preset pre-training model is trained on the simulated label data to obtain a target model meeting the preset conditions, and the target model is set as the preset pre-training model; that is, the preset pre-training model is trained on both the replaced preset unlabeled original voice data and the non-replaced preset unlabeled original voice data. Alternatively, the preset pre-training model may be trained on the preset unlabeled original voice data from which frame data has been deleted or to which frame data has been added, together with the unmodified preset unlabeled original voice data.
The step of training the preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model, includes:
step A1, determining simulated false label data and simulated true label data in the simulated label data;
In this embodiment, the replaced original voice data is the known simulated false label data, and the other, non-replaced original voice data is the known simulated true label data; that is, the known simulated false label data and the known simulated true label data together constitute the simulated label data.
Alternatively, in this embodiment, the original voice data from which frame data has been deleted or to which frame data has been added is the known simulated false label data, and the other, unprocessed original voice data is the known simulated true label data; together they constitute the simulated label data.
step A2, inputting the simulated false label data and the simulated true label data into the preset pre-training model to obtain a recognition result;
step A3, adjusting model parameters of the preset pre-training model based on the recognition result and the known true and false simulated labels in the simulated label data until a target model meeting the preset conditions is obtained, and setting the target model as the preset pre-training model.
After the known simulated false label data and simulated true label data are obtained, they are input into the preset pre-training model to train it. Specifically, the recognition result the model produces when predicting the simulated false and true label data is obtained: as shown in fig. 4, the model predicts which parts of the original speech data are original (not substituted or otherwise processed) and which are replaced (substituted, deleted, and the like). Since the simulated false and true label data are known, that is, since it is known which parts are original and which are replaced, the recognition result can be compared with the known result and the error between them determined. Once the error is determined, the model parameters of the preset pre-training model are adjusted accordingly until a target model meeting the preset conditions is obtained, and the target model is set as the preset pre-training model.
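Under the assumption that the preset pre-training model scores each frame as original or replaced (the application fixes no architecture), one training step might look like this sketch; the GRU-based model and the binary cross-entropy loss are illustrative choices:

```python
import torch
import torch.nn as nn

class FramePretrainModel(nn.Module):
    """Predicts, per frame, whether the frame is original (true simulated
    label) or replaced/processed (false simulated label)."""
    def __init__(self, frame_dim=12, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):              # frames: (batch, T, frame_dim)
        h, _ = self.rnn(frames)
        return self.head(h).squeeze(-1)     # (batch, T) original/replaced logits

def pretrain_step(model, optimizer, frames, is_replaced):
    """Compare the recognition result with the known true/false simulated
    labels, determine the error, and adjust the model parameters."""
    logits = model(frames)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, is_replaced.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage
model = FramePretrainModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 20, 12)                  # 4 utterances, 20 frames each
y = (torch.rand(4, 20) < 0.1)               # known replaced-frame mask
pretrain_step(model, opt, x, y)
```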
It should be noted that, because the preset unlabeled original speech data is encoded with multiple features, other implicit representations of the data can also be learned while recognizing whether it is true or false.
After the preset pre-training model is obtained, non-artificially labeled voice data is obtained with it, and the preset recognition model is then obtained from that data. Specifically, a basic model is trained on the non-artificially labeled voice data until it meets certain preset conditions, and the model meeting those conditions is set as the preset recognition model. The preset conditions may be that the preset loss function of the basic model converges or that training of the basic model has reached a preset number of iterations.
In the present application, voice data to be recognized is acquired and input into a preset recognition model optimized with non-artificially labeled voice data; the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data; feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result. Because the non-artificially labeled voice data is produced by a preset pre-training model optimized with simulated label data rather than by manual labeling, and because the simulated label data is labeled data converted from preset unlabeled original voice data, the preset pre-training model can be optimized accurately to label data, which improves the training efficiency of the preset recognition model as a whole. Feature extraction is then performed on the voice data to be recognized with the preset recognition model to obtain a classification result, improving the efficiency of speaker voice separation.
Further, based on the first embodiment of the present application, in another embodiment of the present application, the step of adjusting the model parameters of the preset pre-training model based on the recognition result until a target model meeting a preset condition is obtained, and setting the target model as the preset pre-training model, includes:
adjusting model parameters of the preset pre-training model based on the recognition result until a target model meeting the preset condition is obtained, and setting the target model as the preset pre-training model;
where the preset condition includes the number of training iterations reaching a preset number or the preset loss function converging.
In this embodiment, the specific content of the preset condition is determined: the number of training iterations reaches a preset number, or the preset loss function converges. The preset number may be, for example, 10,000. Accurately determining a target model that meets the preset condition lays a foundation for accurately obtaining the preset recognition model.
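A minimal sketch of this stopping check follows; the parameter names and the convergence tolerance are assumptions, and only the two conditions themselves (preset iteration count, loss convergence) come from the text:

```python
def meets_preset_condition(iteration, loss_history,
                           max_iterations=10_000, tol=1e-4, window=5):
    """True when training has reached the preset number of iterations
    or the preset loss function has converged (recent losses flat)."""
    if iteration >= max_iterations:
        return True
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        if max(recent) - min(recent) < tol:
            return True
    return False
```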
Further, in another embodiment of the present application, generating the unlabeled voice data and replacing part of the preset unlabeled original voice data with it to obtain the simulated label data includes:
step B1, generating unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data;
and step B2, replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data.
In this embodiment, the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data, and replacing part of the preset unlabeled original voice data with it proceeds as follows: after the unlabeled random voice frame data is generated, several pieces of preset unlabeled original voice data are selected and at least one frame of each piece is replaced with the unlabeled random voice frame data to obtain the simulated label data; or, after the unlabeled random voice segment data is generated, several pieces of preset unlabeled original voice data are selected and at least one segment of each piece is replaced with the unlabeled random voice segment data to obtain the simulated label data. Each piece of preset unlabeled original voice data may include several voice segments, and each voice segment may include multiple frames, as in the sketch that follows.
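A sketch of the replacement itself, at frame or segment granularity; `generator` is any callable mapping a frame count to generated frames (for instance the FrameGenerator sketched earlier), and all names and defaults are assumptions:

```python
import torch

def make_simulated_label_data(utterance, generator, num_frames=1, segment_len=None):
    """Replace part of one piece of preset unlabeled original voice data
    with generated unlabeled voice data. Replaced positions take the
    false simulated label (1); untouched positions keep the true simulated
    label (0). Frame-level when segment_len is None, segment-level otherwise."""
    frames = utterance.clone()               # (T, frame_dim)
    labels = torch.zeros(frames.shape[0])
    if segment_len is None:                  # replace individual random frames
        idx = torch.randperm(frames.shape[0])[:num_frames]
        frames[idx] = generator(len(idx))
        labels[idx] = 1.0
    else:                                    # replace one contiguous segment
        start = torch.randint(0, frames.shape[0] - segment_len + 1, (1,)).item()
        frames[start:start + segment_len] = generator(segment_len)
        labels[start:start + segment_len] = 1.0
    return frames, labels

# Toy usage with a stand-in generator
frames, labels = make_simulated_label_data(
    torch.randn(20, 12), generator=lambda n: torch.randn(n, 12))
```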
In addition, in this embodiment, after the unlabeled random voice frame data is generated, several pieces of preset unlabeled original voice data may be selected and one frame of the unlabeled random voice frame data added to each of them to obtain the simulated label data, or at least two frames of the unlabeled random voice frame data may be added to each piece to obtain the simulated label data.
Alternatively, in this embodiment, several pieces of preset unlabeled original voice data are selected and one frame is removed from each of them to obtain the simulated label data, or at least two frames are removed from each piece to obtain the simulated label data.
In this embodiment, unlabeled voice data is generated, the unlabeled voice data including unlabeled random voice frame data or unlabeled random voice segment data, and part of the preset unlabeled original voice data is replaced with it to obtain the simulated label data. Because the simulated label data is obtained accurately, a foundation is laid for training the preset recognition model.
In another embodiment of the present application, the step of acquiring the preset unlabeled original voice data includes:
step C1, acquiring an unlabeled original voice file, and performing vector encoding based on a preset attention mechanism on the original voice file to obtain the unlabeled original voice data.
In this embodiment, how the unlabeled original voice data is obtained is specified. First, an unlabeled original voice file is acquired; then vector encoding based on a preset attention mechanism is performed on it (so that not every feature is encoded, which improves efficiency). An attention mechanism selectively attends to part of the available information while ignoring the rest: to make reasonable use of limited information-processing resources, a specific portion is selected and focused on, much as a person reading attends to and processes only a small number of the words on the page. The attention mechanism thus has two main aspects: deciding which part of the input needs attention, and allocating the limited processing resources to that important part. In this embodiment, the attention mechanism is introduced to process the original voice file with a direction of attention, focusing on the information features of the different speakers, to obtain the unlabeled original voice data. As a concrete example: where an existing acoustic feature extraction process would produce from the original voice data a matrix (voice feature data) of M rows, say 12 rows (assuming 12-dimensional acoustic features), and N columns, N being the total number of frames, with vectors of different magnitudes in each dimension, introducing the attention mechanism can yield a matrix of only 10 rows and N columns; that is, the unlabeled original voice data contains at least the vectors or feature encodings of the information features of the different speakers.
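A sketch of such an attention-based vector encoding, reducing a 12-row feature matrix to 10 rows as in the example above; the layer layout and the softmax weighting are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AttentionEncoder(nn.Module):
    """Attends over the acoustic feature dimensions so that only the
    informative part of the input is encoded (12 -> 10 dimensions here,
    matching the example in the text)."""
    def __init__(self, in_dim=12, out_dim=10):
        super().__init__()
        self.score = nn.Linear(in_dim, in_dim)  # per-dimension attention scores
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats):                   # feats: (N_frames, in_dim)
        weights = torch.softmax(self.score(feats), dim=-1)
        return self.proj(weights * feats)       # (N_frames, out_dim)

encoded = AttentionEncoder()(torch.randn(100, 12))  # N = 100 frames -> (100, 10)
```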
In this embodiment, an unlabeled original voice file is acquired, and vector encoding based on a preset attention mechanism is performed on it to obtain the unlabeled original voice data, laying a foundation for accurate prediction.
In another embodiment of the voice data recognition method, after the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method includes:
step D1, updating the preset recognition model with the voice data to be recognized and the classification result to obtain an updated preset recognition model.
In this embodiment, after the voice data to be recognized and its classification result are obtained, they are used as labeled first training data to retrain the preset recognition model and thereby update it. To avoid updating the preset recognition model too frequently, the model is updated at a preset time interval, such as once a month or once a week: the voice data to be recognized processed by the preset recognition model during that month or week, together with the corresponding classification results, is collected and input into the preset recognition model as labeled data. Alternatively, when the number of pairs of voice data to be recognized and corresponding classification results reaches a preset number, those pairs are input into the preset recognition model as labeled data to update it.
In short, in this embodiment, after the voice data to be recognized and the classification result are obtained, they are used as labeled first training data to retrain the preset recognition model and so update it; a sketch of such a buffered update follows.
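The sketch gathers (voice data, classification result) pairs until a preset count is reached and only then retrains; the names, the threshold, and the cross-entropy objective are assumptions, not taken from the application:

```python
import torch
import torch.nn as nn

def maybe_update_model(model, optimizer, buffer, preset_count=1000):
    """Retrain the preset recognition model on its own accumulated
    (features, classification result) pairs once enough have been
    collected, then clear the buffer. Returns True if an update ran."""
    if len(buffer) < preset_count:
        return False
    feats = torch.stack([f for f, _ in buffer])      # (N, ...) features
    labels = torch.tensor([y for _, y in buffer])    # (N,) class indices
    loss = nn.functional.cross_entropy(model(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.clear()
    return True
```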
Referring to fig. 3, fig. 3 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
As shown in fig. 3, the voice data recognition apparatus may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Optionally, the voice data recognition device may further include a user interface, a network interface, a camera, RF (radio frequency) circuits, a sensor, an audio circuit, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
Those skilled in the art will appreciate that the voice data recognition device architecture shown in FIG. 3 is not intended to be limiting of the voice data recognition device and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 3, a memory 1005, which is a kind of computer medium, may include therein an operating system, a network communication module, and a voice data recognition program. The operating system is a program that manages and controls the hardware and software resources of the voice data recognition device, supporting the operation of the voice data recognition program as well as other software and/or programs. The network communication module is used to enable communication between the various components within the memory 1005, as well as with other hardware and software in the voice data recognition system.
In the speech data recognition apparatus shown in fig. 3, the processor 1001 is configured to execute a speech data recognition program stored in the memory 1005 to implement the steps of the speech data recognition method according to any one of the above.
The specific implementation of the speech data recognition device of the present application is substantially the same as the embodiments of the speech data recognition method, and is not described herein again.
The present application also provides a voice data recognition apparatus, including:
a first acquisition module, configured to acquire voice data to be recognized and input the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data,
where the non-artificially labeled voice data is obtained from a preset pre-training model optimized with simulated label data, and the simulated label data is converted from preset unlabeled original voice data;
and a classification module, configured to perform feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
Optionally, the simulated label data is obtained by replacing part of the preset unlabeled original voice data with generated unlabeled voice data, and the simulated label data includes at least data carrying a true simulated label and data carrying a false simulated label.
Optionally, the voice data recognition apparatus further includes:
a second acquisition module, configured to acquire preset unlabeled original voice data;
a generating module, configured to generate unlabeled voice data and replace part of the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
and a third acquisition module, configured to train a preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and set the target model as the preset pre-training model.
Optionally, the third acquisition module includes:
a determining unit, configured to determine simulated false label data and simulated true label data in the simulated label data;
a first acquisition unit, configured to input the simulated false label data and the simulated true label data into the preset pre-training model to obtain a recognition result;
and an adjusting unit, configured to adjust model parameters of the preset pre-training model based on the recognition result and the known true and false simulated labels in the simulated label data until a target model meeting the preset conditions is obtained, and set the target model as the preset pre-training model.
Optionally, the second acquisition module includes:
a second acquisition unit, configured to acquire an unlabeled original voice file and perform vector encoding based on a preset attention mechanism on the original voice file to obtain the preset unlabeled original voice data.
Optionally, the generating module includes:
a generating unit, configured to generate unlabeled voice data, where the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data;
and a replacing unit, configured to replace part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data.
Optionally, the voice data recognition apparatus further includes:
an updating module, configured to update the preset recognition model with the voice data to be recognized and the classification result to obtain an updated preset recognition model.
The specific implementation of the speech data recognition apparatus of the present application is substantially the same as the embodiments of the speech data recognition method, and is not described herein again.
The present embodiment provides a medium, and the medium stores one or more programs, which may also be executed by one or more processors for implementing the steps of the voice data recognition method described in any one of the above.
The specific implementation of the medium of the present application is substantially the same as that of each embodiment of the voice data recognition method, and is not described herein again.
The above description covers only preferred embodiments of the present application and does not limit its scope; all equivalent structural or process modifications made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of patent protection of the present application.

Claims (10)

1. A voice data recognition method, characterized in that the voice data recognition method comprises:
acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data;
the non-artificially labeled voice data being obtained from a preset pre-training model optimized with simulated label data, and the simulated label data being converted from preset unlabeled original voice data;
and performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
2. The voice data recognition method according to claim 1, wherein the simulated label data is obtained by replacing part of the preset unlabeled original voice data with generated unlabeled voice data, and the simulated label data includes at least data carrying a true simulated label and data carrying a false simulated label.
3. The voice data recognition method according to claim 2, wherein before the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method further comprises:
acquiring preset unlabeled original voice data;
generating unlabeled voice data, and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain simulated label data;
and training a preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model.
4. The voice data recognition method according to claim 2, wherein the step of training the preset pre-training model on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model, comprises:
determining simulated false label data and simulated true label data in the simulated label data;
inputting the simulated false label data and the simulated true label data into the preset pre-training model to obtain a recognition result;
and adjusting model parameters of the preset pre-training model based on the recognition result and the known true and false simulated labels in the simulated label data until a target model meeting the preset conditions is obtained, and setting the target model as the preset pre-training model.
5. The voice data recognition method according to claim 2, wherein the step of acquiring the preset unlabeled original voice data comprises:
acquiring an unlabeled original voice file, and performing vector encoding based on a preset attention mechanism on the original voice file to obtain the preset unlabeled original voice data.
6. The voice data recognition method according to claim 5, wherein the step of generating the unlabeled voice data and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data comprises:
generating unlabeled voice data, wherein the unlabeled voice data includes unlabeled random voice frame data or unlabeled random voice segment data;
and replacing part of the preset unlabeled original voice data with the unlabeled voice data to obtain the simulated label data.
7. The voice data recognition method according to any one of claims 1 to 6, wherein after the step of performing feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result, the method comprises:
updating the preset recognition model with the voice data to be recognized and the classification result to obtain an updated preset recognition model.
8. A voice data recognition apparatus, characterized in that the voice data recognition apparatus comprises:
a first acquisition module, configured to acquire voice data to be recognized and input the voice data to be recognized into a preset recognition model optimized with non-artificially labeled voice data,
the non-artificially labeled voice data being obtained from a preset pre-training model optimized with simulated label data, and the simulated label data being converted from preset unlabeled original voice data;
and a classification module, configured to perform feature extraction on the voice data to be recognized with the preset recognition model to obtain a classification result.
9. A voice data recognition device, characterized by comprising: a memory, a processor, and a program stored on the memory for implementing the voice data recognition method,
wherein the memory is used for storing the program for implementing the voice data recognition method;
and the processor is configured to execute the program to implement the steps of the voice data recognition method according to any one of claims 1 to 7.
10. A medium having stored thereon a program for implementing a voice data recognition method, wherein the program, when executed by a processor, implements the steps of the voice data recognition method according to any one of claims 1 to 7.
CN202010417958.6A 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium Active CN111554277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010417958.6A CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010417958.6A CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111554277A true CN111554277A (en) 2020-08-18
CN111554277B CN111554277B (en) 2023-11-03

Family

ID=72001925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010417958.6A Active CN111554277B (en) 2020-05-15 2020-05-15 Voice data recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111554277B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309375A (en) * 2020-10-28 2021-02-02 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium of voice recognition model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160104478A1 (en) * 2014-10-14 2016-04-14 Sogang University Research Foundation Voice recognition method using machine learning
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110287283A (en) * 2019-05-22 2019-09-27 中国平安财产保险股份有限公司 Intent model training method, intension recognizing method, device, equipment and medium
US10467792B1 (en) * 2017-08-24 2019-11-05 Amazon Technologies, Inc. Simulating communication expressions using virtual objects

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160104478A1 (en) * 2014-10-14 2016-04-14 Sogang University Research Foundation Voice recognition method using machine learning
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
US10467792B1 (en) * 2017-08-24 2019-11-05 Amazon Technologies, Inc. Simulating communication expressions using virtual objects
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110287283A (en) * 2019-05-22 2019-09-27 中国平安财产保险股份有限公司 Intent model training method, intension recognizing method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈文实 (Chen Wenshi) et al., "Deep topic feature extraction for multi-label text classification" (面向多标签文本分类的深度主题特征提取), Pattern Recognition and Artificial Intelligence (模式识别与人工智能)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309375A (en) * 2020-10-28 2021-02-02 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium of voice recognition model
CN112309375B (en) * 2020-10-28 2024-02-23 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium for voice recognition model

Also Published As

Publication number Publication date
CN111554277B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN111061867B (en) Text generation method, equipment, storage medium and device based on quality perception
CN114495102A (en) Text recognition method, and training method and device of text recognition network
CN110399760A (en) A kind of batch two dimensional code localization method, device, electronic equipment and storage medium
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN112804558A (en) Video splitting method, device and equipment
CN116070632A (en) Informal text entity tag identification method and device
CN113850012B (en) Data processing model generation method, device, medium and electronic equipment
CN111554277B (en) Voice data recognition method, device, equipment and medium
CN107766498A (en) Method and apparatus for generating information
CN111613219A (en) Voice data recognition method, apparatus and medium
US20210374490A1 (en) Method and apparatus of processing image, device and medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN114022891A (en) Method, device and equipment for extracting key information of scanned text and storage medium
CN117493486A (en) Sustainable financial event extraction system and method based on data replay
CN115221452B (en) Portal construction method, system, electronic equipment and medium based on visual configuration
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
CN113240071B (en) Method and device for processing graph neural network, computer equipment and storage medium
CN115700519A (en) Text-to-image generation method and device, storage medium and terminal
CN114066788A (en) Balanced instance segmentation data synthesis method
CN113434695A (en) Financial event extraction method and device, electronic equipment and storage medium
CN117649358B (en) Image processing method, device, equipment and storage medium
CN117558296B (en) Determination method and device for target audio recognition model and computing equipment
CN114519404B (en) Image sample classification labeling method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant