CN110992938A - Voice data processing method and device, electronic equipment and computer readable medium - Google Patents

Voice data processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN110992938A
CN110992938A
Authority
CN
China
Prior art keywords
data
voice
score
voice data
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911261757.5A
Other languages
Chinese (zh)
Inventor
黄嘉晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongdun Holdings Co Ltd
Original Assignee
Tongdun Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongdun Holdings Co Ltd filed Critical Tongdun Holdings Co Ltd
Priority to CN201911261757.5A
Publication of CN110992938A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The embodiments of the disclosure provide a voice data processing method and apparatus, an electronic device, and a computer-readable medium. The method comprises: acquiring voice data; processing the voice data through a speech recognition model to obtain at least one piece of decoded data of the voice data and a first score of each piece of decoded data; determining target decoded data of the voice data according to the first scores of the at least one piece of decoded data; and generating a target training sample set according to the voice data and the target decoded data. With the method, apparatus, electronic device, and computer-readable medium, the target decoded data of the voice data can be obtained from the at least one piece of decoded data and its first score, and an expanded target training sample set can then be generated automatically from the voice data and the target decoded data, without manual labeling, to support the training task of the speech recognition model.

Description

Voice data processing method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for processing speech data, an electronic device, and a computer-readable medium.
Background
The goal of speech recognition is to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. Current speech recognition systems generally take one of two approaches: the first converts voice data into acoustic features, decodes the acoustic features through a language model such as a Weighted Finite-State Transducer (WFST), and outputs text, with a neural network model commonly serving as the acoustic model; the second uses an end-to-end neural network model to process the voice data directly and convert it into decoded text.
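For illustration only, the following minimal sketch contrasts the shapes of the two approaches; the callables (acoustic_model, wfst_decoder, e2e_model) are hypothetical placeholders for exposition, not components defined by this disclosure.

```python
def pipeline_hybrid(audio, acoustic_model, wfst_decoder):
    # First approach: a neural acoustic model extracts acoustic
    # features/scores, and a WFST-based decoder turns them into text.
    acoustic_scores = acoustic_model(audio)
    return wfst_decoder(acoustic_scores)

def pipeline_end_to_end(audio, e2e_model):
    # Second approach: a single end-to-end neural network maps the
    # voice data directly to decoded text.
    return e2e_model(audio)
```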
However, in both methods, improving the accuracy of the neural network model (the acoustic neural network model in the first method, the end-to-end neural network model in the second) requires training on a large amount of labeled speech data. Labeling voice data currently consumes substantial manpower and financial resources, and the labeling cost is very high, so the realization of speech recognition is limited by the amount of labeled data available.
Therefore, a new voice data processing method, apparatus, electronic device and computer readable medium are needed.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiments of the present disclosure provide a voice data processing method and apparatus, an electronic device, and a computer-readable medium, so as to overcome, at least to some extent, the high cost of acquiring labeled data for speech recognition in the prior art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
The embodiment of the present disclosure provides a method for processing voice data, including: acquiring voice data; processing the voice data through a voice recognition model to obtain at least one decoding data of the voice data and a first score of the decoding data; determining target decoded data of the voice data according to the first score of the at least one decoded data; and generating a target training sample set according to the voice data and the target decoding data.
In an exemplary embodiment of the present disclosure, the method further comprises: and training the voice recognition model through the target training sample set to obtain the trained voice recognition model.
In an exemplary embodiment of the present disclosure, processing the speech data through a speech recognition model to obtain at least one decoded data and a first score thereof includes: processing the voice data through the voice recognition model to obtain the at least one decoding data and the confidence thereof; determining the first score of the at least one decoded data according to a confidence of the at least one decoded data.
In an exemplary embodiment of the present disclosure, processing the speech data through a speech recognition model to obtain at least one decoded data and a first score thereof includes: processing the voice data through the voice recognition model to obtain the at least one decoding data; processing the at least one decoded data through a language model to obtain the first score of the at least one decoded data.
In an exemplary embodiment of the present disclosure, determining target decoded data of the voice data according to the first score of the at least one decoded data includes: and taking decoded data with the largest first score in the at least one decoded data as the target decoded data of the voice data.
In an exemplary embodiment of the present disclosure, the method further comprises: and if the first scores of the at least one decoding data corresponding to the voice data are all smaller than a score threshold, removing the voice data.
In an exemplary embodiment of the present disclosure, the speech recognition model is obtained after training through a first training sample set.
An embodiment of the present disclosure provides a data processing apparatus, including: a data acquisition module configured to acquire voice data; the data decoding module is configured to process the voice data through a voice recognition model to obtain at least one decoding data of the voice data and a first score of the decoding data; a score analysis module configured to determine target decoding data of the speech data according to a first score of the at least one decoding data; a sample generation module configured to generate a target training sample set from the speech data and the target decoding data.
An embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the voice data processing method as described in the above embodiments.
An embodiment of the present disclosure provides a computer-readable medium, on which a computer program is stored, where the computer program is executed by a processor to implement the voice data processing method described in the above embodiment.
In some embodiments of the present disclosure, the voice data is processed by the speech recognition model to obtain at least one piece of decoded data and its first score, and the target decoded data of the voice data is determined according to the first scores, so the target decoded data can be determined from the recognition results of the speech recognition model itself. A target training sample set can then be generated from the voice data and the target decoded data, expanding the labeled voice data with a new training sample set obtained automatically from the model itself to support training of the speech recognition model, without consuming manpower and material resources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture 100 to which a voice data processing method or apparatus of an embodiment of the present disclosure may be applied;
FIG. 2 schematically shows a flow diagram of a method of speech data processing according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow chart of a method of speech data processing according to another embodiment of the present disclosure;
FIG. 4 is a flowchart in an exemplary embodiment based on step S220 of FIG. 2;
FIG. 5 is a flowchart in an exemplary embodiment based on step S220 of FIG. 2;
FIG. 6 is a flowchart in an exemplary embodiment based on step S230 of FIG. 2;
FIG. 7 schematically illustrates a flow chart of a method of processing voice data according to yet another embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a method of processing voice data according to yet another embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a method of processing voice data according to yet another embodiment of the present disclosure;
FIG. 10 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which a voice data processing method or apparatus of an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, portable computers, desktop computers, wearable devices, virtual reality devices, smart homes, and so forth.
The server 105 may be a server that provides various services. For example, the terminal device 103 (which may also be the terminal device 101 or 102) uploads voice data to the server 105. The server 105 may acquire the voice data; process the voice data through a speech recognition model to obtain at least one piece of decoded data of the voice data and a first score of each piece; determine target decoded data of the voice data according to the first scores; and generate a target training sample set according to the voice data and the target decoded data. The target training sample set may then be fed back to the terminal device 103, so that the terminal device 103 can train the speech recognition model on it and obtain a trained model; in this way, the labeled voice data can be expanded with a training sample set obtained automatically from the speech recognition model itself, without consuming manpower and material resources.
Fig. 2 schematically shows a flow chart of a speech data processing method according to an embodiment of the present disclosure. The method provided by the embodiment of the present disclosure may be processed by any electronic device with computing processing capability, for example, the server 105 and/or the terminal devices 102 and 103 in the embodiment of fig. 1 described above, and in the following embodiment, the server 105 is taken as an execution subject for example, but the present disclosure is not limited thereto.
As shown in fig. 2, a voice data processing method provided by an embodiment of the present disclosure may include the following steps.
In step S210, voice data is acquired.
In the embodiment of the present disclosure, the voice data may be unlabeled voice data. The voice data may be preprocessed, for example by endpoint detection, pre-emphasis of the high-frequency portion, framing, windowing, and high-pass filtering.
In the embodiment of the disclosure, the voice data can be further processed through the acoustic model to obtain the acoustic features.
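As a concrete illustration of such preprocessing, the following sketch applies pre-emphasis, framing, and windowing with NumPy; the frame length, hop size, and pre-emphasis coefficient are illustrative assumptions, not values fixed by this disclosure.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis: boost the high-frequency portion of the signal.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame.
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at 16 kHz
print(frames.shape)                          # (98, 400)
```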
In step S220, the voice data is processed through the voice recognition model, and at least one decoded data of the voice data and a first score thereof are obtained.
In an embodiment of the present disclosure, the speech recognition model may be a neural network model. For example, the speech recognition model may be a deep neural network model or the like, but the specific form of the speech recognition model is not particularly limited by the present disclosure.
If the voice data is processed by the acoustic model and the acoustic features are obtained in step S210, the acoustic features may be processed by the voice recognition model in this step.
In an exemplary embodiment, the speech recognition model is obtained after training on a first training sample set, where the first training sample set includes a first amount of labeled voice data. The voice data in step S210 may be of a second amount, and the second amount is greater than the first amount.
In the embodiment of the disclosure, the voice data is processed by the speech recognition model to obtain at least one piece of decoded data output by the model and the confidence of each piece. The confidence of each piece of decoded data may be used as its first score.
In step S230, target decoded data of the voice data is determined according to the first score of the at least one decoded data.
In the embodiment of the present disclosure, the at least one decoded data may be sorted according to a first score of the decoded data, decoded data with a largest first score may be determined according to a sorting result, and the decoded data may be used as target decoded data of the voice data.
In step S240, a target training sample set is generated according to the speech data and the target decoding data.
In the embodiment of the present disclosure, the target decoding data may be used as a voice labeling result of the voice data, so as to generate a target training sample set according to the voice data and the target decoding data.
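Taken together, steps S220 to S240 can be sketched as follows. The recognize callable, assumed to return (decoded text, first score) pairs, is a hypothetical interface; the sketch only illustrates selecting the highest-scoring hypothesis and pairing it with the voice data as a pseudo-label.

```python
def build_target_samples(voice_data, recognize):
    # recognize(audio) -> [(decoded_text, first_score), ...] (assumed API)
    samples = []
    for audio in voice_data:
        hypotheses = recognize(audio)
        if not hypotheses:
            continue
        # Step S230: take the decoded data with the largest first score.
        target, _ = max(hypotheses, key=lambda h: h[1])
        # Step S240: use the target decoded data as the labeling result.
        samples.append((audio, target))
    return samples
```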
According to the voice data processing method provided by the embodiment of the disclosure, the voice data is processed through the speech recognition model to obtain at least one piece of decoded data and its first score, and the target decoded data of the voice data is determined according to the first scores, so the target decoded data can be determined from the recognition results of the model itself. A target training sample set can then be generated from the voice data and the target decoded data, expanding the labeled voice data with a new training sample set obtained automatically from the model to support its training, without consuming manpower and material resources.
Fig. 3 schematically shows a flow chart of a speech data processing method according to another embodiment of the present disclosure.
As shown in fig. 3, the voice data processing method provided by the present embodiment may include the following steps.
In step S310, voice data is acquired.
This step may adopt a step similar to step S210 in fig. 2, and is not described here again.
In step S320, the voice data is processed through the voice recognition model, and at least one decoded data of the voice data and a first score thereof are obtained.
The speech recognition model can be obtained after being trained through the first training sample set. This step may adopt a step similar to step S220 in fig. 2, and is not described herein again.
In step S330, target decoded data of the voice data is determined according to the first score of the at least one decoded data.
This step may adopt a step similar to step S230 in fig. 2, and is not described here again.
In step S340, a target training sample set is generated according to the speech data and the target decoding data.
This step may adopt a step similar to step S240 in fig. 2, and is not described here again.
In step S350, the speech recognition model is trained through the target training sample set, and a trained speech recognition model is obtained.
In the embodiment of the present disclosure, the speech recognition model is trained through the target training sample set. When the speech recognition model has been obtained by training on the first training sample set, its accuracy may not reach the preset lower accuracy limit because the first training sample set contains few samples. The model can then be trained further on the target training sample set, with the training terminating once a preset accuracy or a preset number of iterations is reached. Thus, even when the number of samples in the first training sample set cannot satisfy the training task, the task can still be completed with the self-expanded target training sample set, yielding a trained speech recognition model.
In an exemplary embodiment, steps S310 to S350 may be performed in a loop. The trained speech recognition model obtained in step S350 may be used as the speech recognition model in step S320 in the next iteration of the loop, and the iteration termination condition may be a training termination condition of the speech recognition model.
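The loop over steps S310 to S350 might be organized as follows; pseudo_label, train, and evaluate are hypothetical callables standing in for the steps above, and the termination values are illustrative assumptions.

```python
def self_training_loop(model, unlabeled, pseudo_label, train, evaluate,
                       target_acc=0.95, max_iters=10):
    for _ in range(max_iters):
        # Steps S310-S340: decode the unlabeled voice data and build the
        # target training sample set from the model's own outputs.
        samples = pseudo_label(model, unlabeled)
        # Step S350: continue training on the expanded sample set.
        model = train(model, samples)
        # Terminate on reaching a preset accuracy (or the iteration cap).
        if evaluate(model) >= target_acc:
            break
    return model
```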
The voice data processing method of the embodiment of the disclosure can process unlabeled voice data with the speech recognition model to obtain the target decoded data of the voice data, and can thereby label the voice data automatically to obtain an expanded target training sample set. Further, the speech recognition model can be trained on the target training sample set until it reaches the preset precision.
Fig. 4 is a flowchart in an exemplary embodiment based on step S220 of fig. 2.
As shown in fig. 4, step S220 in the above-mentioned embodiment of fig. 2 may further include the following steps.
In step S2211, the voice data is processed through the voice recognition model to obtain at least one decoded data and its confidence level.
In the embodiment of the present disclosure, when the speech recognition model processes the speech data, it may output at least one piece of decoded data of the speech data together with the confidence of each piece. Confidence is a measure by which a speech recognition system judges the reliability of its own recognition results, and it is generally given as the reliability of the recognition result or as the probability value of the recognition result.
In step S2212, a first score of the at least one decoded data is determined according to the confidence of the at least one decoded data.
In the embodiment of the present disclosure, the confidence of each decoded data may be used as the first score of the decoded data, or the confidence of each decoded data may be subjected to quantization processing to obtain the first score of the decoded data.
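As a small sketch of step S2212, the confidence can be used directly or after a simple quantization; the quantization rule below (mapping onto an integer 0-100 scale) is an illustrative assumption, not a rule fixed by this disclosure.

```python
def first_scores(hypotheses, quantize=False):
    # hypotheses: [(decoded_text, confidence in [0, 1]), ...] (assumed)
    if quantize:
        # Quantize each confidence onto an integer 0-100 scale.
        return [(text, round(conf * 100)) for text, conf in hypotheses]
    # Otherwise use the confidence itself as the first score.
    return list(hypotheses)
```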
Fig. 5 is a flowchart in an exemplary embodiment based on step S220 of fig. 2.
As shown in fig. 5, step S220 in the above-mentioned embodiment of fig. 2 may further include the following steps.
In step S2221, the speech data is processed by the speech recognition model to obtain at least one decoded data.
This step may adopt a step similar to step S2211 in fig. 4, and is not described here again.
In step S2222, the at least one decoded data is processed by the language model to obtain a first score of the at least one decoded data.
In embodiments of the present disclosure, the language model may include one or more of the following: an n-gram language model, an attention-based language model (e.g., a Transformer), or a recurrent neural network (RNN) language model. The language model may re-score the decoded data. The at least one piece of decoded data is processed through the language model, a re-scoring result is obtained for each piece of decoded data, and the re-scoring result is used as the first score of that piece of decoded data.
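For illustration, the following sketch re-scores hypotheses with a toy bigram (n-gram) language model using add-one smoothing; a production system would use a trained n-gram, Transformer, or RNN language model instead, and the interfaces here are assumptions.

```python
import math

def bigram_log_prob(tokens, bigram_counts, unigram_counts, vocab_size):
    # Add-one smoothed bigram log-probability of a token sequence.
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += math.log((bigram_counts.get((prev, cur), 0) + 1) /
                          (unigram_counts.get(prev, 0) + vocab_size))
    return score

def rescore(hypotheses, bigram_counts, unigram_counts, vocab_size):
    # Replace each hypothesis's score with the language-model score,
    # which then serves as the first score of the decoded data.
    return [(toks, bigram_log_prob(toks, bigram_counts,
                                   unigram_counts, vocab_size))
            for toks, _ in hypotheses]
```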
Fig. 6 is a flowchart in an exemplary embodiment based on step S230 of fig. 2.
As shown in fig. 6, step S230 in the above-mentioned embodiment of fig. 2 may further include the following steps.
In step S231, decoded data having the largest first score among the at least one decoded data is set as target decoded data of the voice data.
In the embodiment of the present disclosure, the at least one decoded data may be sorted according to a first score of the decoded data, decoded data with a largest first score may be determined according to a sorting result, and the decoded data may be used as target decoded data of the voice data.
The voice data processing method of the embodiment of the disclosure can process voice data with the speech recognition model and automatically obtain the target decoded data, and can thereby label the voice data automatically, obtaining new sample data to support the training process of the speech recognition model without consuming manpower and material resources.
Fig. 7 schematically shows a flow chart of a voice data processing method according to yet another embodiment of the present disclosure.
As shown in fig. 7, the voice data processing method according to the above embodiment further includes the following steps.
In step S710, if the first scores of the at least one piece of decoded data corresponding to the voice data are all smaller than the score threshold, the voice data is removed.
In the embodiment of the present disclosure, the score threshold may be obtained according to empirical data, and the present disclosure does not make any special limitation on the specific numerical value of the score threshold.
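A sketch of this rejection rule, assuming hypotheses carry (text, first score) pairs; as noted above, the threshold value itself is empirical.

```python
def keep_utterance(hypotheses, score_threshold):
    # Keep the voice data only if at least one hypothesis reaches the
    # threshold; otherwise the utterance is removed (step S710).
    return any(score >= score_threshold for _, score in hypotheses)
```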
Fig. 8 schematically shows a flow chart of a voice data processing method according to still another embodiment of the present disclosure.
As shown in fig. 8, the voice data processing method according to the above embodiment further includes the following steps.
In step S810, voice data is acquired.
This step may adopt a step similar to step S210 in fig. 2, and is not described here again.
In step S820, the voice data is processed by the voice recognition model to obtain at least one decoded data and its confidence level.
This step may adopt a step similar to step S2211 in fig. 4, and is not described here again.
In step S830, a first score of the at least one decoded data is determined according to the confidence of the at least one decoded data.
In the embodiment of the present disclosure, if the first scores of the at least one piece of decoded data corresponding to the voice data are all smaller than the score threshold, the voice data is rejected.
The steps may adopt steps similar to step S2212 in fig. 4, and are not described here again.
In step S840, decoded data having the largest first score among the at least one decoded data is used as the target decoded data of the voice data.
This step may adopt a step similar to step S231 in fig. 6, and is not described herein again.
In step S850, a target training sample set is generated from the speech data and the target decoding data.
This step may adopt a step similar to step S240 in fig. 2, and is not described here again.
In step S860, the speech recognition model is trained through the target training sample set, and steps S810 to S860 are iteratively executed until a training termination condition of the speech recognition model is satisfied, and the training is terminated and the trained speech recognition model is obtained.
In the embodiment of the present disclosure, the training termination condition of the speech recognition model may be to reach a preset precision or to reach a preset iteration number.
According to the voice data processing method, the voice data are processed according to the voice recognition model, the voice data can be automatically marked based on the obtained target decoding data, and then a target training sample set is obtained, so that the training sample set is automatically expanded. Furthermore, the voice recognition model is trained through the target training sample set, so that the dependence of the training process of the voice recognition system on the labeled data can be eliminated, and the consumption of manpower and material resources can be saved.
Fig. 9 schematically shows a flow chart of a voice data processing method according to still another embodiment of the present disclosure.
As shown in fig. 9, the voice data processing method provided by the present embodiment includes the following steps.
In step S910, voice data is acquired.
This step may adopt a step similar to step S210 in fig. 2, and is not described here again.
In step S920, the voice data is processed by the voice recognition model to obtain at least one decoded data.
This step may adopt a step similar to step S2221 in fig. 5, and is not described here again.
In step S930, the at least one decoded data is processed by the language model to obtain a first score of the at least one decoded data.
In the embodiment of the present disclosure, if the first scores of the at least one piece of decoded data corresponding to the voice data are all smaller than the score threshold, the voice data is rejected.
This step may adopt a step similar to step S2222 in fig. 5, and is not described here again.
In step S940, decoded data having the largest first score among the at least one decoded data is used as the target decoded data of the voice data.
This step may adopt a step similar to step S231 in fig. 6, and is not described herein again.
In step S950, a target training sample set is generated according to the speech data and the target decoding data.
This step may adopt a step similar to step S240 in fig. 2, and is not described here again.
In step S960, the speech recognition model is trained through the target training sample set, and steps S910 to S960 are iteratively performed until a training termination condition of the speech recognition model is satisfied, ending the training and obtaining a trained speech recognition model.
In the embodiment of the present disclosure, the training termination condition of the speech recognition model may be to reach a preset precision or to reach a preset iteration number.
According to the voice data processing method, the voice data is processed by the speech recognition model to obtain decoded data, the decoded data is re-scored by the language model, and the target decoded data is determined according to the re-scoring results; constraining the decoded data with the language model in this way yields high-quality target decoded data. The voice data is then labeled automatically with the obtained target decoded data to produce the target training sample set, realizing automatic expansion of the training sample set. Furthermore, training the speech recognition model on the target training sample set removes the dependence of the training process on manually labeled data and saves manpower and material resources.
The following describes embodiments of the apparatus of the present disclosure, which may be used to perform the above-mentioned voice data processing method of the present disclosure. For details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the voice data processing method described above in the present disclosure.
Fig. 10 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 10, a data processing apparatus 1000 according to an embodiment of the present disclosure may include: a data acquisition module 1010, a data decoding module 1020, a score analysis module 1030, and a sample generation module 1040.
In the data processing apparatus 1000, the data acquisition module 1010 may be configured to acquire voice data.
The data decoding module 1020 may be configured to process the speech data through a speech recognition model to obtain at least one decoded data of the speech data and a first score thereof.
In an exemplary embodiment, the data decoding module 1020 may include a first decoding unit and a first score unit. The first decoding unit may be configured to process the voice data through a voice recognition model, and obtain at least one decoded data and a confidence thereof. The first score unit may be configured to determine a first score of the at least one decoded data according to a confidence of the at least one decoded data.
In an exemplary embodiment, the speech recognition model is obtained after training by a first set of training samples.
In an exemplary embodiment, the data decoding module 1020 may include a second decoding unit and a second score unit. The second decoding unit may be configured to process the voice data through the speech recognition model to obtain at least one piece of decoded data. The second score unit may be configured to process the at least one piece of decoded data through the language model to obtain its first score.
Score analysis module 1030 may be configured to determine target decoding data for the voice data based on a first score of the at least one decoding data.
In an exemplary embodiment, the score analysis module 1030 may be configured to take decoded data of which a first score is the largest among the at least one decoded data as target decoded data of the voice data.
The sample generation module 1040 may be configured to generate a target training sample set from the speech data and the target decoding data.
In an exemplary embodiment, the data processing apparatus 1000 may further include a model training module, and the model training module may be configured to train the speech recognition model through the target training sample set to obtain a trained speech recognition model.
In an exemplary embodiment, the data processing apparatus 1000 may further include a data culling module, and the data culling module may be configured to cull the voice data if the first scores of the at least one decoded data corresponding to the voice data are all smaller than the score threshold.
The data processing apparatus according to the embodiment of the present disclosure processes voice data with the speech recognition model to obtain at least one piece of decoded data and its first score, and determines the target decoded data of the voice data according to the first scores, so the target decoded data can be determined from the recognition results of the model itself. A target training sample set can then be generated from the voice data and the target decoded data, expanding the labeled voice data with a new training sample set obtained automatically from the model to support its training, without consuming manpower and material resources.
FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure. It should be noted that the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a signal output unit such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as necessary.
In particular, the processes described above with reference to the flowcharts may be implemented as computer software programs according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109 and/or installed from the removable medium 1111. When the computer program is executed by the Central Processing Unit (CPU) 1101, the various functions defined in the system of the present application are executed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described modules and/or units may also be disposed in a processor. Wherein the names of such modules and/or units do not in some way constitute a limitation on the modules and/or units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 2, 3, 4, 5, 6, 7, 8, or 9.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for processing voice data, comprising:
acquiring voice data;
processing the voice data through a voice recognition model to obtain at least one decoding data of the voice data and a first score of the decoding data;
determining target decoded data of the voice data according to the first score of the at least one decoded data;
and generating a target training sample set according to the voice data and the target decoding data.
2. The method of claim 1, further comprising:
and training the voice recognition model through the target training sample set to obtain the trained voice recognition model.
3. The method of claim 1, wherein processing the speech data through a speech recognition model to obtain at least one decoded data and its first score comprises:
processing the voice data through the voice recognition model to obtain the at least one decoding data and the confidence thereof;
determining the first score of the at least one decoded data according to a confidence of the at least one decoded data.
4. The method of claim 1, wherein processing the speech data through a speech recognition model to obtain at least one decoded data and its first score comprises:
processing the voice data through the voice recognition model to obtain the at least one decoding data;
processing the at least one decoded data through a language model to obtain the first score of the at least one decoded data.
5. The method of claim 1, wherein determining target decoded data for the speech data based on the first score of the at least one decoded data comprises:
and taking decoded data with the largest first score in the at least one decoded data as the target decoded data of the voice data.
6. The method of claim 1, further comprising:
and if the first scores of the at least one decoding data corresponding to the voice data are all smaller than a score threshold, removing the voice data.
7. The method of claim 1, wherein the speech recognition model is obtained after training with a first set of training samples.
8. A speech data processing apparatus, comprising:
a data acquisition module configured to acquire voice data;
the data decoding module is configured to process the voice data through a voice recognition model to obtain at least one decoding data of the voice data and a first score of the decoding data;
a score analysis module configured to determine target decoding data of the speech data according to a first score of the at least one decoding data;
a sample generation module configured to generate a target training sample set from the speech data and the target decoding data.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201911261757.5A 2019-12-10 2019-12-10 Voice data processing method and device, electronic equipment and computer readable medium Pending CN110992938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911261757.5A CN110992938A (en) 2019-12-10 2019-12-10 Voice data processing method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911261757.5A CN110992938A (en) 2019-12-10 2019-12-10 Voice data processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN110992938A true CN110992938A (en) 2020-04-10

Family

ID=70092060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911261757.5A Pending CN110992938A (en) 2019-12-10 2019-12-10 Voice data processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN110992938A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN109256152A (en) * 2018-11-08 2019-01-22 上海起作业信息科技有限公司 Speech assessment method and device, electronic equipment, storage medium
CN109637537A (en) * 2018-12-28 2019-04-16 北京声智科技有限公司 A kind of method that automatic acquisition labeled data optimizes customized wake-up model
CN109919302A (en) * 2019-02-28 2019-06-21 新华三技术有限公司 A kind of neural network training method and device
CN110335594A (en) * 2019-07-11 2019-10-15 哈尔滨工业大学 Automatic speech recognition difficulty sample method for digging based on multi-instance learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189980A1 (en) * 2020-10-26 2021-09-30 平安科技(深圳)有限公司 Voice data generation method and apparatus, and computer device and storage medium
CN112133294A (en) * 2020-11-02 2020-12-25 标贝(北京)科技有限公司 Speech recognition method, apparatus and system and storage medium
CN113012706A (en) * 2021-02-18 2021-06-22 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113380228A (en) * 2021-06-08 2021-09-10 北京它思智能科技有限公司 Online voice recognition method and system based on recurrent neural network language model
CN113658586A (en) * 2021-08-13 2021-11-16 北京百度网讯科技有限公司 Training method of voice recognition model, voice interaction method and device
CN113658586B (en) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method of voice recognition model, voice interaction method and device

Similar Documents

Publication Publication Date Title
KR102401942B1 (en) Method and apparatus for evaluating translation quality
CN109214386B (en) Method and apparatus for generating image recognition model
CN110992938A (en) Voice data processing method and device, electronic equipment and computer readable medium
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN111428010B (en) Man-machine intelligent question-answering method and device
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN113470619B (en) Speech recognition method, device, medium and equipment
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN109359194B (en) Method and apparatus for predicting information categories
CN109582825B (en) Method and apparatus for generating information
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN109190123B (en) Method and apparatus for outputting information
CN111368560A (en) Text translation method and device, electronic equipment and storage medium
CN113239204A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN110634050A (en) Method, device, electronic equipment and storage medium for identifying house source type
CN112926341A (en) Text data processing method and device
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN108664610B (en) Method and apparatus for processing data
CN107656627B (en) Information input method and device
CN114880520B (en) Video title generation method, device, electronic equipment and medium
CN110689285A (en) Test method, test device, electronic equipment and computer readable storage medium
CN106896936B (en) Vocabulary pushing method and device
CN111126078B (en) Translation method and device
CN111079185B (en) Database information processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination