CN118053420A - Speech recognition method, apparatus, device, medium and program product


Info

Publication number
CN118053420A
Authority
CN
China
Prior art keywords
voice
speech
recognition model
training
data set
Prior art date
Legal status
Pending
Application number
CN202410185894.XA
Other languages
Chinese (zh)
Inventor
刘涛
白杰
尚天顺
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202410185894.XA
Publication of CN118053420A


Abstract

The present disclosure provides a speech recognition method that may be applied to the fields of artificial intelligence and financial technology. The speech recognition method includes: acquiring speech data to be recognized, where the speech data at least partially contains specific speech and/or specialized vocabulary; and recognizing the speech data to be recognized with a pre-trained speech recognition model to obtain a recognition result, where the speech recognition model is obtained by iterative training on a plurality of different data sets. The present disclosure also provides a speech recognition apparatus, device, storage medium and program product.

Description

Speech recognition method, apparatus, device, medium and program product
Technical Field
The present disclosure relates to the field of artificial intelligence and finance, in particular to the field of speech recognition, and more particularly to a speech recognition method, apparatus, device, medium and program product.
Background
Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), aims at converting the lexical content of human speech into computer-readable input, such as key presses, binary codes, or character sequences.
The basic principle of speech recognition is to let a machine automatically recognize and understand human spoken sentences through signal processing and pattern recognition of the input speech signal. The basic pipeline comprises four parts: feature extraction, the acoustic model, the language model, and the dictionary with decoding. Feature extraction converts the sound signal from the time domain to the frequency domain and provides feature vectors for the acoustic model; the acoustic model scores each feature vector against the acoustic features; the language model computes, based on linguistic knowledge, the probability that the sound signal corresponds to a candidate phrase sequence; and the dictionary maps the decoded phrase sequence to the final, most likely text representation.
Language models are widely used. Although transfer learning can reduce the cost of training large models, when a language model is transferred to a specific field such as a small language (i.e., a low-resource language, including dialects), training data is still scarce because data acquisition and processing are difficult and manual labeling is costly. The lack of training data leads to poor model performance during transfer learning, making it difficult to reach a level usable in practical applications.
Disclosure of Invention
In view of the foregoing, the present disclosure provides speech recognition methods, apparatus, devices, media, and program products that improve language model performance.
According to a first aspect of the present disclosure, there is provided a speech recognition method comprising: acquiring voice data to be recognized, wherein the voice data to be recognized at least partially comprises specific voices and/or specialized vocabularies; recognizing the voice data to be recognized by adopting a pre-trained voice recognition model to obtain a recognition result; the voice recognition model is obtained by iterative training of a plurality of different data sets.
According to an embodiment of the present disclosure, training of a speech recognition model includes: acquiring a training voice data set, wherein the training voice data set comprises a first voice data set and a second voice data set, the number of samples of the first voice data set is larger than that of samples of the second voice data set, and the second voice data set is a specific professional voice data set; fragmenting the second voice data set by adopting a window sliding method to obtain a plurality of voice fragments; training an initial voice recognition model by adopting a first voice data set to obtain an intermediate voice recognition model; and retraining the intermediate speech recognition model by adopting a plurality of speech fragments to obtain a speech recognition model.
According to an embodiment of the present disclosure, fragmenting a second speech data set using a window sliding method to obtain a plurality of speech fragments includes: determining the length of a sliding window and the window overlapping rate; determining a plurality of training samples in the second speech data set; and respectively sliding the sliding windows according to the window overlapping rate so as to divide a plurality of training samples and obtain a plurality of voice fragments.
According to an embodiment of the present disclosure, fragmenting the second speech data set by using a window sliding method to obtain a plurality of speech fragments further includes: performing voice activity detection on the plurality of speech fragments with a voice activity detection algorithm to determine to-be-processed fragments that contain non-speech data; determining the position of the non-speech data within a to-be-processed fragment; removing the non-speech data when it lies at the beginning or end of the fragment; copying the to-be-processed fragment when the non-speech data lies elsewhere than the beginning or end; and splicing the copied to-be-processed fragment to its adjacent fragments respectively.
According to an embodiment of the present disclosure, retraining an intermediate speech recognition model with a plurality of speech segments, obtaining a speech recognition model includes: labeling the voice fragments respectively to obtain a plurality of labeled voice fragments; and retraining the intermediate voice recognition model by adopting a plurality of marked voice fragments to obtain a voice recognition model.
According to an embodiment of the present disclosure, retraining an intermediate speech recognition model with a plurality of labeled speech segments, the obtaining a speech recognition model includes: extracting MFCC features and FBank features of a plurality of labeled voice fragments respectively; respectively carrying out text transcription on the plurality of labeled voice fragments to obtain a plurality of transcribed texts; and retraining the intermediate speech recognition model by adopting the MFCC features and FBank features and the transcribed text corresponding to the labeled speech fragment to obtain the speech recognition model.
According to an embodiment of the present disclosure, labeling a plurality of voice segments, respectively, to obtain a plurality of labeled voice segments includes: labeling each voice segment for multiple times to obtain multiple labeling results; and determining the result with the largest proportion among the labeling results as a labeling voice fragment.
According to an embodiment of the present disclosure, before retraining the intermediate speech recognition model, the speech recognition method further comprises: similarity judgment is carried out on the plurality of marked voice fragments so as to determine repeated voice data; and performing de-duplication processing on the repeated voice data.
According to an embodiment of the present disclosure, the voice recognition method further includes: preprocessing the training voice data set, wherein the preprocessing comprises the following steps: any one or more of audio enhancement, noise removal, audio repair, and sample expansion.
According to embodiments of the present disclosure, the speech recognition models include an ASR speech recognition model, a small language speech recognition model, a dialect speech recognition model, and a professional field speech recognition model.
A second aspect of the present disclosure provides a voice recognition apparatus, comprising: the acquisition module is used for acquiring voice data to be recognized, wherein the voice data to be recognized at least partially comprises specific voices and/or professional vocabularies; the recognition module is used for recognizing the voice data to be recognized by adopting a pre-trained voice recognition model to obtain a recognition result; the voice recognition model is obtained by iterative training of a plurality of different data sets.
According to an embodiment of the disclosure, the speech recognition device further comprises a training module for training the speech recognition model; wherein, training module includes: the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a training voice data set, the training voice data set comprises a first voice data set and a second voice data set, the number of samples of the first voice data set is larger than that of samples of the second voice data set, and the second voice data set is a specific professional voice data set; the fragmentation unit is used for carrying out fragmentation processing on the second voice data set by adopting a window sliding method to obtain a plurality of voice fragments; the first training unit is used for training the initial voice recognition model by adopting a first voice data set to obtain an intermediate voice recognition model; and the second training unit is used for retraining the intermediate voice recognition model by adopting a plurality of voice fragments to obtain a voice recognition model.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the speech recognition method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described speech recognition method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described speech recognition method.
Compared with the prior art, the voice recognition method, the voice recognition device, the electronic equipment, the storage medium and the program product have at least the following beneficial effects:
(1) According to the speech recognition method of the present disclosure, for recognition of speech containing specific speech and specialized vocabulary, the speech recognition model is trained iteratively on different data sets, so that various kinds of specific speech can be recognized flexibly and accurately.
(2) According to the speech recognition method of the present disclosure, when training the speech recognition model, a large-scale general speech training set is first used to pre-train the model, yielding an intermediate speech recognition model that accurately recognizes general speech; a small-scale professional speech data set is then used to fine-tune and retrain the intermediate model according to the speech recognition needs of a specific field. The model can thus be flexibly adapted to recognize various kinds of specific speech, poor generalization is overcome, and transfer schemes for niche languages (dialects and the like) remain flexible. A good training effect can be achieved from small-sample speech data, improving recognition accuracy while greatly reducing the cost of model training.
(3) According to the speech recognition method of the present disclosure, a sliding window is used to fragment the small-scale professional speech data set at a given overlap rate, yielding a plurality of speech segments for training the speech recognition model and making it convenient to extract and label speech features.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a speech recognition method, apparatus, device, medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a speech recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method of training a speech recognition model in accordance with an embodiment of the present disclosure;
FIG. 4A schematically illustrates a method flow diagram for fragmenting a second set of speech data in accordance with an embodiment of the present disclosure; FIG. 4B schematically illustrates an operator interface diagram for fragmenting a second set of speech data in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a method flow diagram for fragmenting a second set of speech data in accordance with another embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model according to another embodiment of the disclosure;
FIG. 8 schematically illustrates a flow chart of a method of labeling a plurality of speech segments according to an embodiment of the disclosure;
FIG. 9 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model according to another embodiment of the disclosure;
FIG. 10 schematically illustrates a flow chart of a speech recognition method according to another embodiment of the present disclosure;
FIG. 11A schematically illustrates a block diagram of a speech recognition apparatus according to an embodiment of the present disclosure; FIG. 11B schematically illustrates a block diagram of a speech recognition apparatus according to another embodiment of the present disclosure; FIG. 11C schematically illustrates a block diagram of a training module according to an embodiment of the present disclosure; and
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement a speech recognition method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a convention should generally be interpreted in the sense in which one of ordinary skill in the art would understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Embodiments of the present disclosure provide a voice recognition method, apparatus, device, medium, and program product, which may be used in the financial field or other fields. It should be noted that the speech recognition method, apparatus, device, medium and program product of the present disclosure may be used in the financial field, and may also be used in any field other than the financial field, and the application fields of the speech recognition method, apparatus, device, medium and program product of the present disclosure are not limited.
In the technical solution of the present disclosure, the user information involved (including, but not limited to, personal information, image information, and device information such as location) and the data involved (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of such data comply with the relevant laws, regulations, and standards; necessary security measures are taken; public order and good customs are not violated; and corresponding operation entries are provided for the user to choose to authorize or refuse.
In the scenario of using personal information to make an automated decision, the method, the device and the system provided by the embodiment of the disclosure provide corresponding operation inlets for users, so that the users can choose to agree or reject the automated decision result; if the user selects refusal, the expert decision flow is entered. The expression "automated decision" here refers to an activity of automatically analyzing, assessing the behavioral habits, hobbies or economic, health, credit status of an individual, etc. by means of a computer program, and making a decision. The expression "expert decision" here refers to an activity of making a decision by a person who is specializing in a certain field of work, has specialized experience, knowledge and skills and reaches a certain level of expertise.
The embodiment of the disclosure provides a voice recognition method, which comprises the following steps: acquiring voice data to be recognized, wherein the voice data to be recognized at least partially comprises specific voices and/or specialized vocabularies; recognizing the voice data to be recognized by adopting a pre-trained voice recognition model to obtain a recognition result; the voice recognition model is obtained by iterative training of a plurality of different data sets. The voice recognition model is trained through iteration of different data sets, so that flexible and accurate recognition of various specific voices can be realized.
Fig. 1 schematically illustrates an application scenario diagram of a speech recognition method, apparatus, device, medium and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device. In particular, the server 105 may be a server providing a speech recognition service, provided with a speech recognition model, and may perform model training and speech recognition on speech data input by the user using the terminal devices 101, 102, 103.
It should be noted that the voice recognition method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the voice recognition apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The speech recognition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the voice recognition apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
First, the related terms of the embodiments of the present disclosure are explained as follows:
Speech model: a speech model is a deep learning model used to process speech signal data and speech-related tasks, such as speech recognition, speech transcription, voice cloning, and voiceprint recognition.
Small-sample speech data: only a limited amount of speech data is available for a speech-related task. Because acquiring and labeling speech data is costly, large-scale speech data sets are often difficult to obtain.
The voice recognition method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 10 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a speech recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the voice recognition method of this embodiment includes, for example, operations S210 to S220, and the voice recognition method may be executed by a computer program on corresponding computer hardware.
In operation S210, voice data to be recognized, which at least partially includes specific voices and/or specialized vocabularies, is acquired.
In operation S220, the speech data to be recognized is recognized by using the pre-trained speech recognition model, and a recognition result is obtained. The voice recognition model is obtained by iterative training of a plurality of different data sets.
In some embodiments, the speech data to be recognized is obtained from various sources, such as user recordings, public speech data sets, and the like. Such data includes, at least in part, specific speech and/or specialized vocabulary, such as medical terms, legal terms, and technical terms. The specific speech may be, for example, a small language, a dialect, etc.
The acquired speech data to be recognized is processed and recognized using a pre-trained speech recognition model, such as a deep-learning-based model, e.g., a Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or Transformer model. The training process of the model includes iterative training on a plurality of different data sets to enhance the generalization ability of the model across different environments.
The model converts the speech data to be recognized into a textual representation, such as a word, phrase, or sentence. This process typically involves feature extraction of the speech signal and then classification or sequence prediction of these features using a model.
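As a purely illustrative sketch of this acquire-and-recognize flow (the disclosure does not prescribe a toolkit or checkpoint; the Hugging Face transformers library and the checkpoint name below are assumptions introduced here), a pre-trained speech recognition model could be invoked as follows:

    # Illustrative only: the library choice and checkpoint name are assumptions,
    # not part of the disclosure; any pre-trained speech recognition model
    # fine-tuned as described herein could be substituted.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition",
                   model="some-pretrained-asr-checkpoint")   # hypothetical checkpoint name
    result = asr("speech_to_recognize.wav")                  # speech data to be recognized
    print(result["text"])                                    # recognition result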
It should be noted that the above embodiment only provides an example of one possible speech recognition method, and does not constitute a limitation on the method. In practical applications, adjustments and optimizations may be required according to specific needs and conditions.
In operation S210, the voice data to be recognized is acquired, which may involve acquiring user information, such as user voiceprint information, etc.
In embodiments of the present disclosure, the user's consent or authorization may be obtained prior to obtaining the user's information. For example, before operation S210, a request to acquire user information may be issued to the user. In case the user agrees or authorizes that the user information can be acquired, the operation S210 is performed.
After the recognition result is obtained, processing/decision of subsequent action execution may be performed by using the user information to achieve operation of the terminal or acquisition of information, etc. in operation S220.
In the embodiment of the disclosure, a corresponding operation entrance can be provided for the user to choose to agree or reject the automated decision result. That is, before performing the processing/decision of the subsequent action execution on the user information, an instruction of the user to perform the processing/decision through agreement or rejection input through the corresponding operation portal may be obtained. If the user agrees to make the processing/decision, the processing/decision of the subsequent action execution is made on the user information, that is, step S220 is performed. If the user refuses to make the process/decision, an expert decision flow is entered.
FIG. 3 schematically illustrates a flow chart of a method of training a speech recognition model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 3, the speech recognition model is trained, for example, through operations S321 to S324.
In operation S321, a training speech data set is acquired, the training speech data set including a first speech data set and a second speech data set, the first speech data set having a greater number of samples than the second speech data set, the second speech data set being a specific professional speech data set.
In operation S322, the second speech data set is fragmented by using a window sliding method to obtain a plurality of speech fragments.
In operation S323, the initial speech recognition model is trained using the first speech data set to obtain an intermediate speech recognition model. And
In operation S324, the intermediate speech recognition model is retrained with the plurality of speech segments to obtain a speech recognition model.
In some embodiments, first, a training speech data set is required. This dataset may include two subsets: a first speech data set and a second speech data set. The first speech data set may be a large-scale, diverse, general speech data set that provides a foundation for model training. The second speech data set is a specific professional speech data set, comprising speech samples from particular fields, such as medical terms, legal language, scientific vocabulary, or financial vocabulary. General speech data sets are of high quality and readily available, whereas speech data for the specific financial field must be collected in-house, for example by a remote banking channel recording speech data in the course of its business processes.
Since the second speech data set may contain some long sentences or complex language structures, in order to better train the model, a window sliding method may be used to fragment the second speech data set into a plurality of short speech segments. These segments may more directly reflect the speech characteristics of a particular domain.
An initial speech recognition model is trained using the first speech data set. This process typically involves steps such as feature extraction, model training, and optimization.
For example, there are many model training methods for speech recognition, including rule-based methods, statistical-based methods, deep learning-based methods, and the like.
Statistical-based model training methods are one of the widely employed approaches. This approach relies primarily on a large amount of speech data to train and uses statistical learning algorithms to build models. Common methods include, for example, hidden markov models (Hidden Markov Model, HMM) and deep neural networks (Deep Neural Networks, DNN), among others. HMM is a state transition based model that enables recognition of speech signals by building a probabilistic mapping between states and observations. DNN is a more complex and powerful model that achieves higher accuracy by layering multiple neural networks.
Another approach is deep-learning-based speech recognition model training. Such methods typically use deep learning models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, or Transformers. These models can automatically learn useful features from large amounts of speech data and can handle long-term dependencies in sequence data. Training deep learning models typically requires substantial computational resources and expertise, but their performance is often superior to statistical approaches.
In addition to statistical and deep learning-based methods, there are other model training methods, such as rule-based methods, support Vector Machine (SVM) -based methods, and the like.
Regardless of the method used, model training is a very important step in speech recognition. During the training process, attention is paid to the problems of data quality and diversity, model complexity and generalization capability, parameter adjustment and optimization and the like. At the same time, cross-validation and performance assessment are also required to ensure accuracy and reliability of the model.
With the initial model, the plurality of speech segments previously fragmented are used to retrain the model. Because the segments are from a particular specialized speech data set, the model retraining process can be more focused on domain-specific speech recognition, thereby improving the accuracy of the model in the domain.
For example, during fine-tuning of the initial model, a model trained on an existing large-scale speech data set can be applied to the small-sample speech data by reusing its weights and knowledge. Through transfer learning, the model first learns general speech features and patterns from the rich data and is then fine-tuned on the small-sample data. During the training iterations of the speech model, its performance is evaluated continuously until the model's effect (evaluation metrics such as precision) reaches a preset threshold, ensuring the training effect, and a regularization method is used to avoid overfitting.
By the method, not only is a large-scale and diversified first voice data set utilized to establish a basic model, but also a second voice data set in a specific field is utilized to conduct targeted retraining. This helps to improve the speech recognition performance of the model in a particular area. By the method for fragmenting the voice data, the sample data set can be expanded to a certain extent after labeling, and the data labeling workload is reduced.
It should be noted that the foregoing embodiments only provide one possible training method, and may need to be adjusted and optimized according to specific requirements and conditions in practical applications.
Fig. 4A schematically illustrates a method flow diagram for fragmenting a second set of speech data in accordance with an embodiment of the present disclosure. Fig. 4B schematically illustrates an operational interface diagram for fragmenting a second set of speech data in accordance with an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 4A, the second speech data set is subjected to a fragmentation process, for example, through operations S4221 to S4223, resulting in a plurality of speech fragments.
In operation S4221, the length of the sliding window and the window overlap rate are determined.
In operation S4222, a plurality of training samples in the second speech data set are determined.
In operation S4223, sliding windows are respectively slid according to the window overlapping rate to divide the plurality of training samples to obtain a plurality of speech segments.
In some embodiments, the sliding window method segments a long speech signal into a plurality of windows of the same length. Each window contains a continuous stretch of the speech signal, the windows overlap, and there is a fixed sliding step between adjacent windows; each window then serves as a small segment. Using the sliding window method in the present disclosure, the number and length of the segments can be controlled by adjusting the window length and the overlap rate. The speech signal is typically sampled at 16000 or 8000 samples per second, and the length of each segment is typically 10 or 20 seconds, so the window length can be set according to the audio sampling rate; the overlap rate represents the proportion of overlap between adjacent windows, typically 50% or 75%.
For example, the length of the sliding window and the window overlap ratio are determined:
The selection of an appropriate window length is critical. For example, 20 seconds or 30 seconds may be selected as the window length, depending on the characteristics of the speech signal and the accuracy required.
To capture subtle changes in the speech signal, a 50% overlap rate can be selected, meaning that there is 10 seconds or 15 seconds overlap between adjacent windows.
Determining a plurality of training samples in the second speech data set:
A plurality of training samples is selected from the second speech data set. These samples may be sentences, phrases or words, depending on the particular application requirements. For example, if the goal is to identify a particular medical term, the training sample selected may be a medically relevant term or phrase.
For each selected training sample, it is segmented using a sliding window method. For example, for a voice data duration of 2 minutes, it may be split into a number of 20 second windows and ensure a 10 second overlap between each adjacent window.
As shown in Fig. 4B, by this sliding window method the original speech signal can be split into a plurality of small segments, each containing a speech signal of fixed length; the window keeps sliding and splitting until it has covered the entire speech data to be segmented.
By this sliding window approach, multiple short speech segments can be obtained from each training sample. These segments can reflect the characteristics of the speech signal more closely, especially when processing speech in specialized vocabulary or in a specific domain. This approach helps to improve the speech recognition performance of the model in a particular area. The voice signal is segmented by a sliding window method, and a long-segment voice signal is segmented into a plurality of short-time voice signal segments, so that subsequent feature extraction, modeling and recognition can be facilitated.
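For illustration, a minimal sketch of this fragmentation step is given below; librosa is assumed for audio loading (the disclosure does not name a toolkit), and the window length and overlap rate are the adjustable parameters discussed above:

    import librosa

    def fragment(wav_path, window_s=20.0, overlap=0.5, sr=16000):
        """Split one long recording into overlapping fixed-length speech segments."""
        y, _ = librosa.load(wav_path, sr=sr)
        win = int(window_s * sr)               # sliding window length in samples
        hop = int(win * (1.0 - overlap))       # sliding step implied by the overlap rate
        segments = []
        for start in range(0, len(y), hop):
            seg = y[start:start + win]
            if len(seg) < win:                 # final short remainder; pad or merge as needed
                break
            segments.append(seg)
        return segments

    # e.g. a 2-minute sample with 20 s windows and 50% overlap yields 11 segments
    # (window starts at 0, 10, 20, ..., 100 seconds), matching the example above.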
Fig. 5 schematically illustrates a flow chart of a method of fragmenting a second set of speech data in accordance with another embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 5, the second speech data set is subjected to a fragmentation process, for example, through operations S5221 to S5225, resulting in a plurality of speech fragments.
In operation S5221, voice activity detection is performed on the plurality of voice segments, respectively, using a voice activity detection algorithm to determine a to-be-processed voice segment containing non-voice data.
In operation S5222, the position of the non-voice data in the voice segment to be processed is determined.
In operation S5223, in the case where the non-voice data is at the beginning and end positions of the voice clip to be processed, the non-voice data is rejected.
In operation S5224, in the case where the non-voice data is in a non-end-to-end position of the to-be-processed voice segment, the to-be-processed voice segment is copied. And
In operation S5225, the copied to-be-processed speech segments are spliced with adjacent speech segments, respectively.
In some embodiments, the collected speech data samples contain stretches, such as pauses, that carry no speech. The present disclosure uses a voice activity detection (VAD, Voice Activity Detection) algorithm to extract the speech portions from the original speech signal and split them into small segments; applying VAD on top of the sliding window method keeps non-speech portions out of the training data, which improves the training effect in the subsequent model training process.
Meanwhile, because the window size is fixed, the window boundary may fall in the middle of active speech while sliding, so the feature information of the speech data at the right boundary of the window would be lost. Considering that the present disclosure trains models on small-sample speech data, in order not to lose this feature information, VAD is used to check the boundary when a segment is cut: if the boundary falls within active speech, the next window is spliced onto the current window, so that although the window becomes longer, the feature information is retained.
For example, a voice activity detection algorithm may be used to detect voice activity for each of a plurality of voice segments:
voice Activity Detection (VAD) is a voice detection technique that is used to determine the presence or absence of a voice signal. By using the VAD algorithm it can be identified which speech segments contain speech activity and which do not.
By the VAD algorithm, segments that do not contain voice activity, which are typically considered non-voice data or background noise, can be marked.
After the non-speech data is identified, its position in the entire speech segment needs to be further determined. This helps to decide how to process these non-speech data.
If the non-speech data is located at the beginning or end of the to-be-processed speech segment, it can be removed directly. For example, if the beginning or ending portion of a segment is marked as non-speech, that portion of data is not used for model training.
If the non-speech data is located in the middle of the to-be-processed segment, the segment is copied and spliced onto its adjacent segments, and the splicing must preserve the continuity and consistency of the data. This ensures the integrity of the sentences used to train the model.
By the method, non-voice data can be processed, and the model is ensured to be trained by using pure data containing actual voice activities. This helps to improve the accuracy and reliability of the model, especially when processing professional speech data in a particular area.
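As a sketch of the head/tail rejection step (operations S5221 to S5223), a simple short-time-energy detector is used below as a stand-in for the unspecified voice activity detection algorithm; the frame size and energy threshold are illustrative assumptions:

    import numpy as np

    def frame_energy(seg, frame=400, hop=160):
        """Short-time energy per frame (400 samples is about 25 ms at 16 kHz)."""
        return np.array([float(np.sum(seg[i:i + frame] ** 2))
                         for i in range(0, len(seg) - frame + 1, hop)])

    def trim_edges(seg, thresh=1e-4, frame=400, hop=160):
        """Reject non-speech data found at the beginning or end of a segment."""
        energy = frame_energy(seg, frame, hop)
        active = np.where(energy > thresh)[0]
        if active.size == 0:
            return None                        # the whole segment is non-speech; drop it
        start = active[0] * hop
        end = min(active[-1] * hop + frame, len(seg))
        return seg[start:end]

Non-speech stretches detected in the middle of a segment would instead trigger the copy-and-splice handling of operations S5224 and S5225.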
FIG. 6 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model in accordance with an embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 6, the intermediate speech recognition model is retrained, for example, by operations S6241 to S6242, resulting in a speech recognition model.
In operation S6241, the plurality of voice segments are labeled, respectively, to obtain a plurality of labeled voice segments.
In operation S6242, the intermediate speech recognition model is retrained using the plurality of labeled speech segments to obtain a speech recognition model.
In some embodiments, labeling is an important step in speech recognition, which involves converting speech signals into textual representations. For each speech segment, transcription needs to be performed by a professional or using an automatic labeling tool, and accuracy and consistency of transcription are ensured.
The labeling process may be performed manually or may be accomplished using semi-automatic or Automatic Speech Recognition (ASR) tools. Either way, the labeling results need to be carefully checked and collated to ensure accuracy.
For example, for each small segment, it is necessary to annotate its corresponding transcribed text, and select or develop an appropriate annotation tool. This may be special voice markup software or general purpose audio editing software such as Audacity. Ensuring that the tool supports real-time listening, transcription and labeling functions. Or the voice data is marked by a manual marking method, and each small segment is manually listened to and transcribed into a text as a marking result by the manual marking method.
Opening the labeling tool, and loading the voice fragments one by one.
The speech segments are played and transcribed simultaneously or afterwards. During transcription, the annotator remains as objective as possible and reflects the content of the speech accurately.
After transcription is completed, the transcription content is carefully checked to ensure accuracy.
If errors are found, correction and re-verification are performed. This step is critical because erroneous labeling data can cause model training to deviate from the correct direction.
The marked voice fragments are stored into a specific format, such as a JSON or CSV file, so that the subsequent model training is convenient.
After the labeling of all the speech segments is completed, a series of labeled speech segments will be obtained. These segments will be used for subsequent model retraining.
And retraining the intermediate model by using the labeled voice fragments. This process typically involves the use of a supervised learning algorithm, in which data is annotated as the target output, the predicted outcome of the model is compared to the target output, and optimization and adjustment is performed based on the differences.
For example, the obtained labeled speech segment is used as training data. Ensuring that the data format is consistent with the format required for model training.
According to the model and the data characteristics, appropriate hyperparameters are set, such as the learning rate, batch size, and number of iterations.
Model retraining is performed using appropriate optimization algorithms (e.g., stochastic gradient descent, Adam, etc.) and loss functions (e.g., cross-entropy loss).
After training, the model learns according to the labeling data, and parameters are continuously optimized to reduce prediction errors.
During training, metrics such as the loss value and the accuracy need to be monitored in real time to ensure that model training proceeds normally.
Depending on the monitoring results, training parameters may be adjusted or other optimization strategies employed if the model is found to perform poorly in some way. For example, different learning rates may be tried, integrated learning may be employed, etc. to improve model performance.
When the model reaches satisfactory performance or reaches the preset training round number, the training is stopped. The model obtained at this time is a retrained speech recognition model for the specific domain.
Through the retraining process, the intermediate model is adjusted and optimized according to the new annotation data. This process can enhance the performance of the model in a particular area and increase its accuracy and reliability.
Through the steps, the middle model can be retrained by using the labeled multiple voice fragments, so that a more accurate and reliable voice recognition model aiming at a specific field is obtained. Such domain-specific model training methods may improve the performance of speech recognition, especially when processing challenging specialized speech data.
It should be noted that during retraining, the hyperparameters, optimization algorithm, and loss function must be chosen reasonably to ensure an improvement in model performance. Meanwhile, attention must be paid to overfitting and underfitting, and corresponding regularization measures should be taken to prevent overfitting.
In addition, in order to ensure generalization ability of the model, cross-validation or the like may be used to evaluate the performance of the model, and adjustment and optimization may be performed according to the evaluation result.
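A minimal sketch of such a retraining loop is shown below; PyTorch, the hyperparameter values, and the simplified per-segment classification framing are assumptions for illustration and are not fixed by the disclosure:

    import torch
    import torch.nn as nn

    def retrain(model, train_loader, val_loader, epochs=20, lr=1e-4, patience=3):
        """Fine-tune the intermediate model on labeled segments with early stopping."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # Adam or SGD, per the text
        criterion = nn.CrossEntropyLoss()                         # cross-entropy loss
        best_val, stall = float("inf"), 0
        for epoch in range(epochs):
            model.train()
            for feats, labels in train_loader:                    # labeled speech segments
                optimizer.zero_grad()
                loss = criterion(model(feats), labels)
                loss.backward()                                   # back-propagation
                optimizer.step()                                  # parameter update
            model.eval()
            with torch.no_grad():                                 # monitor validation loss
                val = sum(criterion(model(f), l).item() for f, l in val_loader)
            if val < best_val:
                best_val, stall = val, 0
                torch.save(model.state_dict(), "best_model.pt")
            else:
                stall += 1
                if stall >= patience:                             # stop when performance stalls
                    break
        return model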
FIG. 7 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model according to another embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 7, the intermediate speech recognition model is retrained, for example, by operations S7421 to S7423, resulting in a speech recognition model.
In operation S7421, MFCC features and FBank features of the plurality of labeled speech segments are extracted, respectively.
In operation S7422, text transcription is performed on the plurality of labeled speech segments, respectively, to obtain a plurality of transcribed texts. And
In operation S7423, the intermediate speech recognition model is retrained using the MFCC features and FBank features, and the transcribed text corresponding to the labeled speech segment, to obtain a speech recognition model.
In some embodiments, MFCC features and FBank features of the plurality of labeled speech segments are extracted separately:
MFCCs (Mel-frequency cepstral coefficients) and FBank (filter bank features, i.e., log Mel filter bank energies) are commonly used features in speech recognition. These features effectively capture key information in the speech signal, such as the frequency content and energy of the audio, for subsequent model training.
For each labeled speech segment, its MFCC and FBank features are extracted using the speech processing toolkit Librosa and converted into a feature vector or matrix. This step typically involves pre-emphasis, framing, windowing, the Fast Fourier Transform (FFT), the Mel filter bank, and so on.
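A minimal sketch of this extraction step with Librosa follows; the frame size, hop length, and filter counts are illustrative defaults, not values fixed by the disclosure:

    import numpy as np
    import librosa

    def extract_features(wav_path, sr=16000, n_mfcc=13, n_mels=40):
        """Return MFCC and FBank (log Mel filter bank) matrices for one labeled segment."""
        y, sr = librosa.load(wav_path, sr=sr)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])                    # pre-emphasis
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)        # framing, windowing, FFT
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                             n_fft=400, hop_length=160)
        fbank = librosa.power_to_db(mel)                              # log Mel filter bank energies
        return mfcc.T, fbank.T                                        # shape (frames, features)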
Each labeled speech segment is converted to a corresponding text representation using speech recognition techniques. This may be accomplished by existing speech recognition tools or services, or may be done using semi-automated methods.
During transcription, accuracy must be ensured, and the influence of noise and other interfering factors should be reduced as much as possible.
The extracted features and the transcribed text are used as input and labels for model retraining. The consistency of the features and text data is ensured and model optimization is performed using appropriate training algorithms and loss functions.
For example, the application of these new features in a speech recognition model mainly involves the following steps:
Input data preparation:
The extracted MFCC features and FBank features are used as input data, ready for model training. Ensuring that the feature data format is consistent with the format required for model training.
Feature preprocessing:
The extracted features undergo the necessary preprocessing, including normalization and standardization, to eliminate differences in scale and distribution between features. This step helps improve the training efficiency and accuracy of the model.
Model input layer:
The preprocessed features are taken as input and passed to the input layer of the speech recognition model. The input layer receives the feature data and converts it into an internal representation the model can process.
Feature mapping:
after the input layer, the model maps features to a higher dimensional space. This step typically involves linear transformations and nonlinear activation functions to enhance the representation capabilities of the model.
Hidden layer processing:
The hidden layer is the core part of the model, responsible for receiving features from the input layer and performing complex transformations and processing. The hidden layer may contain multiple neurons and complex network structures for capturing complex patterns and associations in speech.
Output layer and prediction:
the output layer of the model converts the output of the hidden layer into the final prediction result, i.e. the transcribed text representation. The output layer typically uses a softmax function or other classification method to transform the output of the hidden layer into a probability distribution to determine the most likely transcription result.
Back propagation and parameter update:
During the training process, the model calculates the gradient of the loss function by a back propagation algorithm and updates the parameters of the model according to the gradient descent or other optimization algorithm. This step is the core of model training, ensuring that the model can gradually adapt to new features and data distributions.
Training and verification:
in the whole retraining process, the extracted features and the corresponding transcription text are used for a training set, and the verification set is used for evaluating and adjusting the model performance. The performance of the model is continuously optimized by means of adjusting super parameters, optimizing algorithms and the like.
Model evaluation and deployment:
After training is completed, the final model is evaluated using the test set, ensuring that it performs well on unseen data. Once the model achieves satisfactory performance, it can be deployed into a practical application for speech recognition tasks.
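The layer-by-layer description above could correspond to a network of roughly the following shape; the architecture, layer sizes, and the assumption that the input concatenates 13 MFCC and 40 FBank coefficients are illustrative and not fixed by the disclosure:

    import torch.nn as nn

    class SpeechRecognizer(nn.Module):
        """Hypothetical model: input layer -> feature mapping -> hidden layers -> softmax output."""
        def __init__(self, n_feats=53, hidden=256, n_tokens=5000):
            super().__init__()
            self.mapping = nn.Sequential(nn.Linear(n_feats, hidden), nn.ReLU())    # feature mapping
            self.hidden = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)  # hidden layers
            self.out = nn.Linear(hidden, n_tokens)                                 # output layer

        def forward(self, feats):              # feats: (batch, frames, n_feats), e.g. MFCC + FBank
            x = self.mapping(feats)
            x, _ = self.hidden(x)
            return self.out(x).log_softmax(dim=-1)   # probability distribution over tokens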
Through the steps, new features can be effectively applied to the voice recognition model, and the performance and accuracy of the model in a specific field are improved. This helps to develop a more accurate and reliable speech recognition system that meets the needs of different fields.
Through the steps, the characteristics of the marked voice fragments are further extracted, text transcription is carried out, and then the data are used for retraining the intermediate model. The method is helpful to improve the performance and recognition accuracy of the model, so that the model is better suitable for the voice data in the specific field.
Fig. 8 schematically illustrates a flow chart of a method of labeling a plurality of speech segments according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 8, a plurality of voice clips are labeled, for example, through operations S8411 to S8412, respectively, to obtain a plurality of labeled voice clips.
In operation S8411, each voice clip is labeled multiple times, so as to obtain multiple labeling results.
In operation S8412, it is determined that the result having the largest ratio among the plurality of labeling results is the labeled speech segment.
In some embodiments, annotating a speech segment generally involves converting the speech signal to a textual representation. Because of the complexity and uncertainty of speech recognition, multiple labeling of the same speech segment is necessary. Thus, the information in the voice can be more comprehensively captured, and the accuracy of labeling is improved.
For example, to improve labeling quality, this step applies multi-annotator labeling to the small-sample speech data and takes a majority vote over the multiple labeling results to select the result closest to the ground truth (for example, for a given segment, if 8 annotators produce result A and 2 annotators produce other results, result A is taken).
Selecting or developing a suitable annotation tool ensures that the tool supports real-time listening, transcription and annotation functions.
A separate labeling task is created for each speech segment. Ensuring that enough speech data is contained in each task for accurate labeling.
A labeling team is built and team members may include linguists, speech recognition experts, etc. Ensuring that team members have the necessary language and speech knowledge. Each team member independently annotates the same speech segment. This may take the form of a block annotation or an overall annotation, depending on the requirements of the task and team preferences. Team members are encouraged to follow the same labeling specifications and criteria during the labeling process to ensure comparability between different labeling results.
And collecting the labeling results of each team member, and sorting and organizing the results. According to the multiple labeling, multiple labeling results can be obtained. These results may vary due to subjective judgment, linguistic knowledge, and experience differences of the annotators. In order to obtain the final labeling result, the following methods may be used to process a plurality of labeling results:
All collected labeling results are compared to examine their differences and consistency, and results that are similar or close are merged to reduce the diversity of outcomes. The proportion of each labeling result among all results is then counted, either by simple counting or with more sophisticated statistical methods. The labeling result with the largest proportion is selected as the final labeled speech segment, ensuring that model training is based on the broadest and most consistent annotation data.
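A sketch of this majority-vote selection follows; the trivial whitespace normalization stands in for whatever merging of similar results is applied in practice:

    from collections import Counter

    def majority_label(annotations):
        """Return the transcription chosen by the largest share of annotators."""
        merged = [a.strip() for a in annotations]          # merge trivially different results
        label, votes = Counter(merged).most_common(1)[0]   # result with the largest proportion
        return label, votes / len(merged)

    # e.g. 8 of 10 annotators transcribe result A, 2 transcribe something else -> A is kept
    label, share = majority_label(["result A"] * 8 + ["result B"] * 2)   # -> ("result A", 0.8)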
The final labeling result is checked and corrected to ensure its accuracy and consistency. This may involve rechecking specific parts or inviting a third party to review.
The final annotation data are stored in a suitable file format, such as JSON or CSV, to facilitate subsequent model training. When the data are organized, their integrity and consistency are ensured.
Through the above steps, each speech segment can be labeled multiple times and the result with the largest proportion selected as the final labeled speech segment. This helps improve the accuracy and reliability of labeling and, in turn, the training of the speech recognition model.
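As a non-limiting sketch of the majority-vote selection described above (the Python function and variable names are illustrative assumptions, not part of the disclosure), the labeling result with the largest proportion can be chosen as follows:

```python
from collections import Counter

def majority_vote(annotations: list[str]) -> str:
    """Return the transcription that the largest share of annotators agree on.

    `annotations` holds the transcriptions produced by different annotators
    for the same speech segment.
    """
    counts = Counter(a.strip() for a in annotations)
    best_label, _ = counts.most_common(1)[0]
    return best_label

# Example: 8 of 10 annotators agree on result A, 2 do not; A is returned.
labels = ["transfer five hundred yuan"] * 8 + ["transfer fifty yuan", "transfer 500"]
print(majority_vote(labels))  # -> "transfer five hundred yuan"
```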
FIG. 9 schematically illustrates a flow chart of a method of retraining an intermediate speech recognition model according to another embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 9, the speech recognition method further includes, for example, operations S9241 to S9242, before retraining the intermediate speech recognition model.
In operation S9241, similarity judgment is performed on the plurality of labeled speech segments to determine repeated speech data; and
In operation S9242, de-duplication processing is performed on the repeated speech data.
In some embodiments, duplicate data may occur when speech segments are copied and spliced. In order to ensure that the training of the model is not affected by the repeated data, the labeled speech segments need to be deduplicated prior to retraining.
For example, similarity determination is performed on a plurality of labeled speech segments. The purpose of the similarity determination is to identify duplicate or highly similar speech data.
A suitable similarity measure is selected or developed for comparing different speech segments; it may be based on feature comparison, text comparison or other related techniques. All labeled speech segments are then compared pairwise using this measure, and segment pairs with high similarity are recorded for subsequent processing.
Based on the similarity results, it can be determined which speech segments are duplicated or highly similar. A threshold is set to distinguish duplicate segments from merely similar ones; for example, two speech segments may be considered duplicates if their similarity exceeds 90%.
Repeated speech segments are screened out from the similarity results according to the set threshold, ensuring that these duplicates are removed or handled before subsequent training. The purpose of the de-duplication process is to eliminate duplicate data and thereby ensure the accuracy and effectiveness of model training.
The screened repeated speech segments are consolidated so that each unique segment appears only once, avoiding the introduction of redundant information into model training.
The speech segments remaining after de-duplication are re-labeled as needed to ensure the accuracy and consistency of the labeling.
Through the above steps, combining similarity judgment with de-duplication ensures the effectiveness and accuracy of model retraining and prevents duplicate data from degrading model performance. This helps improve the generalization capability and robustness of the model, so that it better adapts to different scenarios and shifts in data distribution.
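As a minimal, non-limiting sketch of the similarity judgment and de-duplication (assuming the segments have already been transcribed; the 0.9 threshold mirrors the 90% example above, and all names are illustrative):

```python
from difflib import SequenceMatcher

def deduplicate(segments: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep one representative of each group of near-duplicate segments.

    Each segment is a dict such as {"id": ..., "text": ...}; similarity is
    judged here on the transcribed text, although a feature-based measure
    (e.g. cosine similarity of MFCC statistics) could be used instead.
    """
    kept: list[dict] = []
    for seg in segments:
        is_duplicate = any(
            SequenceMatcher(None, seg["text"], k["text"]).ratio() > threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(seg)
    return kept

data = [
    {"id": 1, "text": "please check my account balance"},
    {"id": 2, "text": "please check my account balance"},  # duplicate of 1
    {"id": 3, "text": "what is the current deposit rate"},
]
print([s["id"] for s in deduplicate(data)])  # -> [1, 3]
```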
Fig. 10 schematically illustrates a flow chart of a speech recognition method according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 10, the voice recognition method further includes, for example, operation S1010.
In operation S1010, a training speech data set is preprocessed, the preprocessing including: any one or more of audio enhancement, noise removal, audio repair, and sample expansion.
In some embodiments, when processing speech data in a real application, the quality of the data may be affected by a number of factors, such as background noise, poor recording quality, unclear speech, etc. To improve the accuracy and robustness of speech recognition, a pre-processing of the training speech data set is often required. Preprocessing may include audio enhancement, noise removal, audio restoration, and sample expansion techniques. The following are some specific examples of applications of these techniques:
Audio enhancement aims to improve the quality of speech signals, making them clearer and easier to recognize.
For example, enhancing the sound characteristics of a speech signal to improve the audibility of speech includes uniformly adjusting the volume, increasing the clarity and brightness of the sound, improving pitch and prosody, and the like.
The volume of all speech segments is adjusted to the same standard level, ensuring that they fall within a consistent volume range. The treble and bass portions of the speech signal are balanced across the dynamic range, and echo is removed with a digital filter or a dedicated algorithm to improve the clarity of the speech.
The clarity and audibility of speech are further increased through audio enhancement techniques; for example, a speech enhancement filter is used to enhance the spectral content of the speech signal.
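As one non-limiting illustration of adjusting all segments to a consistent volume (the target level and function name are assumptions for the sketch only):

```python
import numpy as np

def normalize_rms(audio: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale a mono waveform (float32 in [-1, 1]) to a target RMS level.

    This is one simple way to give all segments a consistent volume; a real
    pipeline may additionally apply dynamic-range compression, equalization,
    or echo cancellation as described above.
    """
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-12   # avoid division by zero
    target_rms = 10 ** (target_dbfs / 20.0)
    return np.clip(audio * (target_rms / rms), -1.0, 1.0)
```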
Noise removal aims to remove background noise from speech signals and improve the purity of the speech.
For example, background noise is removed from the speech signal using a noise suppression algorithm, such as spectral subtraction or Wiener filters. By detecting speech and non-speech activity, only active segments containing speech are retained, removing meaningless noise segments.
In addition, a deep neural network (DNN) based speech enhancement model may be used to perform noise reduction, de-reverberation, dynamic-range expansion and the like, improving the recognizability and robustness of the speech signals.
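As a simplified, non-limiting sketch of the spectral-subtraction idea mentioned above (it assumes the first ~0.3 s of each recording is speech-free noise; a production system would more likely use Wiener filtering or a DNN enhancer):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(audio: np.ndarray, sr: int, noise_secs: float = 0.3) -> np.ndarray:
    """Estimate the noise spectrum from the leading noise-only frames and
    subtract its magnitude from every frame, keeping the original phase."""
    freqs, times, spec = stft(audio, fs=sr, nperseg=512)
    n_noise = max(1, int(np.searchsorted(times, noise_secs)))
    noise_mag = np.abs(spec[:, :n_noise]).mean(axis=1, keepdims=True)
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero
    clean_spec = clean_mag * np.exp(1j * np.angle(spec))
    _, cleaned = istft(clean_spec, fs=sr, nperseg=512)
    return cleaned.astype(audio.dtype)
```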
Audio repair involves repairing or filling damaged or poor-quality speech segments.
For short periods of silence, noise or other non-speech content, suitable filling techniques may be used, such as silence filling or filling using pre-trained audio models. For corrupted portions of speech, audio repair algorithms or techniques, such as interpolation or reconstruction, may be used to restore the speech quality of the corrupted portions.
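As a crude, non-limiting stand-in for the interpolation-based repair mentioned above (real reconstruction would typically use model-based inpainting):

```python
import numpy as np

def repair_dropouts(audio: np.ndarray) -> np.ndarray:
    """Fill zero-valued dropouts by linear interpolation between the
    surrounding valid samples."""
    repaired = audio.copy()
    bad = repaired == 0.0                     # treat exact zeros as dropouts
    if bad.any() and not bad.all():
        idx = np.arange(len(repaired))
        repaired[bad] = np.interp(idx[bad], idx[~bad], repaired[~bad])
    return repaired
```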
Sample expansion creates new samples by transforming the original speech segments, increasing the generalization ability of the model.
For example, time perturbation, frequency perturbation, speed perturbation, speaking-rate perturbation and similar methods are used to transform and synthesize the original speech data, generating additional training samples (expanded samples) and thus increasing both the diversity and the quantity of the data.
Time-stretched and compressed samples are created by varying the speaking rate without changing the fundamental characteristics of the speech. New samples are also created by modifying spectral characteristics, for example with mel filter banks or cepstral-coefficient transforms, and recordings under different conditions are simulated, such as speech at different distances, from different directions, or with different background noise. In small-sample settings, these transformations provide more varied data and help the model learn to generalize, so that the various speech models achieve better inference results even when only a small amount of speech data is available.
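A non-limiting sketch of such sample expansion using the open-source librosa library (the perturbation factors are illustrative, not values taken from the disclosure):

```python
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int) -> list[np.ndarray]:
    """Generate perturbed copies of one speech segment: slower and faster
    speaking rates, a pitch shift, and additive background noise."""
    return [
        librosa.effects.time_stretch(audio, rate=0.9),                    # slower
        librosa.effects.time_stretch(audio, rate=1.1),                    # faster
        librosa.effects.pitch_shift(audio, sr=sr, n_steps=2),             # higher pitch
        audio + 0.005 * np.random.randn(len(audio)).astype(audio.dtype),  # noisy copy
    ]
```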
For example, in practical applications, the above techniques may be combined into a complete preprocessing flow:
The original audio file is read from the storage device.
Noise segments are first identified using a noise detection algorithm and then the effects of noise are removed or reduced using appropriate noise suppression techniques. For damaged or unclear speech segments, an audio repair algorithm is used to repair or fill in.
All speech segments are time stretched or compressed to increase the generalization ability of the model.
The volume and dynamic range of all speech segments are adjusted to ensure that they are of consistent quality.
And storing the preprocessed audio data on a disk for subsequent model training.
A speech recognition model is then trained using the preprocessed speech data set.
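Putting these pieces together, a minimal preprocessing driver might look like the following sketch (it reuses the illustrative helpers above and assumes the soundfile package for mono WAV input/output; augmentation and dataset storage would follow):

```python
import soundfile as sf  # assumed I/O library; any WAV reader would do

def preprocess_file(in_path: str, out_path: str, expected_sr: int = 16000) -> None:
    """Noise suppression, dropout repair, and volume normalization for one recording."""
    audio, sr = sf.read(in_path, dtype="float32")
    assert sr == expected_sr, "resample upstream if sample rates differ"
    audio = spectral_subtraction(audio, sr)  # noise removal
    audio = repair_dropouts(audio)           # audio repair
    audio = normalize_rms(audio)             # consistent volume
    sf.write(out_path, audio, sr)
```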
According to embodiments of the present disclosure, speech recognition models include, for example, ASR speech recognition models, small language speech recognition models, dialect speech recognition models, and professional field speech recognition models.
In some embodiments, the speech recognition technology is widely applied in many fields, and can meet the requirements of various specific scenes by combining different speech recognition models. The following are several examples of application scenarios for different speech recognition models:
ASR speech recognition model
An intelligent assistant: Using the ASR speech recognition model, intelligent assistant applications are developed that allow users to control devices, query information, set reminders and more through voice commands.
Voice search: Combined with ASR technology, users can speak keywords to search and quickly obtain relevant information from the Internet.
Vehicle navigation system: the use of an ASR speech recognition model in a car navigation system allows a driver to enter a destination through speech, thereby navigating more safely.
Small language speech recognition model
Multilingual support: for places where small languages are used, developing small language speech recognition models can help local people more conveniently use speech recognition technology.
Cross-culture communication: in cross-cultural interactions, the small language speech recognition model may help the translator to quickly obtain and translate the small language speech content.
Cultural heritage protection: the method is used for recording and identifying the cultural heritage of the small language area, and is helpful for preserving and transmitting the local language and culture.
Dialect speech recognition model
Regional feature service: Dialect speech recognition services are provided for users in specific regions, enhancing the user experience and staying closer to local culture and characteristics.
Cultural inheritance: in multi-dialect regions, developing dialect speech recognition models can help record and inherit the speech characteristics of various dialects.
Social media and advertising: the dialect voice recognition model is used in social media or advertisements, so that users in specific areas can be attracted, and brand influence is enhanced.
Professional field speech recognition model
Medical diagnostic assistant: In the medical field, a speech recognition model targeting medical terminology is developed, helping doctors quickly enter medical records, write prescriptions and so on, and improving work efficiency.
Legal consultation service: In the legal field, a speech recognition model for legal terminology helps lawyers organize case materials, look up legal provisions and so on more conveniently.
Financial investment consultants: in the financial field, speech recognition models for financial terms are developed, which can provide investors with personalized investment advice, stock analysis, etc.
By combining different speech recognition models with suitable application scenarios, the potential of speech recognition technology in various fields can be fully realized, improving user experience and work efficiency.
It will be appreciated that the application scenario of the above-described speech recognition method is merely exemplary, and does not limit the scope of protection of the method of the present disclosure, and the speech recognition method of the present disclosure may also be used for speech recognition in other scenarios, such as transcription, synthesis, cloning, and the like of speech.
In summary, the present disclosure provides a speech recognition method that constructs segmented feature vectors from input speech using small-sample training data and uses the combination of segmented feature vectors as input features, thereby improving the training effect of the model, reducing its dependence on labeling, and reducing the labeling workload.
Based on the voice recognition method, the disclosure also provides a voice recognition device. The voice recognition apparatus will be described in detail with reference to fig. 11.
Fig. 11A schematically shows a block diagram of a voice recognition apparatus according to an embodiment of the present disclosure. Fig. 11B schematically shows a block diagram of a voice recognition apparatus according to another embodiment of the present disclosure. Fig. 11C schematically illustrates a block diagram of a training module according to an embodiment of the disclosure.
As shown in fig. 11A, the voice recognition apparatus 1100 of this embodiment includes, for example: an acquisition module 1110 and an identification module 1120.
The obtaining module 1110 is configured to obtain voice data to be recognized, where the voice data to be recognized at least partially includes specific voices and/or specialized vocabularies. In an embodiment, the acquiring module 1110 may be configured to perform the operation S210 described above, which is not described herein.
The recognition module 1120 is configured to recognize the voice data to be recognized by using a pre-trained voice recognition model, so as to obtain a recognition result. The voice recognition model is obtained by iterative training of a plurality of different data sets. In an embodiment, the identification module 1120 may be used to perform the operation S220 described above, which is not described herein.
As shown in fig. 11B, the voice recognition apparatus 1100 of this embodiment further includes, for example: training module 1130 is configured to train the speech recognition model. As shown in fig. 11C, the training module 1130 includes, for example: an acquisition unit 1131, a fragmentation unit 1132, a first training unit 1133, and a second training unit 1134.
The obtaining unit 1131 is configured to obtain a training voice data set, where the training voice data set includes a first voice data set and a second voice data set, and the number of samples of the first voice data set is greater than the number of samples of the second voice data set, and the second voice data set is a specific professional voice data set. In an embodiment, the obtaining unit 1131 may be used to perform the operation S321 described above, which is not described herein.
The fragmentation unit 1132 is configured to perform fragmentation processing on the second speech data set by using a window sliding method, so as to obtain a plurality of speech fragments. In an embodiment, the fragmentation unit 1132 may be used to perform the operation S322 described above, and will not be described herein.
The first training unit 1133 is configured to train the initial speech recognition model by using the first speech data set, so as to obtain an intermediate speech recognition model. In an embodiment, the first training unit 1133 may be used to perform the operation S323 described above, which is not described herein.
The second training unit 1134 is configured to retrain the intermediate speech recognition model by using a plurality of speech segments to obtain a speech recognition model. In an embodiment, the second training unit 1134 may be used to perform the operation S324 described above, which is not described herein.
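As a purely illustrative sketch of how the modules described above could be organized in software (the class names and the transcribe interface are assumptions; the actual apparatus may equally be implemented in hardware as noted below):

```python
class AcquisitionModule:
    """Counterpart of module 1110: loads the speech data to be recognized."""
    def acquire(self, source: str) -> bytes:
        with open(source, "rb") as f:
            return f.read()


class RecognitionModule:
    """Counterpart of module 1120: applies the pre-trained recognition model."""
    def __init__(self, model):
        self.model = model  # model obtained by iterative training on several data sets

    def recognize(self, audio: bytes) -> str:
        return self.model.transcribe(audio)  # `transcribe` is a hypothetical interface


class SpeechRecognitionDevice:
    """Counterpart of apparatus 1100, wiring acquisition and recognition together."""
    def __init__(self, model):
        self.acquisition = AcquisitionModule()
        self.recognition = RecognitionModule(model)

    def run(self, source: str) -> str:
        return self.recognition.recognize(self.acquisition.acquire(source))
```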
Any of the acquisition module 1110, the identification module 1120, and the training module 1130 may be combined in one module to be implemented, or any of the modules may be split into multiple modules, according to embodiments of the present disclosure. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the acquisition module 1110, the identification module 1120, and the training module 1130 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware, such as any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of any of three implementations of software, hardware, and firmware. Or at least one of the acquisition module 1110, the identification module 1120, and the training module 1130 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement a speech recognition method according to an embodiment of the disclosure.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 1203, various programs and data necessary for the operation of the electronic device 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or RAM 1203. Note that the program may be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is likewise connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs that, when executed, implement a speech recognition method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 and/or one or more memories other than the ROM 1202 and the RAM 1203 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product is run on a computer system, the program code causes the computer system to carry out the speech recognition method provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program can also be transmitted, distributed over a network medium in the form of signals, and downloaded and installed via a communication portion 1209, and/or from a removable medium 1211. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code of the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (15)

1. A method of speech recognition, the method comprising:
Acquiring voice data to be recognized, wherein the voice data to be recognized at least partially comprises specific voices and/or professional vocabularies;
the voice data to be recognized are recognized by adopting a pre-trained voice recognition model, and a recognition result is obtained;
the voice recognition model is obtained by iterative training of a plurality of different data sets.
2. The method of claim 1, wherein the training of the speech recognition model comprises:
Acquiring a training voice data set, wherein the training voice data set comprises a first voice data set and a second voice data set, the number of samples of the first voice data set is larger than that of the second voice data set, and the second voice data set is a specific professional voice data set;
fragmenting the second voice data set by adopting a window sliding method to obtain a plurality of voice fragments;
training an initial voice recognition model by adopting the first voice data set to obtain an intermediate voice recognition model; and
And retraining the intermediate voice recognition model by adopting the plurality of voice fragments to obtain the voice recognition model.
3. The method of claim 2, wherein fragmenting the second speech data set using window sliding comprises:
determining the length of a sliding window and the window overlapping rate;
determining a plurality of training samples in the second speech data set;
And respectively sliding the sliding windows according to the window overlapping rate so as to divide the training samples and obtain the voice fragments.
4. The method of claim 3, wherein fragmenting the second speech data set using window sliding further comprises:
Respectively carrying out voice activity detection on the plurality of voice fragments by adopting a voice activity detection algorithm so as to determine a voice fragment to be processed containing non-voice data;
judging the position of the non-voice data in the voice fragment to be processed;
Rejecting the non-voice data under the condition that the non-voice data is at the head-tail position of the voice fragment to be processed;
copying the voice fragment to be processed under the condition that the non-voice data is at a non-head-tail position of the voice fragment to be processed; and
And splicing the copied voice fragments to be processed with the adjacent voice fragments respectively.
5. The method of claim 4, wherein retraining the intermediate speech recognition model using the plurality of speech segments to obtain the speech recognition model comprises:
Labeling the voice fragments respectively to obtain labeled voice fragments;
and retraining the intermediate voice recognition model by adopting the plurality of marked voice fragments to obtain the voice recognition model.
6. The method of claim 5, wherein retraining the intermediate speech recognition model with the plurality of labeled speech segments to obtain the speech recognition model comprises:
Extracting MFCC features and FBank features of the plurality of labeled speech fragments respectively;
respectively carrying out text transcription on the plurality of labeled voice fragments to obtain a plurality of transcription texts; and
And retraining the intermediate voice recognition model by adopting the MFCC features, the FBank features and the transcription text corresponding to the marked voice fragment to obtain the voice recognition model.
7. The method according to claim 5 or 6, wherein labeling the plurality of speech segments, respectively, to obtain a plurality of labeled speech segments comprises:
Labeling each voice fragment for multiple times to obtain multiple labeling results;
And determining the result with the largest proportion among the labeling results as the labeling voice fragment.
8. The method according to claim 5 or 6, wherein prior to retraining the intermediate speech recognition model, the method further comprises:
Performing similarity judgment on the plurality of marked voice fragments to determine repeated voice data; and
And performing de-duplication processing on the repeated voice data.
9. The method according to any one of claims 2 to 8, further comprising:
Preprocessing the training speech data set, the preprocessing comprising: any one or more of audio enhancement, noise removal, audio repair, and sample expansion.
10. The method of claim 9, wherein the speech recognition models include an ASR speech recognition model, a small language speech recognition model, a dialect speech recognition model, and a professional field speech recognition model.
11. A speech recognition device, the device comprising:
The system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring voice data to be recognized, and the voice data to be recognized at least partially comprises specific voices and/or professional vocabularies; and
The recognition module is used for recognizing the voice data to be recognized by adopting a pre-trained voice recognition model to obtain a recognition result;
the voice recognition model is obtained by iterative training of a plurality of different data sets.
12. The apparatus of claim 11, further comprising a training module for training the speech recognition model;
wherein, training module includes:
An acquisition unit configured to acquire a training speech data set, where the training speech data set includes a first speech data set and a second speech data set, the number of samples of the first speech data set is greater than the number of samples of the second speech data set, and the second speech data set is a specific professional speech data set;
the fragmentation unit is used for carrying out fragmentation processing on the second voice data set by adopting a window sliding method to obtain a plurality of voice fragments;
the first training unit is used for training the initial voice recognition model by adopting the first voice data set to obtain an intermediate voice recognition model; and
And the second training unit is used for retraining the intermediate voice recognition model by adopting the plurality of voice fragments to obtain the voice recognition model.
13. An electronic device, comprising:
one or more processors;
Storage means for storing one or more computer programs,
Characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 10.
14. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, realizes the steps of the method according to any one of claims 1-10.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-10.
CN202410185894.XA 2024-02-19 2024-02-19 Speech recognition method, apparatus, device, medium and program product Pending CN118053420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410185894.XA CN118053420A (en) 2024-02-19 2024-02-19 Speech recognition method, apparatus, device, medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410185894.XA CN118053420A (en) 2024-02-19 2024-02-19 Speech recognition method, apparatus, device, medium and program product

Publications (1)

Publication Number Publication Date
CN118053420A true CN118053420A (en) 2024-05-17

Family

ID=91044136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410185894.XA Pending CN118053420A (en) 2024-02-19 2024-02-19 Speech recognition method, apparatus, device, medium and program product

Country Status (1)

Country Link
CN (1) CN118053420A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination