CN117334188A - Speech recognition method, device, electronic equipment and storage medium - Google Patents

Speech recognition method, device, electronic equipment and storage medium

Info

Publication number
CN117334188A
Authority
CN
China
Prior art keywords
voice
text
speech
sample
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311233453.4A
Other languages
Chinese (zh)
Inventor
马坤
臧阳光
金雯
王波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanbao Kechuang Beijing Technology Co ltd
Original Assignee
Yuanbao Kechuang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanbao Kechuang Beijing Technology Co ltd
Priority to CN202311233453.4A
Publication of CN117334188A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech recognition method and apparatus, an electronic device and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring voice to be recognized; extracting logfbank features of the voice to be recognized under the condition that the voice to be recognized includes a voice fragment corresponding to the target field; inputting the logfbank features into a pre-trained voice recognition model to obtain the voice recognition text output by the model, wherein the voice recognition model is trained based on a plurality of first voice samples in the target field, a first text sample corresponding to each first voice sample, a plurality of second voice samples in the general field and a second text sample corresponding to each second voice sample, and the first voice samples include a plurality of sample technical terms in the target field; and outputting the voice recognition text. The trained speech recognition model can recognize voice to be recognized that includes technical terms, so the accuracy of speech recognition is high.

Description

Speech recognition method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for speech recognition, an electronic device, and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that transcribes human speech into text, with the goal of converting the spoken content of speech into readable text.
Existing voice-to-text transcription services mainly target general-purpose transcription scenarios. When such a service transcribes general vocabulary or sentences in speech, the output text is close to the real content of the speech, and the recognition accuracy can meet users' requirements.
However, in non-general professional fields, when an existing transcription service is used to recognize highly specialized speech, the professional vocabulary in the speech cannot be recognized correctly, so the output text differs substantially from the real content of the speech and the accuracy of speech recognition is low.
Disclosure of Invention
The invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, which are used for solving the defect of low accuracy of voice recognition in the prior art and achieving the purpose of improving the accuracy of voice recognition.
The invention provides a voice recognition method, which comprises the following steps:
acquiring voice to be recognized;
extracting a logfbank characteristic of the voice to be recognized under the condition that the voice to be recognized comprises a voice fragment corresponding to the target field;
inputting the logfbank features into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model, wherein the voice recognition model is obtained by training based on a plurality of first voice samples in a target field, a first text sample corresponding to each first voice sample, a plurality of second voice samples in a general field and a second text sample corresponding to each second voice sample, and the first voice samples comprise a plurality of sample technical terms in the target field;
outputting the voice recognition text.
According to the voice recognition method provided by the invention, the plurality of first voice samples are determined based on the following modes:
performing voice synthesis on the first training text by adopting a TTS technology to obtain first training voice;
acquiring a second training voice input by a user based on a second training text; the first training text and the second training text comprise sample professional terms in the target field;
the first training speech and the second training speech are determined as the plurality of first speech samples.
According to the voice recognition method provided by the invention, the voice to be recognized comprises dialogue voice;
the extracting the logfbank characteristic of the voice to be recognized comprises the following steps:
extracting each timbre feature in the dialogue voice;
clustering voices corresponding to the same timbre feature to obtain a clustered voice corresponding to each timbre feature;
respectively extracting the logfbank characteristics of each clustered voice;
inputting the logfbank characteristic into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model, wherein the method comprises the following steps:
inputting the logfbank characteristics of each clustered voice into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model.
According to the voice recognition method provided by the invention, the voice to be recognized comprises at least two voice fragments and time stamps corresponding to the voice fragments; the voice recognition text comprises a plurality of sub-texts and time stamps corresponding to the sub-texts, and the voice fragments correspond to the sub-texts one by one;
the outputting the speech recognition text includes:
sequencing each sub-text based on the sequence of the time stamp corresponding to each voice fragment and the time stamp of each sub-text in the voice recognition text to obtain a sequencing result;
based on each timbre feature, adding corresponding identification information to each sub-text in the sequencing result to obtain a target voice recognition text; the identification information is used for identifying different users;
and outputting the target voice recognition text.
According to the voice recognition method provided by the invention, the voice to be recognized is obtained, which comprises the following steps:
acquiring an initial voice to be recognized;
and determining a mute segment in the initial voice to be recognized, and deleting the mute segment in the initial voice to be recognized to obtain the voice to be recognized.
According to the voice recognition method provided by the invention, the outputting of the voice recognition text comprises the following steps:
carrying out semantic analysis on the voice recognition text to obtain a semantic analysis result;
adding punctuation marks into the voice recognition text based on the semantic analysis result;
and outputting the voice recognition text added with the punctuation.
According to the voice recognition method provided by the invention, the voice recognition model is trained based on the following modes:
inputting the first voice sample and the second voice sample into an initial voice recognition model to obtain a first predicted text corresponding to the first voice sample and a second predicted text corresponding to the second voice sample;
determining a first loss based on the first predicted text and the first text sample, and determining a second loss based on the second predicted text and the second text sample;
and adjusting model parameters of the initial voice recognition model based on the first loss and the second loss to obtain the voice recognition model.
The invention also provides a voice recognition device, comprising:
the acquisition module is used for acquiring the voice to be recognized;
the extraction module is used for extracting the logfbank characteristic of the voice to be recognized under the condition that the voice to be recognized comprises the voice fragment corresponding to the target field;
the processing module is used for inputting the logfbank characteristic into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model, wherein the voice recognition model is obtained by training based on a plurality of first voice samples in a target field, a first text sample corresponding to each first voice sample, a plurality of second voice samples in a general field and a second text sample corresponding to each second voice sample, and the first voice samples comprise a plurality of sample professional terms in the target field;
and the output module is used for outputting the voice recognition text.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any one of the above-mentioned speech recognition methods when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a speech recognition method as described in any of the above.
The invention provides a voice recognition method, a device, electronic equipment and a storage medium, wherein in the method, a voice recognition model for carrying out voice recognition on voice to be recognized is obtained by training based on a plurality of first voice samples in the target field, a first text sample corresponding to each first voice sample, a plurality of second voice samples in the general field and a second text sample corresponding to each second voice sample, and the first voice samples comprise a plurality of sample professional terms in the target field, so that the voice recognition model obtained by training can accurately recognize the professional terms in the voice in the target field on the basis of accurately recognizing the voice content in the general field, the voice recognition text output by the voice recognition model is closer to the real content in the voice to be recognized, the accuracy of the voice recognition text is higher, and the accuracy of the voice recognition is improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech recognition method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the present invention, the numbers of the described objects, such as "first", "second", etc., are only used to distinguish the described objects, and do not have any sequence or technical meaning.
The rapid development of information technology has led to explosive growth of data generated in the internet. These data include not only numerical structured data, but also massive unstructured data, of which sound, text and images are the three more common types of unstructured data. The vast amount of unstructured data has a tremendous application value, and how to convert the unstructured data into a machine-understandable language is a concern.
ASR speech recognition technology can convert sound into text, and computers understand text information more easily than voice information. Based on the text data obtained by voice conversion, deep mining and application, such as automatic quality inspection, intelligent documents or precision marketing, can be performed. Therefore, a high-accuracy ASR speech recognition method is an important basis for data development and application.
The universal ASR interface service can transcribe speech in non-professional fields into text, and the transcription result meets basic application requirements. However, for deeply vertical professional fields, the voice information contains a large number of technical terms. When the universal ASR interface service is used to transcribe speech in such a professional field, the transcribed text differs substantially from the real content of the voice information, and the error rate of the recognition result is high. For example, the insurance industry routinely uses a large number of terms related to insurance and diseases; when a generic ASR interface service transcribes speech containing such terms, neither the terms nor the sentences that include them can be recognized correctly, resulting in a high error rate in the output text that affects subsequent application and development.
In view of the above problems, an embodiment of the present invention provides a method for recognizing speech, which performs speech recognition on an acquired speech to be recognized based on a speech recognition model, where the speech recognition model is obtained by training based on a first speech sample including a plurality of sample terms in a target domain, so that the speech recognition model can correctly recognize terms in the target domain in the speech to be recognized, and accuracy of speech recognition is high. The following describes a speech recognition method according to an embodiment of the present invention with reference to fig. 1 and 2.
Fig. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present invention. The embodiment is applicable to any scene requiring speech recognition, for example, speech recognition in a deeply vertical professional field. The method may be executed by an electronic device such as a mobile phone, a tablet computer, a smart watch, a smart speaker, a translator, a computer or a specially designed speech recognition device, or by a speech recognition apparatus arranged in the electronic device, where the apparatus may be implemented by software, hardware, or a combination of the two. As shown in fig. 1, the speech recognition method includes steps 110 to 140.
Step 110, obtain the voice to be recognized.
Specifically, the voice to be recognized is an object that needs to be voice-recognized, for example, a recording in an audio file, a recording in a video file, or a voice of a speaker at the time of a real-time conversation, or the like. The speech to be recognized may include speech content composed of general words, speech content composed of terms of art, or speech content composed of general words together with terms of art.
The voice to be recognized may be obtained by a voice acquisition device such as a microphone, or may be obtained by calling an audio file or a video file in a database, or may be obtained by other methods.
Step 120, extracting the logfbank feature of the voice to be recognized under the condition that the voice to be recognized includes the voice segment corresponding to the target field.
Specifically, the target field may be any one of the professional fields, which may be understood as the vertical fields or subdivided fields, for example the medical, insurance, financial or educational fields.
For example, the voice to be recognized may be divided into a plurality of voice segments by segmentation according to characters, words or sentences. Each voice segment is then recognized; when a voice segment is recognized as including technical terms in the target field, that segment is a voice segment corresponding to the target field. The logfbank features of the voice to be recognized are extracted under the condition that the voice to be recognized includes a voice segment corresponding to the target field.
For example, the audio signal of the voice to be recognized is preprocessed, where the preprocessing may include framing, pre-emphasis and windowing, and a fast Fourier transform is applied to the preprocessed audio signal to obtain the energy spectrum of the voice to be recognized. The FBank features of the voice to be recognized can then be extracted by applying Mel filtering to the magnitude spectrum of the energy spectrum, and a logarithmic transformation of the FBank features yields the logfbank features of the voice to be recognized; the feature dimension may be, for example, 80. The logfbank features are highly correlated with the FBank features, and extracting them requires less computation than computing the Mel-Frequency Cepstral Coefficients (MFCC) of the voice to be recognized, so the information contained in the logfbank features can be applied to the neural network model more fully, improving the accuracy of speech recognition.
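The following is a minimal Python sketch of this extraction pipeline (pre-emphasis, framing, Hamming windowing, FFT, Mel filtering, logarithm). The frame length, frame shift and pre-emphasis coefficient are illustrative assumptions rather than values fixed by this embodiment; only the 80-dimensional output matches the example above.

```python
import numpy as np

def logfbank(signal, sample_rate=16000, frame_len=0.025, frame_shift=0.01,
             n_fft=512, n_mels=80, preemph=0.97):
    """Minimal logfbank sketch: pre-emphasis, framing, Hamming window,
    FFT power spectrum, Mel filterbank, then log."""
    x = np.asarray(signal, dtype=np.float64)
    x = np.append(x[0], x[1:] - preemph * x[:-1])          # pre-emphasis
    flen, fshift = int(frame_len * sample_rate), int(frame_shift * sample_rate)
    n_frames = 1 + (len(x) - flen) // fshift               # assumes len(x) >= flen
    frames = np.stack([x[i * fshift: i * fshift + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)                     # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filterbank between 0 Hz and the Nyquist frequency
    def hz2mel(hz): return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(1, c - lo)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(1, hi - c)
    feats = power @ fbank.T                                # (n_frames, n_mels)
    return np.log(np.maximum(feats, 1e-10))                # logfbank features
```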
Step 130, inputting the logfbank feature into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model, wherein the voice recognition model is obtained by training based on a plurality of first voice samples in the target field, a first text sample corresponding to each first voice sample, a plurality of second voice samples in the general field and a second text sample corresponding to each second voice sample, and the first voice samples comprise a plurality of sample professional terms in the target field.
In particular, the first voice sample may be a voice sample including a plurality of sample technical terms in the target field. For example, when the target field is the insurance field, the first voice sample includes a plurality of sample terms of the insurance field, where the sample terms may be professional words or set phrases such as "insured person", "beneficiary", "hesitation period", "survival benefit" or "critical-illness payout".
And the first text sample corresponding to the first voice sample is text obtained by converting the first voice sample into text. The second speech sample in the general field may be a speech sample including a non-technical term, and the second text sample corresponding to the second speech sample is text obtained by converting the second speech sample into text.
The voice recognition model may be a neural network model obtained by training based on an initial voice recognition model, a plurality of first voice samples in a target field, first text samples corresponding to each first voice sample, a plurality of second voice samples in a general field, and second text samples corresponding to each second voice sample. The initial speech recognition model may be, for example, an ASR deep learning Conformer model.
For example, model training is performed on an initial speech recognition model, such as a Conformer model, based on the plurality of first voice samples in the target field, the first text sample corresponding to each first voice sample, the plurality of second voice samples in the general field, and the second text sample corresponding to each second voice sample, to obtain the voice recognition model. The training may, for example, be based on the PaddleSpeech model library and run on A100 GPUs.
The Conformer model builds a deep neural network architecture around an attention mechanism and can better capture long-range dependencies in voice information. It combines the advantages of the Transformer model and the Convolutional Neural Network (CNN) in a hybrid architecture, giving higher computational efficiency and a smaller model size, and its strong language modeling capability and tighter semantic representation improve machine reading comprehension and language generation performance.
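As a sketch of this hybrid architecture, the following PyTorch code implements a single Conformer block in the standard macaron layout (half-step feed-forward, self-attention, convolution module, half-step feed-forward). The dimensions and kernel size are illustrative assumptions; the patent does not fix the exact configuration.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block: half-step FFN -> self-attention -> convolution
    module -> half-step FFN, each path with a residual connection."""
    def __init__(self, d_model=256, n_heads=4, conv_kernel=15, ff_mult=4):
        super().__init__()
        self.ff1 = self._ffn(d_model, ff_mult)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(            # pointwise -> GLU -> depthwise -> pointwise
            nn.Conv1d(d_model, 2 * d_model, 1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ff2 = self._ffn(d_model, ff_mult)
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, mult):
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, mult * d_model),
                             nn.SiLU(),
                             nn.Linear(mult * d_model, d_model))

    def forward(self, x):                     # x: (batch, frames, d_model)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2) # Conv1d expects (batch, d_model, frames)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Usage: project 80-dim logfbank features to d_model, then stack blocks, e.g.
# feats (batch, frames, 80) -> nn.Linear(80, 256) -> ConformerBlock() * N
```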
The extracted logfbank features are input into a pre-trained voice recognition model, so that a voice recognition text output by the voice recognition model can be obtained, wherein the voice recognition text is the content of the voice to be recognized, which is expressed in a text form.
Step 140, outputting the speech recognition text.
After the voice recognition model outputs the voice recognition text, the voice recognition text can be output, so that the subsequent application is facilitated. For example, after the voice recognition model outputs the voice recognition text, the voice recognition text is displayed through a display screen, so that a user can conveniently read the voice recognition text; or after the voice recognition model outputs the voice recognition text, the voice recognition text is stored in a target database, so that an application program can conveniently call the voice recognition text and the like.
According to the voice recognition method provided by the embodiment of the invention, the voice recognition model for carrying out voice recognition on the voice to be recognized is obtained by training based on a plurality of first voice samples in the target field, a plurality of first text samples corresponding to the first voice samples, a plurality of second voice samples in the general field and a plurality of second text samples corresponding to the second voice samples, and the first voice samples comprise a plurality of sample professional terms in the target field, so that the voice recognition model obtained by training can accurately recognize the professional terms in the voice in the target field on the basis of accurately recognizing the voice content in the general field, the voice recognition text output by the voice recognition model is closer to the real content in the voice to be recognized, the accuracy of the voice recognition text is higher, and the accuracy of the voice recognition is improved.
The first voice samples include a plurality of sample technical terms in the target field, and training the voice recognition model on such samples enables it to recognize the technical terms in the voice to be recognized. Therefore, before the voice recognition model is trained, the first voice samples including the plurality of sample technical terms in the target field need to be determined.
In an embodiment, the plurality of first speech samples are determined based on: performing voice synthesis on the first training text by adopting a TTS technology to obtain first training voice; acquiring a second training voice input by a user based on a second training text; the first training text and the second training text both comprise sample professional terms in the target field; the first training speech and the second training speech are determined as a plurality of first speech samples.
Specifically, text may be converted into corresponding speech using Text-To-Speech (TTS) synthesis technology. The first training text may be a training text including sample technical terms in the target field; for example, a paragraph or sentence including target-field terms is extracted from a document specific to the target field as the first training text. Performing TTS speech synthesis on the first training text, which is in text form, yields the first training voice, in speech form, corresponding to the first training text. The first training voice may be determined to be a first voice sample.
Alternatively, the audio of the target area may be transcribed into text as the first training text when the first training text is acquired. For example, the ASR interface service is used to convert insurance sales audio into text to obtain the first training text of the insurance domain.
The second training voice can be a training voice input by a user and comprising sample professional terms in the target field, and the text corresponding to the second training voice is the second training text. The second training speech input by the user based on the second training text is acquired, for example, the second training speech generated by the speech acquisition device by acquiring the second training text read by the user. The second training speech may be determined to be the first speech sample.
Optionally, when it is not suitable to obtain the first training voice by TTS speech synthesis, the second training voice input by the user based on the second training text may be obtained by having a person read the professional vocabulary aloud. A case where TTS synthesis is unsuitable may be, for example, one where the speech quality of the first training voice obtained by TTS is poor.
In this embodiment, a first voice sample may be flexibly determined for the target domain according to the requirement of voice recognition, where the determined first voice sample includes a plurality of sample terms in the target domain, so that the first voice sample has a relatively strong correlation with the target domain. The method comprises the steps that a voice recognition model can be obtained through training based on a first voice sample, the professional terms in the voice to be recognized can be recognized through voice recognition based on the voice recognition model, the voice recognition text with high accuracy is output, and the professional requirements of ASR transcription service in a business scene are met.
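A minimal sketch of assembling the first voice samples from these two sources is shown below. `tts_synthesize` and the (wav, txt) layout of the recordings directory are hypothetical stand-ins, since the patent does not name a specific TTS engine or storage format.

```python
from pathlib import Path

def build_first_speech_samples(term_texts, recordings_dir, tts_synthesize):
    """Assemble target-field (first) voice samples from two sources:
    TTS audio synthesized from term-bearing training texts, and human
    recordings of such texts. `tts_synthesize` is a hypothetical callable
    (text -> waveform); recordings are assumed stored as (wav, txt) pairs."""
    samples = []  # list of (waveform_or_path, transcript) pairs
    # Source 1: synthesize speech for each term-bearing first training text
    for text in term_texts:
        samples.append((tts_synthesize(text), text))
    # Source 2: human recordings of second training texts, used e.g. when
    # the TTS output quality for a text is too poor
    for wav in Path(recordings_dir).glob("*.wav"):
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        samples.append((wav, transcript))
    return samples
```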
In order to improve the intelligent level of speech recognition, when the voice to be recognized includes dialogue voice, extracting the logfbank features of the voice to be recognized can be achieved as follows: extracting each timbre feature in the dialogue voice; clustering the voices corresponding to the same timbre feature to obtain the clustered voice corresponding to each timbre feature; and respectively extracting the logfbank features of each clustered voice.
Specifically, the dialogue voice may be real-time dialogue voice or recorded dialogue voice, and it includes voices with at least two timbres. When the voice to be recognized includes dialogue voice, a corresponding timbre feature may be extracted from the timbre information of each voice in the dialogue, where the timbre feature may be a feature value extracted from the waveform of the sound. Clustering based on the timbre features groups the voices corresponding to the same timbre feature into the same voice class, yielding the clustered voice corresponding to each timbre feature, so that logfbank features can be extracted for each clustered voice.
For example, suppose the voice to be recognized includes the voices of a first speaker, a second speaker and a third speaker. Timbre features are extracted from the waveform of each speaker's voice and used as speaker marks to label the voices in the voice to be recognized, so that the speaker corresponding to each voice can be distinguished. Clustering the voices corresponding to the same timbre feature yields the clustered voice corresponding to each timbre feature, i.e., the voices of the first, second and third speakers are separated, and the logfbank features of each speaker's voice are then extracted respectively.
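A sketch of the clustering step, assuming each voice segment has already been mapped to a timbre embedding (e.g. by a speaker-verification model); the agglomerative algorithm and cosine threshold are illustrative choices, as the patent does not name a clustering method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def cluster_by_timbre(segment_embeddings, n_speakers=None):
    """Group speech segments by timbre. `segment_embeddings` is an
    (n_segments, dim) array of per-segment timbre embeddings. With
    n_speakers=None the number of speakers is inferred from a distance
    threshold; the threshold value is an assumption, not the patent's."""
    X = np.asarray(segment_embeddings, dtype=np.float64)
    clusterer = AgglomerativeClustering(
        n_clusters=n_speakers,                 # None -> use distance threshold
        metric="cosine", linkage="average",
        distance_threshold=None if n_speakers else 0.5)
    labels = clusterer.fit_predict(X)
    # indices of the segments belonging to each clustered voice
    return {lab: np.where(labels == lab)[0].tolist() for lab in set(labels)}
```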
Further, the logfbank features are input into a pre-trained speech recognition model to obtain a speech recognition text output by the speech recognition model, and specifically, the logfbank features of each clustered speech are input into the pre-trained speech recognition model to obtain the speech recognition text output by the speech recognition model.
By way of example, the logfbank features of each clustered voice are input into a pre-trained voice recognition model, so that voice recognition texts corresponding to each clustered voice output by the voice recognition model can be obtained, and the purpose of distinguishing voice recognition texts of different dialogs in the voice to be recognized is achieved.
For example, the logfbank features of the voices of the first speaker, the second speaker and the third speaker are respectively input into a pre-trained voice recognition model, so that voice recognition texts corresponding to the voices of the first speaker, the second speaker and the third speaker can be respectively obtained, the output voice recognition texts are respectively corresponding to the speakers, and the output voice recognition texts are read and understood.
In this embodiment, for the case where the voice to be recognized includes dialogue voice, each timbre feature is extracted from the dialogue voice, the voices corresponding to the same timbre feature are clustered to obtain the clustered voices, and the logfbank features extracted from each clustered voice are input into the pre-trained voice recognition model to obtain the voice recognition text it outputs. In this way, the voice content corresponding to different timbres in the voice to be recognized can be recognized separately, yielding a voice recognition text for each timbre. This improves the intelligent level of speech recognition, allows transcribed text for different dialogue parties to be generated quickly and conveniently, facilitates subsequent applications of the voice recognition text, and improves usage efficiency.
In order to further improve the intelligentization level of speech recognition, the content sequence in the speech recognition text corresponding to the speech to be recognized is kept consistent with the content sequence in the speech to be recognized, and the content in the speech recognition text can be ranked based on the time stamp.
In one embodiment, the voice to be recognized includes at least two voice fragments and time stamps corresponding to the voice fragments; the voice recognition text comprises a plurality of sub-texts and time stamps corresponding to the sub-texts, and voice fragments correspond to the sub-texts one by one; the voice recognition text is output, which can be realized by the following steps:
sequencing the sub-texts based on the order of the time stamps corresponding to the voice fragments and the time stamps of the sub-texts in the voice recognition text to obtain a sequencing result; based on each timbre feature, adding corresponding identification information to each sub-text in the sequencing result to obtain a target voice recognition text, wherein the identification information is used for identifying different users; and outputting the target voice recognition text.
Specifically, the voice to be recognized includes at least two voice segments and time stamps corresponding to the voice segments, where the time stamps corresponding to the voice segments may be marks representing time sequences of the voice segments, for example, may be custom time marks, or may be time marks corresponding to real time, etc.
For example, when the time stamps are custom time marks, the start time of the voice to be recognized may be taken as hour 0, minute 0, second 0, and each voice segment stamped accordingly; for instance, the duration between the start of a voice segment and the start of the voice to be recognized is used as that segment's time stamp. Time stamps of this kind directly represent the interval between each voice segment and the start of the voice to be recognized, which makes it convenient to determine the temporal position of each segment within the voice and the intervals between segments.
When the time stamp is a time stamp corresponding to the real time, the real time can be used as the time stamp to mark each voice segment in the voice to be recognized, for example, the date and time when each voice segment in the voice to be recognized is generated are used as the time stamp corresponding to each voice segment. The time stamp in this manner may facilitate determining the actual time when each speech segment was generated. The time stamp corresponding to each voice segment can be at least one of a custom time stamp or a real time stamp, and the time dimension of each voice segment can be ordered through the time stamp corresponding to each voice segment.
The voice fragments are in one-to-one correspondence with the sub-texts, which can be understood as follows: after voice recognition is performed on each voice fragment, the sub-text corresponding to that fragment is output. Since the fragments and sub-texts correspond one to one, the time stamp corresponding to a voice fragment is also the time stamp of its corresponding sub-text; the fragment and its sub-text may share the same time stamp.
Based on the sequence of the time stamps corresponding to the voice fragments and the time stamps of the sub-texts in the voice recognition text, sequencing the sub-texts, and sequencing the sub-texts according to the time sequence of the voice fragments to obtain a sequencing result, namely, obtaining the sub-texts sequenced according to the time sequence.
The identification information may be used to identify different users; for example, it may be a user's name, code or avatar. Further, based on the timbre features, corresponding identification information is added to each sub-text in the sequencing result to obtain the target voice recognition text. Since the user corresponding to each voice fragment can be distinguished by its timbre feature, the sub-text corresponding to each voice fragment can be marked with identification information, making it convenient to tell which user each sub-text belongs to.
Each sub-text in the target voice recognition text may thus include its corresponding identification information and time stamp. Outputting the target voice recognition text makes it more convenient to consult or read, improves the intelligent level of speech recognition, and facilitates subsequent applications of the text; compared with manual transcription, about 90% of the time can be saved.
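Putting the two steps together, a short sketch: sub-texts are sorted by their segment time stamps and each is prefixed with identification information looked up from its timbre cluster. The `SubText` structure and the `id_info` mapping are illustrative, not prescribed by the patent.

```python
from dataclasses import dataclass

@dataclass
class SubText:
    text: str
    timestamp: float      # seconds from the start of the speech, per fragment
    timbre_id: int        # cluster label standing in for a timbre feature

def assemble_target_text(sub_texts, id_info):
    """Order sub-texts by their fragment time stamps, then prefix each with
    identification information mapped from its timbre cluster. `id_info`
    (e.g. {0: "Agent", 1: "Customer"}) is a hypothetical mapping."""
    ordered = sorted(sub_texts, key=lambda s: s.timestamp)
    lines = [f"[{s.timestamp:7.2f}s] {id_info[s.timbre_id]}: {s.text}"
             for s in ordered]
    return "\n".join(lines)

# Example:
# assemble_target_text(
#     [SubText("...", 12.4, 1), SubText("...", 3.1, 0)],
#     {0: "Agent", 1: "Customer"})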
In order to further improve the efficiency and accuracy of voice recognition, the mute segment in the voice to be recognized can be deleted, so that the voice to be recognized with higher continuity is obtained.
In an embodiment, the obtaining the voice to be recognized may specifically be: acquiring an initial voice to be recognized; and determining a mute segment in the initial voice to be recognized, and deleting the mute segment in the initial voice to be recognized to obtain the voice to be recognized.
Specifically, the initial voice to be recognized may be obtained through a voice acquisition device such as a microphone, or may be obtained by calling an audio file or a video file in a database, or may be obtained in other manners.
After the initial voice to be recognized is obtained, the mute segments in it can be determined from its voice signal, where a mute segment is a segment of the initial voice that contains no voice content. The mute segments may be determined based on a preset duration: a segment with no voice content within the preset duration is determined to be a mute segment. For example, with the preset duration set to 100 ms, a segment of the initial voice to be recognized that lasts no more than 100 ms and contains no voice content is determined to be a mute segment. After the mute segments so determined are deleted from the initial voice to be recognized, the voice to be recognized is obtained.
In this embodiment, a silence segment in the initial speech to be recognized is determined, and the silence segment in the initial speech to be recognized is deleted, so that the speech to be recognized can be obtained, and the speech to be recognized is recognized, so that the efficiency of speech recognition can be improved; meanwhile, the voice recognition is carried out on the voice to be recognized, so that the influence of a mute segment on the logfbank characteristic of the voice to be recognized is avoided, and the accuracy of the voice recognition is improved.
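A frame-energy sketch of this deletion step is given below. The energy threshold and the criterion for "no voice content" are illustrative assumptions (the patent does not specify a detector); only the 100 ms bound on deleted runs follows the example above.

```python
import numpy as np

def remove_silence(signal, sample_rate=16000, frame_ms=10,
                   energy_thresh=1e-4, max_silence_ms=100):
    """Energy-based sketch of mute-segment deletion: frames whose mean
    energy falls below `energy_thresh` count as silent, and runs of
    silent frames up to `max_silence_ms` long are dropped."""
    x = np.asarray(signal, dtype=np.float64)
    flen = int(sample_rate * frame_ms / 1000)
    n = len(x) // flen
    frames = x[:n * flen].reshape(n, flen)
    silent = (frames ** 2).mean(axis=1) < energy_thresh
    keep, i, max_run = [], 0, max_silence_ms // frame_ms
    while i < n:
        if silent[i]:
            j = i
            while j < n and silent[j]:
                j += 1
            if j - i > max_run:          # longer pauses are kept as-is
                keep.extend(range(i, j))
            i = j                        # mute runs (<= 100 ms) are dropped
        else:
            keep.append(i)
            i += 1
    return frames[keep].reshape(-1) if keep else x[:0]
```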
In order to further improve the intelligentization level of voice recognition, semantic analysis can be performed on the voice recognition text, and punctuation marks are added to the voice recognition text after the semantic analysis, so that the voice recognition text is convenient for a user to read and understand.
In one embodiment, outputting the speech recognition text may be accomplished by: carrying out semantic analysis on the voice recognition text to obtain a semantic analysis result; based on semantic analysis results, punctuation marks are added in the voice recognition text; and outputting the voice recognition text added with the punctuation.
Specifically, semantic analysis may be performed on the voice recognition text by a semantic analysis model to obtain the semantic analysis result, where the semantic analysis model may be, for example, a network model for analyzing the semantics of text. The voice recognition text requiring semantic analysis is input into the semantic analysis model, which outputs the semantic analysis result. Based on this result, punctuation marks can be added to the analyzed voice recognition text according to punctuation usage rules, and the punctuated voice recognition text is then output.
By way of example, the semantic analysis model may be trained based on the initial semantic analysis model in the following training manner.
Training word and sentence samples for model training are extracted from a corpus, and the semantics of each training sample are labeled with a sample tag. The initial semantic analysis model is then trained in a supervised manner: the training samples are input into the initial semantic analysis model to obtain the target semantic tags it outputs, the corresponding loss function value is computed from the target semantic tags and the sample tags of the training samples, and the parameters of the initial semantic analysis model are tuned according to the loss value, finally yielding the trained semantic analysis model. The initial semantic analysis model may be, for example, but is not limited to, at least one of a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short-Term Memory (LSTM) network.
In the embodiment, semantic analysis is performed on the voice recognition text, punctuation marks are added into the voice recognition text based on semantic analysis results, the semantics of the voice recognition text can be clearer by adding the punctuation marks, reading obstacles caused by the absence of the punctuation marks during reading are avoided, and therefore the efficiency of reading and understanding the voice recognition text by a user can be improved, and the intelligentization level of the method is improved.
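One way to realize such a model is to cast punctuation insertion as token classification: for each token of the recognized text, predict the punctuation mark (if any) that follows it. The bidirectional-LSTM tagger below is a minimal illustrative sketch under that assumption; the patent allows any of the network types listed above. For Chinese recognition text the tokens can be characters, so direct concatenation suffices when rendering the result.

```python
import torch
import torch.nn as nn

PUNCT = ["", ",", ".", "?"]                 # class 0 = no punctuation follows

class PunctTagger(nn.Module):
    """Predict, for each input token, the punctuation class that follows it."""
    def __init__(self, vocab_size, d_emb=128, d_hid=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_hid, len(PUNCT))

    def forward(self, token_ids):           # (batch, seq) -> (batch, seq, classes)
        h, _ = self.lstm(self.emb(token_ids))
        return self.head(h)

def add_punctuation(tokens, tagger, token_to_id):
    """Insert the predicted punctuation mark after each token."""
    ids = torch.tensor([[token_to_id[t] for t in tokens]])
    pred = tagger(ids).argmax(-1)[0]        # one punctuation class per token
    return "".join(t + PUNCT[p] for t, p in zip(tokens, pred.tolist()))
```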
When speech recognition is performed on speech to be recognized, a speech recognition model is required. In order to obtain a speech recognition model with high accuracy, in an embodiment, the speech recognition model may be trained based on the following manner:
inputting the first voice sample and the second voice sample into an initial voice recognition model to obtain a first predicted text corresponding to the first voice sample and a second predicted text corresponding to the second voice sample; determining a first penalty based on the first predicted text and the first text sample, and determining a second penalty based on the second predicted text and the second text sample; based on the first loss and the second loss, model parameters of the initial speech recognition model are adjusted to obtain a speech recognition model.
Specifically, the speech recognition model may be a neural network model trained from the initial speech recognition model, the first voice samples and the second voice samples, where the first voice samples may be voice samples including a plurality of sample technical terms in the target field and the second voice samples may be voice samples including words in the general field. The initial speech recognition model may be an initial neural network model for speech recognition, for example, but not limited to, at least one of an ASR deep-learning Conformer model, a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short-Term Memory (LSTM) network.
Inputting the first voice sample into an initial voice recognition model to obtain a first predicted text output by the initial voice recognition model, and determining a first loss corresponding to the first predicted text based on the first predicted text and a first text sample corresponding to the first voice sample; and inputting the second voice sample into the initial voice recognition model to obtain a second predicted text output by the initial voice recognition model, and determining a second loss corresponding to the second predicted text based on the second predicted text and a second text sample corresponding to the second voice sample. The first loss and the second loss may be determined by a loss function used in training, and the loss function may be, for example, a cross entropy loss function or the like.
Based on the first loss and the second loss, the model parameters of the initial speech recognition model are adjusted to obtain the speech recognition model, for example according to a preset parameter-adjustment threshold. When at least one of the first loss and the second loss is greater than the parameter-adjustment threshold, the current model parameters are judged not yet optimal and adjustment continues; when both the first loss and the second loss are less than or equal to the threshold, the current model parameters are judged adequate, the adjustment of the model parameters stops, and the current initial speech recognition model is the trained speech recognition model.
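A compact sketch of one such training update is shown below: the two losses are computed on a target-field batch and a general-field batch, and the model parameters are adjusted on their sum. The use of CTC as the loss is an illustrative assumption (the embodiment only requires some loss function, e.g. cross entropy), and the stopping rule based on the parameter-adjustment threshold is omitted for brevity.

```python
import torch

def train_step(model, optimizer, ctc_loss, batch_domain, batch_general):
    """One mixed-field update: a first loss on target-field (first) samples,
    a second loss on general-field (second) samples, joint backward pass."""
    model.train()
    optimizer.zero_grad()
    losses = []
    for feats, feat_lens, targets, target_lens in (batch_domain, batch_general):
        log_probs = model(feats).log_softmax(-1)           # (batch, frames, vocab)
        losses.append(ctc_loss(log_probs.transpose(0, 1),  # CTC wants (T, N, C)
                               targets, feat_lens, target_lens))
    first_loss, second_loss = losses
    (first_loss + second_loss).backward()                  # joint adjustment
    optimizer.step()
    return first_loss.item(), second_loss.item()

# e.g. ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)
```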
In this embodiment, the initial speech recognition model is model-trained based on the first speech sample and the second speech sample, so that a speech recognition model can be obtained. The speech recognition model obtained through training can accurately recognize the professional terms in the speech in the target field on the basis of accurately recognizing the speech content in the universal field, so that the speech recognition text output by the speech recognition model is closer to the real content in the speech to be recognized, the accuracy of the speech recognition text is higher, and the accuracy of the speech recognition is improved.
Fig. 2 is a flow chart of a speech recognition method according to an embodiment of the present invention. As shown in fig. 2, the initial speech recognition model may be trained based on a general-field corpus with a total speech duration of 10000 hours and a target-field corpus with a total speech duration of 10000 hours, to obtain the speech recognition model. The 10000-hour general-field corpus may be, for example, the 10000-hour WenetSpeech corpus, and the 10000-hour target-field corpus may be a vertical-field corpus. Features are extracted from the voice samples in each corpus, for example 80-dimensional logfbank features, and the extracted features are input into the initial speech recognition model for model training to obtain the trained speech recognition model. The initial speech recognition model may comprise an encoder and a decoder; for example, it may be an ASR deep-learning Conformer model whose encoder is a Transformer encoder and whose decoder is a CTC beam-search decoder.
The voice to be recognized is segmented to obtain at least two voice fragments, features are extracted from each fragment, and the extracted features are input into the trained speech recognition model to obtain the voice recognition text corresponding to the voice to be recognized. Further, corresponding punctuation marks can be added to the voice recognition text to make it easy to read and understand, and the text is organized into a transcript with time stamps and speaker identification information, then packaged and output.
The embodiment of the invention can rapidly collect and determine voice samples in different ways, and provides a speech recognition method for voice to be recognized in a vertical target field. The method can convert the recorded data awaiting recognition in a business into text data of high accuracy, which is convenient for subsequent development and use. Compared with a universal speech recognition method, the scheme extends well to non-universal vertical fields and can improve the accuracy of speech recognition.
Because the speech recognition model is trained from first voice samples that include a plurality of sample technical terms in the target field, the problem of private data such as company recordings being leaked through the use of an external speech recognition service interface can be avoided, which greatly safeguards data security. Meanwhile, the method converts unstructured voice data into text data for digital storage, which is convenient to review and analyze, allows various applications to be developed flexibly, suits more business scenarios, achieves cost reduction and efficiency improvement, and can provide a vertical-field speech recognition solution for external enterprises.
The following describes a voice recognition device provided by an embodiment of the present invention, and the voice recognition device described below and the voice recognition method described above may be referred to correspondingly.
Fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention, and referring to fig. 3, a voice recognition device 300 includes:
an acquisition module 310, configured to acquire a voice to be recognized;
an extracting module 320, configured to extract a logfbank feature of the voice to be recognized when the voice to be recognized includes a voice segment corresponding to the target domain;
the processing module 330 is configured to input the logfbank feature into a pre-trained speech recognition model to obtain a speech recognition text output by the speech recognition model, where the speech recognition model is obtained by training based on a plurality of first speech samples in a target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples in a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples include a plurality of sample terms in the target domain;
and an output module 340 for outputting the speech recognition text.
In one example embodiment, the plurality of first speech samples are determined based on:
performing voice synthesis on the first training text by adopting a TTS technology to obtain a first training voice;
acquiring a second training voice input by a user based on a second training text; the first training text and the second training text both comprise sample professional terms in the target field;
the first training speech and the second training speech are determined as a plurality of first speech samples.
In one example embodiment, the speech to be recognized comprises conversational speech;
the extraction module 320 is specifically configured to: extract each timbre feature in the dialogue voice; cluster voices corresponding to the same timbre feature to obtain the clustered voice corresponding to each timbre feature; and respectively extract the logfbank features of each clustered voice;
the processing module 330 is specifically configured to: and inputting the logfbank characteristics of each clustered voice into a pre-trained voice recognition model to obtain a voice recognition text output by the voice recognition model.
In an example embodiment, the speech to be recognized includes at least two speech segments and a timestamp corresponding to each speech segment; the voice recognition text comprises a plurality of sub-texts and time stamps corresponding to the sub-texts, and voice fragments correspond to the sub-texts one by one;
the output module 340 is specifically configured to:
sequencing the sub-texts based on the order of the time stamps corresponding to the voice fragments and the time stamps of the sub-texts in the voice recognition text to obtain a sequencing result;
based on each timbre feature, adding corresponding identification information to each sub-text in the sequencing result to obtain a target voice recognition text; the identification information is used for identifying different users;
and outputting target voice recognition text.
In an example embodiment, the obtaining module 310 is specifically configured to:
acquiring an initial voice to be recognized;
and determining a mute segment in the initial voice to be recognized, and deleting the mute segment in the initial voice to be recognized to obtain the voice to be recognized.
In one example embodiment, the output module 340 is specifically configured to:
carrying out semantic analysis on the voice recognition text to obtain a semantic analysis result;
based on semantic analysis results, punctuation marks are added in the voice recognition text;
and outputting the voice recognition text added with the punctuation.
In one example embodiment, the speech recognition model is trained as follows:
inputting the first speech samples and the second speech samples into an initial speech recognition model to obtain a first predicted text corresponding to each first speech sample and a second predicted text corresponding to each second speech sample;
determining a first loss based on the first predicted text and the first text sample, and determining a second loss based on the second predicted text and the second text sample;
and adjusting the model parameters of the initial speech recognition model based on the first loss and the second loss, to obtain the speech recognition model.
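A minimal sketch of one such training step follows, assuming a CTC-style acoustic model and a fixed weighting between the two losses; the embodiment specifies neither the loss function, the architecture, nor how the first and second losses are combined.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, optimizer, domain_batch, general_batch, alpha=0.5):
    feats1, targets1, in_len1, tgt_len1 = domain_batch    # first speech samples
    feats2, targets2, in_len2, tgt_len2 = general_batch   # second speech samples

    # Model output assumed (batch, time, classes); CTC wants (time, batch, classes).
    log_probs1 = model(feats1).log_softmax(-1).transpose(0, 1)
    log_probs2 = model(feats2).log_softmax(-1).transpose(0, 1)

    first_loss = ctc(log_probs1, targets1, in_len1, tgt_len1)
    second_loss = ctc(log_probs2, targets2, in_len2, tgt_len2)

    # Adjust the model parameters on the combined loss; alpha balances the
    # target domain against the general domain (an assumed weighting).
    loss = alpha * first_loss + (1 - alpha) * second_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```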
The apparatus of this embodiment may be used to execute the method of any of the speech recognition method embodiments; its specific implementation process and technical effects are similar to those of the method embodiments, so reference may be made to the detailed description of the speech recognition method embodiments, which is not repeated here.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device may include: a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a speech recognition method comprising: acquiring speech to be recognized; extracting logfbank features of the speech to be recognized in a case where the speech to be recognized includes a speech segment corresponding to a target domain; inputting the logfbank features into a pre-trained speech recognition model to obtain the speech recognition text output by the speech recognition model, where the speech recognition model is trained on a plurality of first speech samples from the target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples from a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples include a plurality of sample technical terms of the target domain; and outputting the speech recognition text.
Further, the logic instructions in the memory 430 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the speech recognition method provided above, the method comprising: acquiring speech to be recognized; extracting logfbank features of the speech to be recognized in a case where the speech to be recognized includes a speech segment corresponding to a target domain; inputting the logfbank features into a pre-trained speech recognition model to obtain the speech recognition text output by the speech recognition model, where the speech recognition model is trained on a plurality of first speech samples from the target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples from a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples include a plurality of sample technical terms of the target domain; and outputting the speech recognition text.
In yet another aspect, embodiments of the present invention further provide a computer program product comprising a computer program which may be stored on a non-transitory computer-readable storage medium and which, when executed by a processor, performs the speech recognition method provided above, the method comprising: acquiring speech to be recognized; extracting logfbank features of the speech to be recognized in a case where the speech to be recognized includes a speech segment corresponding to a target domain; inputting the logfbank features into a pre-trained speech recognition model to obtain the speech recognition text output by the speech recognition model, where the speech recognition model is trained on a plurality of first speech samples from the target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples from a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples include a plurality of sample technical terms of the target domain; and outputting the speech recognition text.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement it without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech recognition method, comprising:
acquiring speech to be recognized;
extracting logfbank features of the speech to be recognized in a case where the speech to be recognized includes a speech segment corresponding to a target domain;
inputting the logfbank features into a pre-trained speech recognition model to obtain a speech recognition text output by the speech recognition model, wherein the speech recognition model is trained on a plurality of first speech samples from the target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples from a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples comprise a plurality of sample technical terms of the target domain;
and outputting the speech recognition text.
2. The speech recognition method of claim 1, wherein the plurality of first speech samples are determined by:
performing speech synthesis on a first training text using a TTS technology to obtain first training speech;
acquiring second training speech input by a user based on a second training text, wherein both the first training text and the second training text comprise sample technical terms of the target domain;
and determining the first training speech and the second training speech as the plurality of first speech samples.
3. The speech recognition method of claim 1, wherein the speech to be recognized comprises dialogue speech;
the extracting logfbank features of the speech to be recognized comprises:
extracting each timbre feature in the dialogue speech;
clustering the speech corresponding to the same timbre feature to obtain the clustered speech corresponding to that timbre feature;
and extracting the logfbank features of each clustered speech separately;
and the inputting the logfbank features into a pre-trained speech recognition model to obtain a speech recognition text output by the speech recognition model comprises:
inputting the logfbank features of each clustered speech into the pre-trained speech recognition model to obtain the speech recognition text output by the speech recognition model.
4. The speech recognition method of claim 3, wherein the speech to be recognized comprises at least two speech segments and a timestamp corresponding to each of the speech segments; the speech recognition text comprises a plurality of sub-texts and the timestamp corresponding to each sub-text, and the speech segments correspond one-to-one to the sub-texts;
the outputting the speech recognition text comprises:
sorting the sub-texts based on the order of the timestamps corresponding to the speech segments and the timestamps of the sub-texts in the speech recognition text, to obtain a sorting result;
adding, based on each timbre feature, corresponding identification information to each sub-text in the sorting result to obtain a target speech recognition text, wherein the identification information is used to identify different users;
and outputting the target speech recognition text.
5. The speech recognition method according to any one of claims 1 to 4, wherein the acquiring speech to be recognized comprises:
acquiring initial speech to be recognized;
and determining silent segments in the initial speech to be recognized and deleting them, to obtain the speech to be recognized.
6. The speech recognition method according to any one of claims 1 to 4, wherein the outputting the speech recognition text comprises:
performing semantic analysis on the speech recognition text to obtain a semantic analysis result;
adding punctuation marks to the speech recognition text based on the semantic analysis result;
and outputting the punctuated speech recognition text.
7. The speech recognition method according to any one of claims 1 to 4, wherein the speech recognition model is trained by:
inputting the first speech samples and the second speech samples into an initial speech recognition model to obtain a first predicted text corresponding to each first speech sample and a second predicted text corresponding to each second speech sample;
determining a first loss based on the first predicted text and the first text sample, and determining a second loss based on the second predicted text and the second text sample;
and adjusting the model parameters of the initial speech recognition model based on the first loss and the second loss, to obtain the speech recognition model.
8. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire speech to be recognized;
an extraction module, configured to extract logfbank features of the speech to be recognized in a case where the speech to be recognized includes a speech segment corresponding to a target domain;
a processing module, configured to input the logfbank features into a pre-trained speech recognition model to obtain a speech recognition text output by the speech recognition model, wherein the speech recognition model is trained on a plurality of first speech samples from the target domain, a first text sample corresponding to each first speech sample, a plurality of second speech samples from a general domain, and a second text sample corresponding to each second speech sample, and the first speech samples comprise a plurality of sample technical terms of the target domain;
and an output module, configured to output the speech recognition text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the speech recognition method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number: CN202311233453.4A | Priority Date: 2023-09-21 | Filing Date: 2023-09-21 | Title: Speech recognition method, device, electronic equipment and storage medium

Publications (1)

Publication Number: CN117334188A | Publication Date: 2024-01-02



Legal Events

Code: PB01 | Title: Publication
Code: SE01 | Title: Entry into force of request for substantive examination