CN117496983A - Speech recognition method and device, electronic equipment and storage medium - Google Patents

Speech recognition method and device, electronic equipment and storage medium

Info

Publication number
CN117496983A
CN117496983A
Authority
CN
China
Prior art keywords: voice, data, role, time, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310772696.9A
Other languages
Chinese (zh)
Inventor
孟庆林
蒋宁
吴海英
陆全
夏粉
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202310772696.9A
Publication of CN117496983A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/26 Speech recognition; Speech to text systems
    • G10L17/02 Speaker identification or verification techniques; Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/14 Speaker identification or verification techniques; Decision making techniques; Pattern matching strategies; Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Speaker identification or verification techniques; Artificial neural networks; Connectionist approaches
    • G10L21/0272 Speech enhancement, e.g. noise reduction or echo cancellation; Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/87 Detection of presence or absence of voice signals; Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a speech recognition method and device, electronic equipment and a storage medium. The method includes the following steps: acquiring at least one voice role in audio and the voice period corresponding to each voice role, each voice role being used to represent one speaker in the audio; identifying the voice data in the audio and performing time positioning on the voice data to obtain a target voice period; correcting the voice period corresponding to each voice role according to the target voice period to obtain a corrected voice period corresponding to each voice role; converting the voice data corresponding to the corrected voice periods into text data to obtain the speaking text data of each voice role; and identifying the role type of each voice role according to its speaking text data, the role type representing the identity of the speaker. With this method and device, the role types of the different voice roles in the audio can be accurately identified.

Description

Speech recognition method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the field of voice recognition, in particular to a voice recognition method and device, electronic equipment and storage medium.
Background
In some scenarios, the type of each voice role in audio needs to be determined. For example, in an agent quality-inspection scenario, the agent's calls must be quality-inspected, and a call recording generally contains voice data of multiple voice roles. Before inspecting the agent's speech, it is therefore necessary to determine which segments of speech in the audio belong to the agent and which belong to the customer; only after the segments belonging to the agent have been identified can subsequent quality inspection be performed on them. However, current speech recognition schemes can only recognize that one or more voice roles exist in a piece of audio and have difficulty determining the type of each voice role; for example, they may recognize that two voice roles exist in a piece of audio but cannot tell which of the two is the agent and which is the user.
Disclosure of Invention
The application provides a voice recognition method and device, electronic equipment and storage medium, which can accurately recognize types of different voice roles in audio.
In a first aspect, the present application provides a method of speech recognition, the method may include:
acquiring at least one voice role in audio to be processed and a voice period corresponding to the voice role, wherein each voice role is used for representing one speaker in the audio to be processed;
identifying voice data in the audio to be processed, and performing time positioning on the voice data to obtain a target voice period;
correcting the voice time periods corresponding to each voice role according to the target voice time periods to obtain corrected voice time periods corresponding to each voice role;
converting the voice data corresponding to the corrected voice time period of each voice character into text data to obtain speaking text data of each voice character; and identifying the role type corresponding to each voice role according to the speaking text data of each voice role, wherein the role type is used for representing the identity of a speaker.
In a second aspect, the present application provides a speech recognition apparatus, which may include:
the role division module is used for acquiring at least one voice role in the audio to be processed and a voice period corresponding to the voice role, wherein each voice role is used for representing one speaker in the audio to be processed;
the end point detection module is used for identifying voice data in the audio to be processed and carrying out time positioning on the voice data to obtain a target voice period;
the time period correction module is used for correcting the voice time period corresponding to each voice role according to the target voice time period to obtain a corrected voice time period corresponding to each voice role;
The character recognition module is used for converting the voice data corresponding to the corrected voice time period of each voice character into text data to obtain the speaking text data of each voice character; and identifying the role type corresponding to each voice role according to the speaking text data of each voice role, wherein the role type is used for representing the identity of a speaker.
In a third aspect, the present application provides an electronic device, which may include:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor, one or more of the computer programs being executable by the at least one processor to enable the at least one processor to perform the speech recognition method described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the above-described speech recognition method.
According to the embodiments provided by the application, the voice roles in the audio can be acquired, realizing a preliminary division of the different speakers in the audio. However, because voice data is not specifically extracted during this processing, the frame data corresponding to the acquired voice roles may contain non-voice data (that is, data not corresponding to any voice role, such as noise data). The voice data in the audio and its voice periods, namely the target voice periods, are therefore extracted, and the periods of the multi-frame data corresponding to each voice role are corrected using the target voice periods. In this way the voice data can be screened out of the multi-frame data corresponding to each voice role, and the voice periods corresponding to that voice data, namely the corrected voice periods, can be obtained, improving the hit rate of the valid periods of the voice roles. Taking the corrected voice periods as the voice periods corresponding to the voice roles guarantees the accuracy of the voice periods and provides a technical basis for accurately identifying the role types. On the basis of this accurate division of the voice periods corresponding to the voice roles, the voice data is converted to text and the text data is recognized, so that the different role types in the audio are accurately identified.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not constitute a limitation of the application. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a schematic diagram of a related art speaker judgment method based on voiceprint clustering;
FIG. 2 is a flowchart of a method for speech recognition according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a first neural network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a second neural network according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a third neural network according to an embodiment of the present application;
fig. 6 is a schematic diagram of a third neural network using a Conformer model as an encoder according to an embodiment of the present application;
Fig. 7 is a block diagram of a voice character recognition device according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For a better understanding of the technical solutions of the present application, exemplary embodiments of the present application are described below with reference to the accompanying drawings; various details of the embodiments of the application are included to facilitate understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted in the following description for clarity and conciseness.
In the absence of conflict, embodiments and features of embodiments herein may be combined with one another.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "connected" and the like is not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and this application and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In an existing speech recognition scheme, as shown in fig. 1, a speaker judgment system 100 based on voiceprint clustering first acquires a speech signal with an acquisition module 101, sends the speech signal to a voiceprint extraction module 102, sends the output data of the voiceprint extraction module 102 to a clustering module 103, sends the output of the clustering module 103 to a correction module 104, and sends the output data of the correction module 104 to an output module 105. The acquisition module 101 is used to collect the call speech of the two parties to the call; the voiceprint extraction module 102 is used to extract voiceprint features from the call speech; the clustering module 103 is configured to divide the call speech into a first voice corresponding to one party and a second voice corresponding to the other party according to the extracted first voiceprint feature and second voiceprint feature; the correction module 104 is configured to determine whether the standard deviation of the first time delay and the standard deviation of the second time delay are both smaller than a delay threshold: if so, it sends an instruction to output the result to the output module 105; otherwise, it sends a re-clustering instruction to the clustering module 103; the output module 105 is configured to output the first voice and the second voice. Although this scheme can realize speaker segmentation and clustering, it cannot locate the roles (for example, identify which speaker is the agent and which is the customer), its real-time performance is low, and its recognition effect is poor.
The embodiments of the application introduce a speech recognition method. The voice roles in the audio are acquired based on a speaker segmentation and clustering method, so that the different speakers in the audio are preliminarily divided. However, the speaker segmentation and clustering method cannot distinguish voice data from non-voice data in the audio, so non-voice data may be present in the audio data corresponding to the voice roles; the audio data corresponding to the voice roles acquired by the speaker segmentation and clustering method is therefore not pure enough, and the voice data cannot be guaranteed to be entirely valid, which affects the recognition accuracy when role type recognition is subsequently performed on that audio data. Using a voice endpoint detection method, the voice data in the audio and the voice periods of that voice data (namely the target voice periods) can be extracted, and the periods of the multi-frame audio data corresponding to each voice role can be corrected with the target voice periods to obtain the voice periods corresponding to the voice data within that multi-frame audio data (namely the corrected voice periods, which are the valid voice periods). Extracting the valid voice periods from the voice periods corresponding to the voice roles guarantees the accuracy of the voice periods and provides a technical basis for accurate recognition of the role types. On the basis of this accurate division of the voice periods corresponding to the voice roles, the voice data is converted to text and the text data is recognized, so that the different role types in the audio are accurately identified. The method solves the problems that the current speech recognition technology cannot identify role types, has low real-time performance, and has a poor recognition effect.
The following describes the speech recognition method according to the embodiment of the present application in detail.
Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 2, the method may include steps S11-S14:
s11, at least one voice role in the audio to be processed and a voice period corresponding to the voice role are acquired, and each voice role is used for representing one speaker in the audio to be processed.
At least one voice character in the audio to be processed and a voice period corresponding to the voice character can be obtained according to the speaker segmentation and clustering method.
The speaker segmentation and clustering method can segment the audio at the frame level and then cluster the data frames according to the voiceprint features of each segmented frame of data, so that the data frames of the same voice role (that is, the same speaker) are clustered together, realizing the voice role recognition function; the start time and/or end time of each frame of voice data is determined, and the voice period corresponding to each voice role is obtained accordingly.
The voice roles and the corresponding voice periods thereof can be obtained according to the speaker segmentation and clustering method through a preset speaker segmentation and clustering model.
Acquiring at least one voice character and a voice period corresponding to the voice character in the audio to be processed, wherein the voice period comprises the following steps:
Carrying out frame-level segmentation on the audio to be processed through a preset speaker segmentation and clustering model to obtain multi-frame data;
extracting voiceprint features of each frame of data through speaker segmentation and clustering models, and gathering data frames with similarity of the voiceprint features being greater than or equal to a preset similarity threshold value into one type, wherein multi-frame data in the same type corresponds to the same voice role;
and detecting the starting time and/or the ending time of each frame of data in the multi-frame data corresponding to each voice role through the speaker segmentation and clustering model, and obtaining the at least one voice role and the voice time period corresponding to the voice role.
The voice roles in the audio are acquired based on the speaker segmentation and clustering method, which realizes a preliminary division of the different speakers in the audio and a preliminary determination of the voice periods corresponding to the voice roles, providing a technical basis for the recognition of the voice role types.
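For illustration only, the following Python sketch shows one possible realization of the frame-level segmentation and voiceprint clustering described above; the extract_voiceprint() callable, the similarity threshold, and the running-centroid update are assumptions made for the sketch, not details taken from this application.

```python
# Illustrative sketch: cluster frames by voiceprint similarity and collect
# the speech periods of each resulting voice role.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diarize(frames, frame_times, extract_voiceprint, sim_threshold=0.75):
    """frames: per-frame audio arrays; frame_times: list of (start, end) in seconds."""
    centroids, assignments = [], []
    for frame in frames:
        emb = np.asarray(extract_voiceprint(frame))       # voiceprint feature of this frame
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= sim_threshold:           # similar enough: same voice role
            role = int(np.argmax(sims))
            centroids[role] = 0.9 * centroids[role] + 0.1 * emb   # update running centroid
        else:                                             # otherwise open a new voice role
            role = len(centroids)
            centroids.append(emb)
        assignments.append(role)
    periods = {}                                          # voice periods per voice role
    for role, (start, end) in zip(assignments, frame_times):
        periods.setdefault(role, []).append((start, end))
    return periods
```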
The speaker segmentation and clustering model is obtained by the following steps:
acquiring first training data;
inputting the first training data into a pre-created first neural network, and acquiring a speaker segmentation and clustering model when the loss value of the first neural network meets a first preset condition;
The first training data are audio data, and each frame of data in the audio data is marked with a corresponding voice role and first time data; the first time data includes: start time and/or end time; the first neural network is used for classifying voice roles of the input audio data at a frame level and determining a voice period corresponding to each voice role;
the first neural network includes: a first feature extraction layer, a first local area network layer, a plurality of Transformer layers, a second local area network layer, and a first loss calculation layer;
a first feature extraction layer for extracting first data features of each frame of data in the input audio data; the first data characteristic comprises a voiceprint characteristic of each frame of data and a start time and/or an end time;
and the plurality of Transformer layers are used for encoding and decoding according to the first data features based on the self-attention mechanism, so as to acquire at least one voice role and the voice period corresponding to each voice role.
Voice conversations of a plurality of voice roles over a period of time (for example, 5000 hours in total) may be pre-recorded [the voice conversations may be mono audio in wav (waveform audio file) format]; the different voice roles of each frame of data in the voice conversations are marked, the corresponding time data of each frame (for example, the start time and/or end time) is marked, and the marked voice conversations are used as the first training data.
The first neural network with the data segmentation and clustering functions can be created in advance, and in order to improve the encoding performance and speed, the encoder in the first neural network can be realized with a Transformer model.
The Transformer model is a sequence model based on the self-attention mechanism; it can effectively encode temporal information in its encoder part, has much better processing capability than LSTM (long short-term memory) models, and is fast. The Transformer is an encoder-decoder model: its encoder is composed of a stack of several (e.g. 6) encoder blocks that are identical in structure but do not share weights, and its decoder is likewise composed of a stack of several (e.g. 6) decoder blocks that are identical in structure but do not share weights. Each encoder block is divided into two sub-layers: the first is a self-attention layer, which helps the encoder look at other words while encoding a particular word, and the second is a feed-forward neural network layer. Each decoder block also has these two layers, plus an additional attention layer that helps the decoder focus on the relevant parts of the encoder's input sentence.
As shown in fig. 3, the first neural network 10 may include: a first feature extraction layer 11, a first local area network layer 12, a plurality of first Transformer layers 13, a second local area network layer 14, and a first loss calculation layer 15;
the input port of the first feature extraction layer 11 is configured to receive input data; the output port of the first feature extraction layer 11 is connected to the input port of the first local area network layer 12; the output port of the first local area network layer 12 is connected to the input port of the first one of the plurality of first Transformer layers 13; the output port of the last one of the plurality of first Transformer layers 13 is connected to the input port of the second local area network layer 14; and the output port of the second local area network layer 14 is connected to the input port of the first loss calculation layer 15. The input port of each first Transformer layer 13 other than the first one is connected to the output port of the previous first Transformer layer 13, and the output port of each first Transformer layer 13 other than the last one is connected to the input port of the next first Transformer layer 13;
A first feature extraction layer 11 for extracting first data features of each frame of data in the input audio data; the first data characteristic includes a voiceprint characteristic of each frame of data and a start time and/or an end time.
The first local area network layer 12 is configured to forward the first data features to the first one of the plurality of first Transformer layers 13.
The plurality of first Transformer layers 13 are used for encoding and decoding according to the first data features based on the self-attention mechanism, to obtain at least one voice role and the voice period corresponding to each voice role.
The second local area network layer 14 is configured to forward the at least one voice role and the corresponding voice period output by the last one of the plurality of first Transformer layers 13 to the first loss calculation layer 15;
a first loss calculation layer 15, configured to calculate a loss value according to the received at least one voice character and a voice period corresponding to the voice character;
the first neural network further includes: a second loss calculation layer 16; the second loss calculation layer 16 is configured to calculate loss values for the at least one voice role and the corresponding voice period output by each first Transformer layer 13 other than the last one of the plurality of first Transformer layers 13.
The plurality of first Transformer layers 13 may include: first Transformer layers 13-1, …, 13-k, where k is a positive integer greater than 1; for example, if k = 2, the plurality of first Transformer layers 13 includes the first Transformer layer 13-1 and the first Transformer layer 13-2.
The first feature extraction layer 11 may be used to extract MFCC features of the input data, for example MFCC23 (23 MFCC coefficients).
MFCC is the abbreviation of Mel-scale Frequency Cepstral Coefficients, i.e., Mel-frequency cepstral coefficients. The Mel frequency scale is proposed based on the hearing characteristics of the human ear, and MFCC features are a set of key coefficients of the Mel cepstrum, so that MFCC cepstral features correspond more closely to the human nonlinear auditory system; the Mel scale has a nonlinear correspondence with frequency in Hz (hertz). MFCC computation uses this relation between the Mel scale and Hz frequency to obtain spectral features, and is mainly used for feature extraction from voice data and for reducing the computational dimensionality. For example, for one frame of 512-dimensional (sample-point) data, the most important 40 dimensions (typically) can be extracted after MFCC, achieving dimensionality reduction. MFCC features also retain semantically relevant content while filtering out irrelevant information such as background noise.
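As a hedged illustration of the MFCC23 feature extraction mentioned above, the following sketch uses the librosa library; the library choice, the placeholder file name, and the 25 ms / 10 ms framing parameters are assumptions, not requirements of this application.

```python
# Minimal sketch: extract 23-dimensional MFCC features from a mono recording.
import librosa

y, sr = librosa.load("call_audio.wav", sr=16000, mono=True)   # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=23,
                            n_fft=400, hop_length=160)        # 25 ms window, 10 ms hop at 16 kHz
print(mfcc.shape)   # (23, number_of_frames)
```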
The first local area network layer 12 and the second local area network layer 14 are both LAN (Local Area Network) layers.
The first loss calculation layer 15 may be a CE (Cross Entropy) loss layer, and the second loss calculation layer 16 may be a self-supervised clustering loss calculation layer.
The first neural network combines the plurality of first Transformer layers 13 with a plurality of loss functions (such as the CE loss and the self-supervised clustering loss), so that the modeling is more complete and the accuracy is higher.
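The following simplified PyTorch sketch conveys the general shape of such a frame-level classification network (input projection, stacked self-attention Transformer layers, per-frame role logits trained with a cross-entropy loss). The linear projections standing in for the local area network layers, the layer sizes, and the omission of the self-supervised clustering loss are assumptions for illustration; this is not the patented model itself.

```python
# Simplified sketch of a frame-level role classification network.
import torch
import torch.nn as nn

class DiarizationNet(nn.Module):
    def __init__(self, n_mfcc=23, d_model=256, n_heads=4, n_layers=2, n_roles=4):
        super().__init__()
        self.proj_in = nn.Linear(n_mfcc, d_model)          # input projection (stands in for the first LAN layer)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, n_roles)        # output projection (stands in for the second LAN layer)

    def forward(self, mfcc):                               # mfcc: (batch, frames, 23)
        hidden = self.encoder(self.proj_in(mfcc))          # self-attention over the frame sequence
        return self.proj_out(hidden)                       # per-frame role logits: (batch, frames, n_roles)

model = DiarizationNet()
frames = torch.randn(2, 300, 23)                           # two utterances, 300 frames each
labels = torch.randint(0, 4, (2, 300))                     # per-frame voice-role labels
logits = model(frames)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), labels.reshape(-1))   # CE loss on the final output
loss.backward()
```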
The first neural network is trained through the first training data, and after loss convergence (namely, the loss value of the first neural network meets a first preset condition), a model can be saved and used as a speaker segmentation and clustering model.
S12, recognizing voice data in the audio to be processed, and performing time positioning on the voice data to obtain a target voice period.
The voice data in the audio to be processed can be identified according to the voice endpoint detection method, and the voice data is positioned in time to obtain the target voice period.
The voice endpoint detection (End-point Detection, EPD) method can perform endpoint detection in the time domain or the frequency domain, distinguishing the voice data and non-voice data of an audio signal (for example, a piece of speech) as well as the start and end positions (that is, the start time and end time) of each piece of voice data and each piece of non-voice data; endpoint detection plays an important role in speech processing and recognition.
The target voice period can be obtained through a preset voice endpoint detection model according to the voice endpoint detection method.
Identifying voice data in the audio to be processed, and performing time positioning on the voice data to obtain a target voice period, wherein the method comprises the following steps:
the method comprises the steps of performing frame-level segmentation on audio to be processed through a preset voice endpoint detection model to obtain multi-frame data;
identifying voice data from multi-frame data through a voice endpoint detection model, and detecting the starting time and/or the ending time of each frame of voice data;
and obtaining the target voice time period according to the starting time and/or the ending time of the voice data of each frame through a voice endpoint detection model.
As described above, the voice roles in the audio to be processed (for example, speaker 1 and speaker 2; at this point the role types of speaker 1 and speaker 2, for example whether each belongs to the agent or the user, cannot yet be determined) can be accurately segmented by the speaker segmentation and clustering method, and one or more voice periods corresponding to each voice role are acquired. Although the speaker segmentation and clustering method alone can also obtain one or more voice periods of the voice roles in the audio, it does not extract the voice data when processing the audio, so the frame data corresponding to the acquired voice roles may contain non-voice data (that is, data not corresponding to any voice role, such as noise data). The voice endpoint detection method, by contrast, is based on the ability to distinguish voice data from non-voice data and can extract the voice data from the audio. Since only voice data is valid data for speech recognition, adding the voice endpoint detection method allows each segment of voice data in the audio to be accurately located in time, the valid voice periods in the audio to be determined, and those valid voice periods to be taken as the target voice periods. This provides a good auxiliary effect for the speaker segmentation and clustering method: the voice period corresponding to each voice role can be accurately corrected according to the target voice periods, which prevents non-voice data from being mixed in, prevents the voice periods of different voice roles from being confused, and prevents interference with the subsequent recognition of the role types.
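As an illustration of the time-positioning step, the sketch below converts per-frame speech probabilities (such as a voice endpoint detection model might output) into target voice periods; the 0.5 threshold and the 10 ms frame length are assumed values.

```python
# Sketch: turn per-frame speech probabilities into (start, end) target voice periods.
def frames_to_periods(speech_probs, frame_len=0.01, threshold=0.5):
    """speech_probs: iterable of per-frame probabilities that the frame is speech."""
    periods, start = [], None
    for i, p in enumerate(speech_probs):
        t = round(i * frame_len, 3)
        if p >= threshold and start is None:        # speech begins
            start = t
        elif p < threshold and start is not None:   # speech ends
            periods.append((start, t))
            start = None
    if start is not None:                           # audio ends while still in speech
        periods.append((start, round(len(speech_probs) * frame_len, 3)))
    return periods

print(frames_to_periods([0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.7, 0.9, 0.2]))
# [(0.02, 0.05), (0.06, 0.08)]
```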
The method for obtaining the voice endpoint detection model comprises the following steps:
acquiring second training data;
inputting second training data into a pre-created second neural network, and acquiring a voice endpoint detection model when the loss value of the second neural network meets a second preset condition;
the second training data is audio data marked with second time data; the second training data is frame level data; the second time data includes: start time and/or end time; the second neural network is used for performing time positioning on voice data in the input audio data;
the second neural network includes: a second feature extraction layer, a plurality of time-delay neural network (TDNN) layers, and a classifier layer;
a second feature extraction layer for extracting a second data feature of each frame of data in the input audio data; the second data features include time domain features and/or frequency domain features of each frame of data; the second data characteristic further comprises a start time and/or an end time of each frame of data;
the TDNN layers are used for obtaining the probability that each frame of data in the audio data is voice data according to the second data characteristics and performing time positioning on each frame of data;
and the classifier layer is used for dividing the voice data according to the probability that each frame of data is the voice data and acquiring a target voice period according to the time positioning of the voice data.
Creating the speech endpoint detection model may include:
acquiring second training data;
inputting second training data into a pre-created second neural network, and acquiring a voice endpoint detection model when the loss value of the second neural network meets a second preset condition;
the second training data is audio data marked with a voice period; the second neural network is used to time-locate the input data.
A voice conversation of a plurality of voice characters may be prerecorded for a period of time (e.g., 5000 hours), and the voice conversation marked with corresponding voice periods (e.g., start time and end time) may be marked with different voice data in the voice conversation as second training data.
In the embodiment of the present application, as shown in fig. 4, the second neural network 20 includes: a second feature extraction layer 21, a plurality of time-delay neural network TDNN layers 22, and a classifier layer 23;
the input port of the second feature extraction layer 21 is configured to receive input data; the output port of the second feature extraction layer 21 is connected to the input port of the first one of the plurality of TDNN layers 22; and the output port of the last one of the plurality of TDNN layers 22 is connected to the input port of the classifier layer 23. The input port of each TDNN layer 22 other than the first one is connected to the output port of the previous TDNN layer 22, and the output port of each TDNN layer 22 other than the last one is connected to the input port of the next TDNN layer 22;
A second feature extraction layer 21 for extracting a second data feature of each frame of data in the input audio data; the second data features include time domain features and/or frequency domain features of each frame of data; the second data characteristic further comprises a start time and/or an end time of each frame of data;
a plurality of TDNN layers 22 for obtaining the probability of each frame of data being speech data in the audio data according to the second data characteristics, and performing time localization on each frame of data;
and a classifier layer 23, configured to divide the voice data according to the probability that each frame of data is voice data, and obtain a target voice period according to time positioning of the voice data.
A TDNN (time-delay neural network) has multiple layers, each with a strong ability to abstract features; it can express the relations of speech features over time, has time invariance, does not require precise time alignment of the learned labels during learning, and can share weights, which makes it convenient to train.
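A minimal PyTorch sketch of a TDNN-style stack is given below for illustration; realizing the time-delay layers as dilated 1-D convolutions, and the particular kernel sizes and dilations, are assumptions rather than details from this application.

```python
# Sketch: dilated 1-D convolutions as TDNN layers, ending in per-frame speech probabilities.
import torch
import torch.nn as nn

class TDNNVad(nn.Module):
    def __init__(self, n_mfcc=23, hidden=128):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, dilation=1, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        self.classifier = nn.Conv1d(hidden, 2, kernel_size=1)   # speech vs. non-speech per frame

    def forward(self, mfcc):                  # mfcc: (batch, n_mfcc, frames)
        logits = self.classifier(self.tdnn(mfcc))
        return logits.softmax(dim=1)[:, 1]    # per-frame probability of speech

probs = TDNNVad()(torch.randn(1, 23, 500))    # -> shape (1, 500)
```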
The second feature extraction layer 21 may be used to extract MFCC23 features of the input data.
The plurality of TDNN layers 22 may include: TDNN layers 22-1, …, 22-n, where n is a positive integer greater than 1; for example, if n = 3, the plurality of TDNN layers 22 includes TDNN 22-1, TDNN 22-2, and TDNN 22-3.
The classifier layer 23 may include a Softmax classifier.
The second neural network is trained with the second training data, and after the loss converges (that is, the loss value of the second neural network meets the second preset condition), the model can be saved and used as the voice endpoint detection model. Combining the speaker segmentation and clustering model with the voice endpoint detection model significantly improves the inference speed and effectively improves the hit accuracy within the valid periods.
S13, correcting the voice time periods corresponding to each voice role according to the target voice time periods to obtain corrected voice time periods corresponding to each voice role.
Correcting the voice time period corresponding to each voice role according to the target voice time period to obtain a corrected voice time period corresponding to each voice role, including:
comparing the voice time period corresponding to each voice character with the target voice time period;
and deleting the voice periods which are not included in the target voice period from the voice periods corresponding to each voice character, and obtaining the corrected voice periods corresponding to the voice characters. The speech period not included in the target speech period refers to a speech period indicated by a start time and/or an end time of at least one frame of non-speech data.
The data input to the speaker segmentation and clustering model and to the voice endpoint detection model is the audio to be processed, and MFCC features, such as MFCC23, can be extracted from the audio. The extracted MFCC features are fed to the speaker segmentation and clustering model and to the voice endpoint detection model respectively; both models output results in the form of voice periods, and the output of the speaker segmentation and clustering model and the output of the voice endpoint detection model can be joined with concat() (which may be called a concatenation function) to form a unified output. The concat() method can be used to join two or more arrays into a new array, and can also be used to join strings, without changing the existing arrays or strings. The unified output can be used as the input data of the period correction module, and the period correction module corrects the voice periods corresponding to the voice roles to obtain the corrected voice periods.
The period correction module corrects the periods of the multi-frame data corresponding to each voice role using the target voice periods; in this way the voice data can be screened out of the multi-frame data corresponding to each voice role, and the voice periods corresponding to that voice data, namely the corrected voice periods, are obtained, which improves the accuracy of the valid periods of the voice roles. Taking the corrected voice periods as the voice periods corresponding to the voice roles guarantees the accuracy of those voice periods and provides a technical basis for improving the recognition accuracy of the role types.
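The sketch below illustrates the period-correction idea as an interval operation: each voice role's periods are kept only where they overlap the target voice periods. Treating the correction as an intersection (rather than deleting whole periods) is an illustrative assumption.

```python
# Sketch: keep only the parts of each role's periods that fall inside the VAD target periods.
def correct_periods(role_periods, target_periods):
    """role_periods: {role: [(start, end), ...]}; target_periods: [(start, end), ...]."""
    corrected = {}
    for role, periods in role_periods.items():
        kept = []
        for rs, re in periods:
            for ts, te in target_periods:
                start, end = max(rs, ts), min(re, te)
                if start < end:                      # overlapping part is valid speech
                    kept.append((start, end))
        corrected[role] = sorted(kept)
    return corrected

print(correct_periods({0: [(0.0, 2.0), (5.0, 6.0)], 1: [(2.0, 4.5)]},
                      [(0.5, 4.0), (5.2, 5.8)]))
# {0: [(0.5, 2.0), (5.2, 5.8)], 1: [(2.0, 4.0)]}
```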
S14, converting voice data corresponding to the corrected voice time period of each voice character into text data to obtain speaking text data of each voice character; and identifying the role type corresponding to each voice role according to the speaking text data of each voice role, wherein the role type is used for representing the identity of the speaker.
The voice data of each segment of valid speech in the audio, together with the accurate voice role and the accurate voice period corresponding to each segment of voice data (that is, the speaker to whom each segment of speech belongs, and the start time and end time of each segment of valid speech), can be obtained through the schemes of steps S11-S13. At this point each segment of voice data is still audio data; data features (for example, fbank80 features) can be collected from the voice data and input into a pre-created text conversion model to obtain the text data corresponding to the voice data.
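As a hedged example of collecting fbank80 features from a speech segment, the following sketch uses torchaudio's Kaldi-compatible filterbank computation; the library choice and the file name are assumptions, not part of this application.

```python
# Minimal sketch: 80-dimensional filterbank (fbank80) features for one speech segment.
import torchaudio

waveform, sample_rate = torchaudio.load("segment.wav")        # placeholder file name
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80,
                                          sample_frequency=sample_rate)
print(fbank.shape)   # (number_of_frames, 80)
```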
Converting voice data corresponding to the corrected voice period for each voice character into text data, comprising: converting voice data corresponding to the corrected voice time period of the voice character into text data by adopting a text conversion model;
the text conversion model is obtained by the following steps:
acquiring third training data;
Inputting third training data into a pre-created third neural network, and acquiring a text conversion model when the loss value of the third neural network meets a third preset condition;
the third training data is voice data, and the voice data is marked with corresponding text data; the third neural network is used for converting voice data into text data;
the third neural network includes: a plurality of encoder layers, a first intermediate loss function layer, a first loss function layer, a plurality of decoder layers, a second intermediate loss function layer, a second loss function layer, and a total loss function layer;
the encoder layer comprises a convolution-enhanced Transformer (Conformer) layer or a Transformer layer, and/or the decoder layer comprises a Transformer layer or a long short-term memory (LSTM) layer;
an encoder layer for converting input voice data into vectors;
and the decoder layer is used for extracting text characteristics of the received input data.
The voice dialogs of a plurality of voice roles in a period of time (for example, 5000 hours) can be pre-recorded, the voice dialogs are segmented, corresponding text data is marked on each voice segment, the text data is the voice content (or called voice data) of each voice segment, and the voice dialogs marked are used as third training data.
A third neural network having a voice-to-text conversion function may be created in advance to achieve end-to-end voice recognition; the goal of speech recognition is to convert the vocabulary content in human speech into text content, and the end-to-end speech recognition uses a pure neural network instead of the traditional mixed and separate training method of an alignment model, an acoustic model, a language model and the like.
As shown in fig. 5, the third neural network 30 includes: a plurality of encoder layers 31, a first intermediate loss function layer 32, a first loss function layer 33, a plurality of decoder layers 34, a second intermediate loss function layer 35, a second loss function layer 36, and a total loss function layer 37;
the plurality of encoder layers 31 are sequentially connected in series, the plurality of decoder layers 34 are sequentially connected in series, and a first output port of the encoder layer 31 positioned at the last bit of the plurality of encoder layers 31 is connected with an input port of the decoder layer 34 positioned at the first bit of the plurality of decoder layers 34;
the input port of the encoder layer 31 located at the first bit of the plurality of encoder layers 31 is used as the input port for externally inputting data, the first output port of the encoder layer 31 located at the last bit of the plurality of encoder layers 31 is connected with the input port of the first loss function layer 33, and the output port of the first loss function layer 33 is connected with the input port of the total loss function layer 37;
the second output ports of the encoder layers 31 other than the first encoder layer 31 and the last encoder layer 31 among the plurality of encoder layers 31 are connected to the input port of the first intermediate loss function layer 32, and the output port of the first intermediate loss function layer 32 is connected to the input port of the total loss function layer 37;
the first output port of the decoder layer 34 located at the last one of the plurality of decoder layers 34 is connected to the input port of the second loss function layer 36, and the output port of the second loss function layer 36 is connected to the input port of the total loss function layer 37;
the second output ports of the decoder layers 34 other than the first decoder layer 34 and the last decoder layer 34 among the plurality of decoder layers 34 are connected to the input port of the second intermediate loss function layer 35, and the output port of the second intermediate loss function layer 35 is connected to the input port of the total loss function layer 37;
the encoder layer 31 includes a convolution-enhanced Transformer (Conformer) layer or a Transformer layer, and/or the decoder layer 34 includes a Transformer layer or a long short-term memory (LSTM) layer;
an encoder layer 31 for converting input data into vectors;
a decoder layer 34 for extracting text features from the input data;
A first intermediate loss function layer 32 for calculating loss values of the encoder layers 31 other than the encoder layer 31 located at the first bit and the encoder layer 31 located at the last bit among the plurality of encoder layers 31;
a first loss function layer 33 for calculating loss values of the plurality of encoder layers 31;
a second intermediate loss function layer 35 for calculating loss values of decoder layers 34 other than the decoder layer 34 located at the first bit and the decoder layer 34 located at the last bit among the plurality of decoder layers 34;
a second loss function layer 36 for calculating loss values of the plurality of decoder layers;
the total loss function layer 37 is used for calculating the total loss value of the first intermediate loss function layer 32, the first loss function layer 33, the second intermediate loss function layer 35 and the second loss function layer 36.
The Conformer model is a model with a combined convolution-attention mechanism and can be regarded as a convolution-enhanced Transformer. The Conformer combines the Transformer model with a CNN (Convolutional Neural Network): the Transformer is good at capturing content-based global interactions, while the CNN effectively exploits local features. The Conformer therefore models long-range global interaction information and local features better, can model the audio sequence both locally and globally, and uses convolution operations and the self-attention mechanism to enhance representation learning, so that local features and global representations are accurately embedded into each other.
By setting the first intermediate loss function layer 32 and the second intermediate loss function layer 35, the third neural network can acquire the intermediate-layer losses, so that intermediate-layer information is exploited and the text conversion accuracy is higher.
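For illustration, the sketch below shows one common way such a total loss can be assembled from the final and intermediate encoder/decoder losses; the simple weighted sum and the 0.3 weight are assumptions, not values from this application.

```python
# Sketch: combine final and intermediate-layer losses into one total loss.
def total_loss(enc_final_loss, enc_mid_losses, dec_final_loss, dec_mid_losses,
               mid_weight=0.3):
    mid = sum(enc_mid_losses) + sum(dec_mid_losses)      # intermediate-layer contributions
    return enc_final_loss + dec_final_loss + mid_weight * mid

loss = total_loss(enc_final_loss=1.20, enc_mid_losses=[1.50],
                  dec_final_loss=0.80, dec_mid_losses=[1.10])
print(loss)   # 1.2 + 0.8 + 0.3 * (1.5 + 1.1), approximately 2.78
```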
The plurality of encoder layers 31 may include: the encoder layers 31-1, …, the encoder layers 31-m, m being positive integers greater than 1, e.g., m=16, the plurality of encoder layers 31 includes: encoder layers 31-1, …, encoder layers 31-8, …, encoder layers 31-16.
The plurality of decoder layers 34 may include: decoder layers 34-1, …, decoder layers 34-p, p being a positive integer greater than 1, e.g., p=16, then the plurality of decoder layers 34 includes: decoder layers 34-1, …, decoder layers 34-8, …, decoder layer 34-16.
The encoder layer 31 in the third neural network 30 may be implemented using a Conformer model, as shown in FIG. 6, which is a schematic diagram of the third neural network 30 using a Conformer model as an encoder in the embodiments disclosed herein. By adding the Conformer model, the local features and the global representation can be accurately embedded into each other, and the text conversion accuracy is further improved.
The third neural network 30 is trained by the third training data, and after the loss converges (i.e., the loss value of the third neural network satisfies the third preset condition), the model can be saved as a text conversion model.
After the text conversion model outputs text data corresponding to the voice data of the audio, the text data may be input into a character recognition model, through which character types of one or more voice characters in the audio are recognized from the text data.
The character recognition model may be pre-created prior to entering text data into the character recognition model; the method for creating the character recognition model can comprise the following steps:
acquiring fourth training data;
inputting the fourth training data into a fourth neural network which is created in advance, and acquiring a role recognition model when the loss value of the fourth neural network meets a fourth preset condition;
the fourth training data is text data marked with character type information; the fourth neural network realizes the type identification and division of the text data based on the text classification layer and the loss function layer.
The text data about the voice dialogue content can be obtained in advance, corresponding role type labels are marked on different voice dialogue contents in the text data, and the marked text data are used as fourth training data.
A fourth neural network having a role classification function may be created in advance, a text classification layer in the fourth neural network may be implemented using a textCNN (text convolutional neural network) model, and a loss function layer may be implemented using a softmax model.
Compared with a conventional CNN for images, textCNN has no change in network structure and is even simpler: it has only one convolution layer and one max-pooling layer, and finally the output is fed to a softmax layer for classification.
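The following minimal PyTorch sketch illustrates a textCNN-style role classifier with one convolution layer, one max-pooling step, and a softmax output; the vocabulary size, embedding dimension, and kernel width are illustrative assumptions.

```python
# Sketch: textCNN-style classifier for role types (e.g. agent vs. customer).
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=8000, emb_dim=128, n_filters=64, n_roles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3)   # single convolution layer
        self.fc = nn.Linear(n_filters, n_roles)                    # followed by softmax for classification

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = self.conv(x).relu().max(dim=2).values   # max pooling over time
        return self.fc(pooled).softmax(dim=-1)           # role-type probabilities

probs = TextCNN()(torch.randint(0, 8000, (4, 50)))       # four utterances, 50 tokens each
```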
The fourth neural network is trained by the fourth training data, and after the loss converges (i.e., the loss value of the fourth neural network meets the fourth preset condition), the model can be saved as a character recognition model.
The role recognition model recognizes the text data of the input voice data, so that the role type information of the voice role corresponding to each voice period in the audio can be obtained (for example, whether the voice role is an agent or a customer), achieving accurate recognition of the voice roles.
In the embodiments of the application, the speaker segmentation and clustering model, the voice endpoint detection model, the text conversion model, and the role recognition model together realize role type recognition for audio. The scheme is simple and easy to implement, makes role division and type recognition possible for mono audio, and achieves high recognition accuracy.
In a quality inspection scenario, after the role type information of a piece of audio is acquired through the scheme of the application, the voice content and/or the corresponding text content of the voice periods belonging to the agent can be extracted from the audio to be inspected, and the relevant quality-inspection rules can be used to analyze that content and judge whether the agent's speech complies with the regulations, thereby realizing quality inspection of the agent's speech.
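Purely as an illustration of how the recognized agent text could feed a rule-based quality check, the sketch below applies hypothetical required and forbidden phrases; the rules shown are placeholders, not quality-inspection rules from this application.

```python
# Hypothetical rule check on the agent's recognized text.
REQUIRED_PHRASES = ["hello", "is there anything else i can help you with"]
FORBIDDEN_PHRASES = ["that is not my problem"]

def inspect_agent_text(agent_utterances):
    text = " ".join(agent_utterances).lower()
    missing = [p for p in REQUIRED_PHRASES if p not in text]
    violations = [p for p in FORBIDDEN_PHRASES if p in text]
    return {"compliant": not missing and not violations,
            "missing": missing, "violations": violations}

print(inspect_agent_text(["Hello, thanks for calling.",
                          "Is there anything else I can help you with?"]))
```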
Fig. 7 is a block diagram illustrating a voice character recognition apparatus 700 according to an embodiment of the present application. Referring to fig. 7, the voice character recognition apparatus 700 includes:
the role division module 701 is configured to obtain at least one voice role in the audio to be processed and a voice period corresponding to the voice role, where each voice role is used to represent one speaker in the audio to be processed;
the endpoint detection module 702 is configured to identify voice data in the audio to be processed, and perform time positioning on the voice data to obtain a target voice period;
the period correction module 703 is configured to correct the voice period corresponding to each voice role according to the target voice period, so as to obtain a corrected voice period corresponding to each voice role;
the role recognition module 704 is configured to convert the voice data corresponding to the corrected voice period of each voice role into text data, so as to obtain speaking text data of each voice role; and to identify the role type corresponding to each voice role according to the speaking text data of each voice role, where the role type is used to represent the identity of the speaker.
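For orientation only, the cooperation of these four modules could be sketched as follows; the model interfaces are placeholders, and the period correction keeps only the periods of each voice role that fall inside a detected target voice period, which is one plausible reading of the correction described above:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Period = Tuple[float, float]  # (start_sec, end_sec)

@dataclass
class SpeechRecognitionApparatus:
    # Placeholder callables standing in for the trained models of this application.
    diarize: Callable[[bytes], Dict[str, List[Period]]]   # role division module
    detect_speech: Callable[[bytes], List[Period]]        # endpoint detection module
    transcribe: Callable[[bytes, Period], str]            # used by the role recognition module
    classify_role: Callable[[str], str]                   # role type from speaking text

    def correct_periods(self, role_periods, target_periods):
        """Period correction module: drop periods not contained in any target voice period."""
        def contained(p):
            return any(t[0] <= p[0] and p[1] <= t[1] for t in target_periods)
        return {role: [p for p in periods if contained(p)]
                for role, periods in role_periods.items()}

    def run(self, audio: bytes):
        role_periods = self.diarize(audio)
        target_periods = self.detect_speech(audio)
        corrected = self.correct_periods(role_periods, target_periods)
        result = {}
        for role, periods in corrected.items():
            text = " ".join(self.transcribe(audio, p) for p in periods)
            result[role] = {"role_type": self.classify_role(text),
                            "periods": periods, "text": text}
        return result
```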
Fig. 8 is a block diagram of an electronic device 800 according to an embodiment of the present application.
Referring to fig. 8, an embodiment of the present application provides an electronic device 800, the electronic device 800 including:
at least one processor 801; and
a memory 802 communicatively coupled to the at least one processor 801; wherein,
the memory 802 stores one or more computer programs executable by the at least one processor 801, the one or more computer programs being executed by the at least one processor 801 to enable the at least one processor 801 to perform the speech recognition method described above.
In the embodiment of the present application, the electronic device 800 may further include: one or more I/O (input/output) interfaces 803 coupled between the processor 801 and the memory 802.
The embodiment of the present application also provides a computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor/processing core, implements the above-mentioned speech recognition method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present application also provide a computer program product comprising computer readable code, or a second non-volatile computer readable storage medium carrying the computer readable code, which, when run on a processor of an electronic device, causes the processor to perform the above-described speech recognition method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Static Random Access Memory (SRAM), flash memory or other memory technology, portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information for computer readable program instructions, which may execute the computer readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will therefore be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the present application as set forth in the following claims.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring at least one voice role in audio to be processed and a voice period corresponding to the voice role, wherein each voice role is used for representing one speaker in the audio to be processed;
identifying voice data in the audio to be processed, and performing time positioning on the voice data to obtain a target voice period;
correcting the voice period corresponding to each voice role according to the target voice period to obtain a corrected voice period corresponding to each voice role;
converting the voice data corresponding to the corrected voice period of each voice role into text data to obtain speaking text data of each voice role; and identifying the role type corresponding to each voice role according to the speaking text data of each voice role, wherein the role type is used for representing the identity of a speaker.
2. The speech recognition method according to claim 1, wherein the obtaining at least one voice role in the audio to be processed and a voice period corresponding to the voice role comprises:
performing frame-level segmentation on the audio to be processed through a preset speaker segmentation and clustering model to obtain multi-frame data;
extracting a voiceprint feature of each frame of data through the speaker segmentation and clustering model, and clustering data frames whose voiceprint feature similarity is greater than or equal to a preset similarity threshold into one class, wherein the multi-frame data in the same class corresponds to the same voice role;
and detecting the starting time and/or the ending time of each frame of data in the multi-frame data corresponding to each voice role through the speaker segmentation and clustering model, to obtain the at least one voice role and the voice period corresponding to the voice role.
3. The speech recognition method according to claim 1, wherein the identifying voice data in the audio to be processed and performing time positioning on the voice data to obtain a target voice period comprises:
performing frame-level segmentation on the audio to be processed through a preset voice endpoint detection model to obtain multi-frame data;
identifying voice data from the multi-frame data through the voice endpoint detection model, and detecting the starting time and/or ending time of each frame of the voice data;
and obtaining the target voice period according to the starting time and/or the ending time of each frame of the voice data through the voice endpoint detection model.
4. The speech recognition method according to claim 2 or 3, wherein the correcting the voice period corresponding to each voice role according to the target voice period to obtain the corrected voice period corresponding to each voice role comprises:
comparing the voice period corresponding to each voice role with the target voice period;
and deleting, from the voice periods corresponding to each voice role, the voice periods that are not included in the target voice period, to obtain the corrected voice period corresponding to each voice role.
5. The speech recognition method according to claim 2, wherein the speaker segmentation and clustering model is obtained by:
acquiring first training data;
inputting the first training data into a pre-created first neural network, and acquiring the speaker segmentation and clustering model when the loss value of the first neural network meets a first preset condition;
the first training data is audio data, and each frame of data in the audio data is marked with a corresponding voice role and first time data; the first time data includes: start time and/or end time; the first neural network is used for classifying voice roles of the input audio data at the frame level and determining the voice period corresponding to each voice role.
6. The speech recognition method according to claim 3, wherein the voice endpoint detection model is obtained by:
acquiring second training data;
inputting the second training data into a pre-created second neural network, and acquiring the voice endpoint detection model when the loss value of the second neural network meets a second preset condition;
wherein the second training data is audio data marked with second time data; the second training data is frame-level data; the second time data includes: start time and/or end time; the second neural network is used for performing time positioning on voice data in the input audio data.
7. The speech recognition method according to claim 1, wherein the converting the voice data corresponding to the corrected voice period of each voice role into text data comprises:
converting the voice data corresponding to the corrected voice period of the voice role into text data by using a text conversion model;
the text conversion model is obtained by the following steps:
acquiring third training data;
inputting the third training data into a pre-created third neural network, and acquiring the text conversion model when the loss value of the third neural network meets a third preset condition;
the third training data is voice data, and the voice data is marked with corresponding text data; the third neural network is used for converting voice data into text data.
8. A speech recognition apparatus, comprising:
the role division module is used for acquiring at least one voice role in the audio to be processed and a voice period corresponding to the voice role, wherein each voice role is used for representing one speaker in the audio to be processed;
the endpoint detection module is used for identifying voice data in the audio to be processed and performing time positioning on the voice data to obtain a target voice period;
the period correction module is used for correcting the voice period corresponding to each voice role according to the target voice period to obtain a corrected voice period corresponding to each voice role;
the role recognition module is used for converting the voice data corresponding to the corrected voice period of each voice role into text data to obtain the speaking text data of each voice role; and identifying the role type corresponding to each voice role according to the speaking text data of each voice role, wherein the role type is used for representing the identity of a speaker.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the speech recognition method according to any one of claims 1-7.
CN202310772696.9A 2023-06-27 2023-06-27 Speech recognition method and device, electronic equipment and storage medium Pending CN117496983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310772696.9A CN117496983A (en) 2023-06-27 2023-06-27 Speech recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310772696.9A CN117496983A (en) 2023-06-27 2023-06-27 Speech recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117496983A true CN117496983A (en) 2024-02-02

Family

ID=89681611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310772696.9A Pending CN117496983A (en) 2023-06-27 2023-06-27 Speech recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117496983A (en)

Similar Documents

Publication Publication Date Title
Kumar et al. Deep learning based assistive technology on audio visual speech recognition for hearing impaired
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
WO2019227579A1 (en) Conference information recording method and apparatus, computer device, and storage medium
US8280733B2 (en) Automatic speech recognition learning using categorization and selective incorporation of user-initiated corrections
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN111785275A (en) Voice recognition method and device
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
KR100911429B1 (en) Apparatus and Method for generating noise adaptive acoustic model including Discriminative noise adaptive training for environment transfer
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN112908301B (en) Voice recognition method, device, storage medium and equipment
CN111951796A (en) Voice recognition method and device, electronic equipment and storage medium
CN112217947A (en) Method, system, equipment and storage medium for transcribing text by customer service telephone voice
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN114530141A (en) Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof
Lin et al. Optimizing Voice Activity Detection for Noisy Conditions.
CN115881164A (en) Voice emotion recognition method and system
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
EP4068279A1 (en) Method and system for performing domain adaptation of end-to-end automatic speech recognition model
CN117496983A (en) Speech recognition method and device, electronic equipment and storage medium
CN116186258A (en) Text classification method, equipment and storage medium based on multi-mode knowledge graph
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
Andra et al. Improved transcription and speaker identification system for concurrent speech in Bahasa Indonesia using recurrent neural network
US11183179B2 (en) Method and apparatus for multiway speech recognition in noise
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination