CN115831125A

CN115831125A - Speech recognition method, device, equipment, storage medium and product

Info

Publication number: CN115831125A
Application number: CN202211665754.XA
Authority: CN
Inventors: 狄东林; 崔晟嘉; 张钋
Original assignee: Baidu com Times Technology Beijing Co Ltd
Current assignee: Baidu com Times Technology Beijing Co Ltd
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-03-21

Abstract

The present disclosure provides a speech recognition method, apparatus, device and storage medium, relating to the technical field of artificial intelligence, in particular to the technical fields of deep learning and speech. The speech recognition method comprises the following steps: acquiring voice data; extracting acoustic feature vectors of the voice data according to at least two frame lengths; clustering the acoustic feature vectors to obtain speaker tags and time tags corresponding to the speaker tags; and recognizing the voice data according to the speaker tags and the time tags to generate a speech recognition result carrying the speaker tags and the time tags. The method can improve the precision with which different speakers in the voice data are distinguished, improve the accuracy and reliability of the speech recognition result, and can be applied to real-time speech recognition scenarios such as conferences and interviews.

Description

Speech recognition method, device, equipment, storage medium and product

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and speech technologies, and more particularly to a speech recognition method, apparatus, device, storage medium, and computer program product.

Background

With the rapid development of science, technology and society, the exchange of information among people has become more frequent and more important. In scenarios such as seminars, training classes, internal conferences, product exhibitions, surveys and interviews, recording and organizing what was said consumes a large amount of human resources and is inefficient.

The basic function of the voice-to-text product is to automatically transcribe the voice content in scenes such as a conference and the like into a text, thereby greatly reducing the communication content arrangement cost and improving the recording efficiency.

Disclosure of Invention

The present disclosure provides a voice recognition method, apparatus, device, storage medium, and computer program product, which can be applied to a multi-person voice scene such as an intelligent conference, interview, and the like, to improve the ability to distinguish speakers.

According to a first aspect of the present disclosure, there is provided a speech recognition method comprising:

acquiring voice data;

extracting acoustic feature vectors of the voice data according to at least two frame lengths;

clustering the acoustic feature vectors to obtain speaker tags and time tags corresponding to the speaker tags;

and recognizing the voice data according to the speaker tag and the time tag to generate a voice recognition result with the speaker tag and the time tag.

According to a second aspect of the present disclosure, there is provided a speech recognition apparatus comprising:

an acquisition module configured to acquire voice data;

an extraction module configured to extract acoustic feature vectors of the speech data according to at least two frame lengths;

a clustering module configured to cluster the acoustic feature vectors to obtain speaker tags and time tags corresponding to the speaker tags;

and a recognition module configured to recognize the voice data according to the speaker tag and the time tag and generate a voice recognition result with the speaker tag and the time tag.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method provided by the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as provided by the first aspect.

According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method provided according to the first aspect.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 illustrates an exemplary system architecture to which the speech recognition methods of the present disclosure may be applied;

FIG. 2 illustrates a flow diagram of one embodiment of a speech recognition method according to the present disclosure;

FIG. 3 shows a schematic diagram of the process of short-time Fourier transform in a speech recognition method according to the present disclosure;

FIG. 4 is a graph illustrating the effect of time domain resolution on frequency domain resolution in a short time Fourier transform;

FIG. 5 illustrates a comparison of a speech recognition method in the related art and a recognition process of the speech recognition method of the present disclosure;

FIG. 6 shows a flow chart of a second embodiment of a speech recognition method according to the present disclosure;

FIG. 7 illustrates a schematic structural diagram of one embodiment of a speech recognition apparatus according to the present disclosure;

FIG. 8 shows a block diagram of an electronic device for implementing the speech recognition method of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Speech-to-text transcription products can automatically transcribe the content of seminars, training sessions, internal conferences, product demonstrations, surveys, interviews and the like into text, which reduces the cost of organizing conference content and improves the efficiency of conference recording. A plain speech transcription product, however, has an important defect: it cannot automatically distinguish speakers. Because the utterances of different speakers are mixed together, the conference recorder has to reorganize the transcription results manually. More importantly, if the utterances of different speakers are not distinguished, the intentions, standpoints, instructions and the like of the different speakers cannot be analyzed, which greatly limits the application of natural language understanding technology.

Some voice transcription products on the market have added a function for automatically distinguishing speakers. However, some of these products distinguish speakers poorly, while others require specific hardware support, which makes them expensive, greatly restricts the conditions under which they can be used, and makes them difficult to adopt in many scenarios.

Speech technology for conference or interview scenarios needs to provide speech recognition that can distinguish speakers, so that different speakers are distinguished at the text level. In essence, this requires labeling the different parts of the audio according to the voiceprint characteristics of the speakers and forming timestamps for the speaking periods of the different speakers, which then guide the speech recognition.

Voiceprint segmentation and clustering technology is used to distinguish the voices of different speakers. In the prior art, depending on the application scenario, it is mostly implemented using a speech separation method, a clustering algorithm, or a voiceprint extraction model.

Methods based on speech separation typically require a segment of registered speaker audio as a reference to separate out specific sounds in the original audio. This results in the method being unable to be applied in a scenario where the number and identity of speakers cannot be determined.

Most of the existing clustering algorithms take the segments obtained by complete recording as input, and cannot process real-time audio streams, so that speaker differentiation cannot be performed in real time in scenes such as conferences, interviews and the like.

Existing voiceprint extraction models use a neural network to extract audio features. Although many innovations have been made in network structure and training strategy, MFCCs (Mel-Frequency Cepstral Coefficients) and spectrograms are still used as the input of the neural network. Extracting a spectrogram requires framing the input audio, and the length of each frame determines the resolution of the spectrogram in the time domain and the frequency domain. If the audio frame is too long, the minimum frequency that can be captured is reduced and a more accurate spectrum is obtained, but the number of frames decreases, which reduces the information in the time domain. Spectrograms in the prior art are limited by a fixed frame length, so acoustic information cannot be completely preserved, which in turn affects the feature extraction capability of the neural network.

The present disclosure provides a speech recognition method that divides the voice data using at least two frame lengths to obtain at least two spectrograms with different frequency-domain resolutions, and aggregates them into an enhanced spectrogram. The acoustic feature vectors corresponding to the enhanced spectrogram are then clustered to obtain speaker tags and time tags corresponding to the speaker tags, so that the voice data is marked with the speaker tags and the corresponding time tags. The voice data is then recognized to obtain a speech recognition result marked with the speaker tags and the corresponding time tags. In this way, speakers are effectively distinguished in the speech recognition result, an accurate record of the conversation is formed, and speech recognition accuracy in real-time conversation scenarios is improved.

In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs. The user information obtained in this embodiment is not directed at any specific user and cannot reflect the personal information of any specific user.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the speech recognition method or speech recognition apparatus of the present disclosure may be applied.

As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is used to provide communication links between terminal devices 101 and server 103, and may include various types of connections, such as wired communication links, wireless communication links, or fiber optic cables, among others.

A user may use terminal device 101 to interact with server 103 over network 102 to receive or transmit information or the like. Illustratively, various client applications may be installed on the terminal device 101. The user can send a voice recognition request to the server 103 through the terminal device 101, or send voice data to be recognized, or create a voice recognition task on the server 103 through the terminal device 101, and can also obtain information such as a voice recognition result on the server 103 through the terminal device 101.

The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices, including, but not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, and the like. When the terminal device 101 is software, it may be installed in the above-described electronic devices, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. This is not particularly limited herein.

The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

The speech recognition method provided by the embodiment of the present disclosure is generally executed by the server 103, and accordingly, the speech recognition apparatus is generally disposed in the server 103.

It should be noted that the numbers of terminal devices 101, networks 102, and servers 103 in fig. 1 are merely illustrative. There may be any number of terminal devices 101, networks 102, and servers 103, as required by the implementation.

In the embodiment of the present disclosure, the speech recognition method is executed by the server 103, which sends the speech recognition result to the terminal device 101 on which the client is installed; for example, the result is sent in text form or another visual form to the client of the terminal device 101, or to a display page such as a browser page, for presentation.

Fig. 2 shows a flow 200 of an embodiment of a speech recognition method according to the present disclosure, which, referring to fig. 2, comprises the steps of:

Step S201, acquiring voice data.

In the embodiment of the present disclosure, an execution subject of the voice recognition method, for example, the server 103 shown in fig. 1, acquires voice data to be recognized. The voice data may be formed or transmitted in real time, or may be existing complete voice data, which is not limited herein.

The execution body may acquire the voice data in various ways. For example, voice data may be acquired in real time by one or more voice capture devices (e.g., microphones) at a conference or interview site; as another example, voice data may be acquired in real time during a communication process (such as a voice call or video call) through a smart device; as a further example, voice data transmitted or stored by a terminal may be acquired via a network or the like.

In the embodiment of the present disclosure, the voice data may be voice data that has not been processed after collection, or voice data that has undergone preliminary preprocessing during or after collection. The preliminary preprocessing may include noise reduction, voice enhancement and the like to improve the audio quality of the voice data; it may also include filtering out silent segments in which no speech is present, and the like.

In some optional implementations, functional modules for noise reduction, speech enhancement, and the like may be provided on the speech acquisition device. For example, voice data collected in a market, a restaurant, a station, or other scenes is noisy, and may be collected by a voice collecting device with a noise reduction or voice enhancement function, or may be preprocessed by software or hardware with a noise reduction, voice enhancement, or other function after collection and before output. The noise reduction and enhancement may be optionally performed on the voice data according to the actual voice scene or the requirement on the recognition result, and the like, which is not particularly limited.

In some optional implementations, audio activity detection may be performed using an interface in an audio processing library (for example, librosa) to extract the active audio portions of the voice data and filter out the silent portions in which no speech is present.
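As a rough illustration of filtering out silent portions, the following sketch keeps only those frames whose root-mean-square energy reaches a threshold. It is a dependency-free stand-in and not the librosa interface mentioned above; the function name `drop_silence`, the frame size and the threshold are assumptions chosen for illustration.

```python
import math


def drop_silence(samples, frame_size=4, threshold=0.1):
    """Keep frames whose RMS energy is at least `threshold`.

    Returns (kept_samples, kept_intervals), where each interval is a
    (start, end) pair of sample indices into the original signal.
    All parameter values are illustrative assumptions.
    """
    kept, intervals = [], []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
            end = start + len(frame)
            # Merge with the previous interval when the two are adjacent.
            if intervals and intervals[-1][1] == start:
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return kept, intervals


# Toy signal: silence, two active frames, then near-silence.
signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.4, -0.4,
          0.3, -0.3, 0.2, -0.2, 0.0, 0.01, 0.0, 0.0]
active, spans = drop_silence(signal)
```

Returning the kept intervals alongside the samples preserves the original time positions, which matter later when time tags are attached to speakers.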

In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs. The user information obtained in this embodiment is not directed at any specific user and cannot reflect the personal information of any specific user.

Step S202, extracting acoustic feature vectors of the voice data according to at least two frame lengths.

In the embodiment of the present disclosure, an execution subject of the speech recognition method, for example, the server 103 shown in fig. 1, extracts the acoustic feature vector of the speech data acquired in step S201 according to at least two frame lengths.

In this disclosure, the execution body divides the voice data according to at least two different frame lengths to form voice segments of different frame lengths, extracts the voiceprint features of each voice segment, and then aggregates the voiceprint features corresponding to the voice segments of different frame lengths to obtain the acoustic feature vector of the voice data. Because voiceprint features are obtained for voice segments of different frame lengths, both the time-domain resolution and the frequency-domain resolution of the voice segments are taken into account, which effectively improves the accuracy of the acoustic feature vectors.

Illustratively, the acoustic feature vectors of the voice data include two-dimensional feature vectors carrying time-domain and frequency-domain information, e.g., spectrograms.

In some optional implementations of the embodiments of the present disclosure, extracting acoustic feature vectors of speech data according to at least two frame lengths includes: respectively dividing the voice data according to at least two frame lengths to generate at least two frame data sets, wherein the frame data sets correspond to the frame lengths; and extracting acoustic feature vectors of the voice data according to the at least two frame data sets.

In some optional implementations, the voice data may be framed using a short-time Fourier transform; the voiceprint information of each frame of data is extracted in units of the frame length and then aggregated into an acoustic feature vector, for example, an enhanced spectrogram.

Fig. 3 shows a schematic diagram of a short-time Fourier transform process 300. As shown in fig. 3, the voice data is divided into equal parts according to a given framing length, a Fourier transform is then performed on the framed data frame by frame to extract the corresponding voiceprint information, and the Fourier transform results are then stacked frame by frame to generate an acoustic feature vector of the voice data.
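The divide, transform and stack sequence described for fig. 3 can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: it uses a naive discrete Fourier transform, non-overlapping frames and no window function, none of which the text specifies, and the name `stft_magnitudes` is hypothetical. A practical implementation would more likely rely on an FFT routine from a numerical library.

```python
import cmath
import math


def stft_magnitudes(samples, frame_len):
    """Split `samples` into equal, non-overlapping frames of `frame_len`
    samples, apply a naive DFT to each frame, and stack the magnitude
    spectra frame by frame. The last frame is zero-padded if short.
    """
    spectrogram = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad the final frame
        bins = []
        # Only the first frame_len // 2 + 1 bins are unique for real input.
        for k in range(frame_len // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            bins.append(abs(coeff))
        spectrogram.append(bins)
    return spectrogram  # shape: [num_frames][frame_len // 2 + 1]


# A toy signal: a sinusoid completing one cycle every 4 samples.
toy = [math.sin(2 * math.pi * n / 4) for n in range(16)]
spec_short = stft_magnitudes(toy, frame_len=4)  # 4 frames, 3 bins each
spec_long = stft_magnitudes(toy, frame_len=8)   # 2 frames, 5 bins each
```

Calling the function again with a different `frame_len` corresponds to changing the framing length and repeating the process: the shorter frames give more time steps with fewer frequency bins, and the longer frames give the opposite.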

Then, the execution body changes the framing length and repeats the process. After the framing length has been changed at least once and the process has been completed for each length, the execution body aggregates the acoustic feature vectors extracted according to the respective framing lengths.

Because the voiceprint information differs from part to part in voice data formed in scenarios such as a conversation among multiple speakers, applying a Fourier transform to the whole voice data would filter out most of the local information, whereas the short-time Fourier transform largely preserves the local information in the whole audio. The framing length is adjusted by adjusting the time-length parameter of the short-time Fourier transform. For example, the framing length of the short-time Fourier transform may be set to 12.5 ms, 18.75 ms, 25 ms, and so on.

Illustratively, suppose the voice data is the six-character Chinese sentence "我们是中国人" ("we are Chinese"); dividing it according to different framing lengths generates different framed data sets. For example, the framed data set generated with a framing length of one character includes "我", "们", "是", "中", "国", "人"; the framed data set generated with a framing length of two characters includes "我们", "是中", "国人"; the framed data set generated with a framing length of three characters includes "我们是", "中国人"; and the framed data set generated with a framing length of six characters includes "我们是中国人".

The scheme divides the voice data using a plurality of different frame lengths, so that voiceprint information can be extracted from smaller local parts, from larger local parts, and from the voice data as a whole. In other words, voiceprint information is extracted at different frequency-domain resolutions and different time-domain resolutions, which effectively ensures the precision and accuracy of the extracted voiceprint information and, in turn, the accuracy of the acoustic feature vector.

It should be noted that, in the present scheme, the voice data is divided into frames of equal length for each framing length. Where the final segment is shorter than the framing length, it may be padded with spaces or blank content. For example, if "我们是中国人" ("we are Chinese") is divided with a framing length of four characters, the resulting framed data set includes "我们是中" and "国人" followed by two blanks, where each blank has the length of one Chinese character.
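The equal-length division with padding can be reproduced on the character-level example as follows. The helper name `split_equal` and the visible placeholder "□" used for the blank are assumptions for illustration; the text only says that the short final segment is filled with spaces or blank content.

```python
def split_equal(sequence, frame_len, pad):
    """Divide `sequence` into frames of exactly `frame_len` items,
    filling the final frame with `pad` if it would be short."""
    frames = []
    for start in range(0, len(sequence), frame_len):
        frame = sequence[start:start + frame_len]
        frames.append(frame + pad * (frame_len - len(frame)))
    return frames


sentence = "我们是中国人"  # "we are Chinese", six characters
# One framed data set per framing length, keyed by that length.
framed_sets = {n: split_equal(sentence, n, pad="□") for n in (1, 2, 3, 4, 6)}
```

Here `framed_sets[4]` holds the two frames "我们是中" and "国人□□", while the lengths that divide six evenly need no padding.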

Fig. 4 is a graph showing the trade-off between time-domain resolution and frequency-domain resolution in the short-time Fourier transform. As the figure shows, when the voice data is divided using a single frame length, the time-domain resolution and the frequency-domain resolution of the voice segments cannot both be accommodated, which affects the accuracy of the extracted acoustic feature vectors. In the embodiment of the disclosure, at least two frame lengths are used; for example, the short-time Fourier transform is configured with three time lengths of 12.5 ms, 18.75 ms and 25 ms to divide the voice data, yielding a framed data set for each frame length, so that both the frequency-domain resolution and the time-domain resolution of the voice data are effectively taken into account.

After the execution body has divided the voice data into at least two framed data sets, voiceprint information is extracted from the framed data sets corresponding to the at least two frame lengths respectively, yielding voiceprint information with different time-domain resolutions, and the extracted voiceprint information is then aggregated to form an acoustic feature vector of the voice data.
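To make the trade-off concrete, the sketch below computes, for each of the three example frame lengths, the number of samples per frame, the spacing between frequency bins, and the time step between frames. It assumes a 16 kHz sampling rate, non-overlapping frames and no zero padding, none of which is stated in the text; under those assumptions the bin spacing is simply the reciprocal of the frame duration.

```python
SAMPLE_RATE = 16_000  # Hz; an assumption, the text gives no sampling rate


def resolution(frame_ms, sample_rate=SAMPLE_RATE):
    """Return (samples per frame, frequency bin spacing in Hz,
    time step in ms) for one framing length, assuming
    non-overlapping frames and no zero padding."""
    samples = round(frame_ms * sample_rate / 1000)
    bin_spacing_hz = 1000.0 / frame_ms  # finer for longer frames
    time_step_ms = frame_ms             # coarser for longer frames
    return samples, bin_spacing_hz, time_step_ms


table = {ms: resolution(ms) for ms in (12.5, 18.75, 25.0)}
```

Under these assumptions the 12.5 ms frame has 200 samples and 80 Hz bin spacing, while the 25 ms frame has 400 samples and 40 Hz bin spacing: doubling the frame length halves the bin spacing but also halves the number of time steps, which is why no single frame length serves both resolutions.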

In some optional implementations of embodiments of the present disclosure, extracting an acoustic feature vector of speech data according to at least two sets of framed data includes: respectively extracting voiceprint information of each frame data in at least two frame data sets; and aggregating the voiceprint information of each frame data to obtain the acoustic feature vector of the voice data.

The voiceprint information corresponding to the framed data may first be aggregated separately for each framing length, after which the per-length aggregation results are aggregated again; alternatively, the voiceprint information corresponding to all the framed data may be aggregated directly in a single step.

Because the voice data is segmented using at least two different frame lengths, the voiceprint information extracted from different framed data sets has different time-frequency resolutions, that is, the voiceprint information differs in length along the time dimension. In some optional implementations, to prevent the mixing of voiceprint information of different time-domain resolutions from degrading the aggregation result, the execution body first aggregates, within each framed data set, the voiceprint information corresponding to framed data of the same frame length, and then aggregates the results of the different framed data sets to generate the acoustic feature vector of the voice data, thereby obtaining the enhanced spectrogram of the voice data. This implementation balances the fusion accuracy of the voiceprint information in the time domain and the frequency domain, and further ensures the accuracy of the acoustic feature vector.
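The text does not specify the aggregation operator, so the following is only one possible reading of the second aggregation stage, given that each per-frame-length spectrogram has already been formed. Each spectrogram is resampled to a common shape by nearest-neighbour lookup and the results are stacked as channels. The resampling scheme, the channel stacking and the names `resize_nearest` and `aggregate` are all assumptions made for illustration.

```python
def resize_nearest(spec, num_frames, num_bins):
    """Resample a [frames][bins] matrix to a target shape by
    nearest-neighbour lookup (a deliberately simple choice)."""
    src_frames, src_bins = len(spec), len(spec[0])
    return [
        [
            spec[t * src_frames // num_frames][f * src_bins // num_bins]
            for f in range(num_bins)
        ]
        for t in range(num_frames)
    ]


def aggregate(spectrograms):
    """Bring every spectrogram to the largest time and frequency
    extent found among them, then stack them as channels."""
    num_frames = max(len(s) for s in spectrograms)
    num_bins = max(len(s[0]) for s in spectrograms)
    return [resize_nearest(s, num_frames, num_bins) for s in spectrograms]


# Two toy spectrograms of the same audio at different framing lengths:
fine_time = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 frames x 2 bins
fine_freq = [[1, 2, 3, 4], [5, 6, 7, 8]]       # 2 frames x 4 bins
enhanced = aggregate([fine_time, fine_freq])   # 2 channels x 4 x 4
```

Upsampling to the largest extent, rather than downsampling to the smallest, keeps the finest time resolution of the short-frame spectrogram and the finest frequency resolution of the long-frame spectrogram in the combined result.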

Step S203, clustering the acoustic feature vectors to obtain speaker tags and time tags corresponding to the speaker tags.

In the embodiment of the present disclosure, an executing subject of the speech recognition method, for example, the server 103 shown in fig. 1, clusters the acoustic feature vectors extracted in step S202, and obtains a speaker tag and a time tag corresponding to the speaker tag.

The speaker tag is used for distinguishing and marking different speakers in the voice data; and the time labels corresponding to the speaker labels are used for marking the speaking time of different speakers.

In this embodiment, the executing entity performs clustering on the acoustic feature vectors extracted in step S202, and distinguishes voiceprint information of different speakers and speaking time of different speakers, thereby generating different speaker tags and time tags corresponding to the speaker tags.

In this embodiment, the acoustic feature vectors extracted according to at least two frame lengths in step S202 are clustered, and both the time domain resolution and the frequency domain resolution are taken into consideration, so that the accuracy of the speaker tag and the time tag can be effectively ensured.
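How clustering can yield both kinds of tag may be sketched as follows. The text does not name a clustering algorithm, so this example uses a simple greedy online scheme based on cosine similarity, chosen here only because it can process vectors as they arrive. It further assumes that each feature vector covers a fixed-duration segment of audio; the names `cluster_segments` and `to_tags`, the threshold and the segment duration are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(vectors, threshold=0.8):
    """Greedy online clustering: assign each vector to the most similar
    existing speaker centroid if the similarity reaches `threshold`,
    otherwise open a new speaker. Returns one integer label per vector."""
    centroids, counts, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(vec))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched centroid.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (v - c) / n for c, v in zip(centroids[best], vec)]
        labels.append(best)
    return labels


def to_tags(labels, segment_s):
    """Merge runs of equal labels into (speaker_tag, start_s, end_s)."""
    tags = []
    for i, label in enumerate(labels):
        start, end = i * segment_s, (i + 1) * segment_s
        if tags and tags[-1][0] == label:
            tags[-1] = (label, tags[-1][1], end)
        else:
            tags.append((label, start, end))
    return tags


vectors = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8], [1.0, 0.0]]
labels = cluster_segments(vectors)
tags = to_tags(labels, segment_s=0.5)
```

The cluster index plays the role of the speaker tag, and the merged start and end times play the role of the time tag corresponding to it.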

If the acoustic feature vector obtained by dividing the voice data with a single frame length and extracting the voiceprint information is taken to correspond to an ordinary spectrogram, then the acoustic feature vector obtained in the present scheme, by dividing the voice data with a plurality of frame lengths and extracting the voiceprint information, corresponds to an enhanced spectrogram. The two spectrograms can be compared through their clustering results. Table 1 below compares the clustering metrics of the ordinary spectrogram and the enhanced spectrogram in an exemplary embodiment.

Table 1. Comparison of clustering metrics for the ordinary and enhanced spectrograms

Comparison parameter	Confusion	DER	JER
Ordinary spectrogram	18.60%	41.93%	37.43%
Enhanced spectrogram	15.90%	39.23%	36.88%

Here, Confusion is the confusion rate of the voiceprint information of different speakers, that is, the percentage of the total effective voice duration of the voice data in which the speakers' voiceprint information is not correctly distinguished; DER (Diarization Error Rate, the speaker separation error rate) is the percentage of the total effective voice duration of the voice data accounted for by separation errors; and JER (Jaccard Error Rate) is the average of the separation error rates of the voiceprint information of all speakers in the voice data. Therefore, the smaller the values of these three parameters, the more accurate the clustering result of the spectrogram.
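As a simplified illustration of the first two metrics, the sketch below computes a frame-level confusion rate and DER from per-frame speaker labels. It uses the commonly applied decomposition of DER into missed speech, false alarm and speaker confusion, and assumes the hypothesis labels have already been mapped onto the reference labels; it is not the evaluation procedure behind Table 1, and JER is omitted because it additionally requires an optimal speaker mapping. The function name `confusion_and_der` is hypothetical.

```python
def confusion_and_der(reference, hypothesis):
    """Frame-level confusion rate and DER.

    `reference` and `hypothesis` hold one speaker label per frame, with
    None for non-speech. Hypothesis labels are assumed to be already
    mapped onto reference labels.
    """
    speech = sum(r is not None for r in reference)
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    missed = false_alarm = confused = 0
    for r, h in zip(reference, hypothesis):
        if r is not None and h is None:
            missed += 1          # speech that was not detected
        elif r is None and h is not None:
            false_alarm += 1     # non-speech labelled as speech
        elif r is not None and r != h:
            confused += 1        # speech given to the wrong speaker
    return confused / speech, (missed + false_alarm + confused) / speech


ref = ["A", "A", "A", "B", "B", None, "B", "B", "A", "A"]
hyp = ["A", "A", "B", "B", "B", "B", "B", None, "A", "A"]
confusion, der = confusion_and_der(ref, hyp)
```

In this toy case one of nine speech frames is attributed to the wrong speaker, and one missed frame plus one false-alarm frame bring the total error to three of nine.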

As can be seen from comparison in table 1, in the embodiment of the present disclosure, at least two frame lengths are used to segment the voice data, extract the acoustic feature vectors, and perform clustering, so as to effectively improve the accuracy of speaker identification in the voice data and improve the accuracy of speaker labels and time labels.

Step S204, recognizing the voice data according to the speaker tag and the time tag, and generating a voice recognition result with the speaker tag and the time tag.

In the embodiment of the present disclosure, the executing entity of the speech recognition method, for example, the server 103 shown in fig. 1, recognizes the speech data according to the speaker tag and the time tag generated in step S203, and generates a speech recognition result with the speaker tag and the time tag.

In the speech recognition result generated by this embodiment, the speaking contents of different speakers are distinguished by the speaker tags, so that the standpoints, attitudes and the like of the different speakers can be clearly told apart; meanwhile, the speaking times of different speakers are distinguished by the time tags, so that a complete record of the conversation is formed in which each contribution is recorded separately.

In this embodiment, when the voice data is recognized according to the speaker tag and the time tag, the complete voice data may be recognized directly; alternatively, the voice data may be divided into a plurality of parts according to the speaker tag or the time tag, the parts may then be recognized and marked one by one, and all the recognition results may then be combined into a whole.

In some optional implementations of the embodiment of the present disclosure, the execution body recognizes the voice data and generates text data corresponding to the voice data; it then marks the content of the text data according to the speaker tag and the time tag so as to distinguish the different speakers and speaking times, forming a voice recognition result with the speaker tag and the time tag.
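Marking already-recognized text with the tags could look like the sketch below. It assumes that the recognizer returns each word with its start and end time, which the text does not state, and assigns a word to the speaker whose time span contains the word's midpoint. The function name `label_words`, the tuple layouts and the midpoint rule are illustrative assumptions.

```python
def label_words(words, tags):
    """Attach a speaker tag to each timed word.

    `words` is a list of (text, start_s, end_s); `tags` is a list of
    (speaker_tag, start_s, end_s). A word goes to the tag whose time
    span contains the word's midpoint; None if no span does.
    """
    labelled = []
    for text, start, end in words:
        mid = (start + end) / 2
        speaker = next((s for s, t0, t1 in tags if t0 <= mid < t1), None)
        labelled.append((speaker, start, text))
    return labelled


tags = [("speaker_1", 0.0, 1.2), ("speaker_2", 1.2, 2.5)]
words = [("good", 0.1, 0.4), ("morning", 0.4, 1.0), ("hello", 1.3, 1.8)]
marked = label_words(words, tags)
```

Using the midpoint rather than the start time makes the assignment less sensitive to a word that begins just before a speaker change.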

In some optional implementations of the embodiments of the present disclosure, recognizing the speech data according to the speaker tag and the time tag, and generating the speech recognition result with the speaker tag and the time tag includes: segmenting voice data according to the speaker tag and the time tag to obtain statement data corresponding to the speaker tag and the time tag; recognizing statement data and generating a statement recognition result; and generating a voice recognition result with the speaker tag and the time tag according to the speaker tag, the time tag and the sentence recognition result.

The voice data may be segmented according to both the speaker tags and the time tags. Alternatively, the voice data may be segmented according to the time tags, generating sentence data that carries a speaker tag and corresponds to the different time tags; or the voice data may be segmented according to the speaker tags, generating sentence data that carries time tags and corresponds to the different speaker tags.

Then, the execution subject recognizes each sentence data divided to generate a corresponding sentence recognition result, and then integrates each sentence recognition result according to a certain rule, for example, according to a time sequence to generate a voice recognition result with a speaker tag and a time tag.
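The segment, recognize and integrate sequence can be sketched as follows. The recognizer is passed in as a callable, and `fake_recognize` is a placeholder standing in for a real speech recognizer; the function name `transcribe_by_tags`, the tag tuple layout and the toy sampling rate are assumptions made for illustration, with time order used as the integration rule mentioned above.

```python
def transcribe_by_tags(samples, sample_rate, tags, recognize):
    """Cut `samples` at each (speaker_tag, start_s, end_s) entry, run
    `recognize` on every piece, and return the results in time order."""
    results = []
    for speaker, start_s, end_s in tags:
        piece = samples[int(start_s * sample_rate):int(end_s * sample_rate)]
        results.append({
            "speaker": speaker,
            "start": start_s,
            "end": end_s,
            "text": recognize(piece),
        })
    return sorted(results, key=lambda r: r["start"])


def fake_recognize(piece):
    # Placeholder for a real recognizer: reports the piece length only.
    return f"<{len(piece)} samples>"


audio = [0.0] * 40  # 4 s of audio at a toy sampling rate of 10 Hz
tags = [("speaker_2", 1.5, 4.0), ("speaker_1", 0.0, 1.5)]
transcript = transcribe_by_tags(audio, 10, tags, fake_recognize)
```

Each entry of `transcript` keeps its speaker tag and time tag next to the recognized text, so the merged result reads as a time-ordered dialogue.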

In this implementation, the voice data is divided by using the speaker tag and the time tag, and the statement data obtained by the division is recognized piece by piece, so that the recognition accuracy of the voice data and the matching precision between the voice data, the speaker tag and the time tag are improved.
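The divide, recognize and merge flow of this implementation might look like the sketch below. The recognizer is passed in as a callable because the disclosure does not fix a particular recognition engine; `recognize_by_segment` and the `Segment` tuple layout are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[str, float, float]       # (speaker tag, start time tag, end time tag), seconds
Result = Tuple[str, float, float, str]   # segment fields plus the recognized text


def recognize_by_segment(
    samples: Sequence[float],
    sample_rate: int,
    segments: List[Segment],
    recognize: Callable[[Sequence[float]], str],
) -> List[Result]:
    """Cut the audio at the time tags, recognize each piece, and merge in time order."""
    results: List[Result] = []
    for speaker, start, end in segments:
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        results.append((speaker, start, end, recognize(chunk)))
    results.sort(key=lambda r: r[1])  # integrate the statement results by start time
    return results


# Usage with a placeholder recognizer that only reports the chunk size.
audio = [0.0] * 16
segments = [("speaker_2", 2.0, 4.0), ("speaker_1", 0.0, 2.0)]
print(recognize_by_segment(audio, 4, segments, lambda chunk: f"{len(chunk)} samples"))
```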

Fig. 5 shows a comparison between a speech recognition method in the related art and a recognition process of the speech recognition method of the present disclosure, in which (a) is the speech recognition process in the related art, and (b) is the recognition process of the speech recognition method of the present disclosure.

As shown in fig. 5, in the related art, voice data is divided according to a single frame length, and acoustic feature vectors are then extracted for clustering and recognition. In the present disclosure, a short-time Fourier transform is used to divide the voice data according to a plurality of different frame lengths, voiceprint information is extracted and fused to generate the acoustic feature vector, and clustering and recognition are then performed. By comparison, the acoustic feature vector generated by the speech recognition method of the present disclosure has higher accuracy and reliability.
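To illustrate the multi-frame-length idea, the following sketch frames a signal at several frame lengths, computes a windowed magnitude spectrum per frame, and concatenates the per-length averages into one vector. The disclosure does not specify how voiceprint information is extracted or fused, so the band averaging, the mean over frames, the half-frame hop and all function names here are assumptions chosen only to keep the example small.

```python
import cmath
import math
from typing import List, Sequence


def split_frames(signal: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Divide the signal into overlapping frames of one frame length."""
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Hann-windowed DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    win = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]
    return [
        abs(sum(w * cmath.exp(-2j * math.pi * k * i / n) for i, w in enumerate(win)))
        for k in range(n // 2 + 1)
    ]


def band_averages(spectrum: List[float], n_bands: int) -> List[float]:
    """Reduce a spectrum to n_bands values so every frame length yields the same size."""
    size = len(spectrum)
    out = []
    for b in range(n_bands):
        lo = b * size // n_bands
        hi = max(lo + 1, (b + 1) * size // n_bands)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out


def multi_scale_features(signal: Sequence[float], frame_lens: List[int], n_bands: int = 4) -> List[float]:
    """Concatenate averaged band features computed at each frame length."""
    feature: List[float] = []
    for frame_len in frame_lens:
        per_frame = [
            band_averages(magnitude_spectrum(f), n_bands)
            for f in split_frames(signal, frame_len, frame_len // 2)
        ]
        feature.extend(sum(col) / len(per_frame) for col in zip(*per_frame))
    return feature


signal = [math.sin(2 * math.pi * 0.1 * i) for i in range(64)]
print(len(multi_scale_features(signal, [16, 32])))
```

Short frames give finer time resolution and long frames give finer frequency resolution; concatenating both is one simple way to keep information from each scale.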

According to the voice recognition method provided by the embodiments of the present disclosure, the voice data is divided according to at least two frame lengths, acoustic feature vectors are extracted, and the acoustic feature vectors are then clustered. This can effectively improve the accuracy of the speaker tag and of the time tag corresponding to the speaker tag, and thus the precision with which different speakers in the voice data are distinguished. Recognizing the voice data according to the speaker tag and the time tag can in turn effectively improve the accuracy and reliability of the voice recognition result. The method can accurately recognize existing voice data and distinguish it by speaker, and can also accurately recognize and distinguish different speakers and their speech content in voice data generated in real time in scenes such as meetings and interviews, marking and recording the voice data according to the speaker tag and the time tag.

The speech recognition method provided by the present disclosure may have various ways of clustering the acoustic feature vectors, for example, the clustering process may be completed by a clustering algorithm or a clustering model (e.g., a neural network model).

In some optional embodiments, clustering the acoustic feature vectors to obtain speaker labels and time labels corresponding to the speaker labels includes: clustering the acoustic feature vectors by using a pre-trained clustering model to obtain a speaker label and a time label corresponding to the speaker label.

The clustering model may be a single neural network model or a neural network model composed of a plurality of sub-models.

The acoustic feature vectors are clustered through the pre-trained clustering model, so that the clustering efficiency can be improved, and the accuracy and reliability of clustering results are ensured.

FIG. 6 shows a flow 600 of an embodiment of a speech recognition method according to the present disclosure. In this embodiment, the clustering model includes at least a generation sub-model, an assignment sub-model, and a conversion sub-model. The generation sub-model is used for generating different speaker tags, the assignment sub-model is used for assigning sentences in the voice data to the speaker tags, and the conversion sub-model is used for determining the transition times between different speakers and generating the corresponding time tags.

Referring to fig. 6, the voice recognition method includes the steps of:

step S601, acquiring voice data.

Step S602, extracting an acoustic feature vector of the speech data according to at least two frame lengths.

In the embodiment of the present disclosure, the execution subject of the speech recognition method, for example, the server 103 shown in fig. 1, sequentially executes the above steps S601 to S602. Steps S601 to S602 are substantially the same as steps S201 to S202 in the foregoing embodiment, and the specific implementation manner may refer to the foregoing description of steps S201 to S202, which is not described herein again.

Step S603, inputting the acoustic feature vector into a generation sub-model to obtain a speaker tag.

In the embodiment of the present disclosure, the executing subject of the speech recognition method, for example, the server 103 shown in fig. 1, inputs the acoustic feature vectors extracted in step S602 into the generation submodel, so as to obtain different speaker tags.

In some optional implementations of the embodiments of the present disclosure, a neural network model may be selected as the generation submodel.

When training the generation sub-model, voice data or voiceprint feature vectors from a plurality of different speakers, together with the corresponding speaker tags, may be used. For example, the voiceprint feature vectors of a plurality of different speakers are used as the input of the generation sub-model and one speaker tag at a time is used as the output, and the generation sub-model is trained repeatedly until it can output the corresponding speaker tag for the voiceprint feature vectors of the different speakers.

Step S604, inputting the speaker tag and the acoustic feature vector into the assignment sub-model to obtain a voiceprint sequence corresponding to the speaker tag.

In the embodiment of the present disclosure, the execution subject of the speech recognition method, for example, the server 103 shown in fig. 1, inputs the speaker tag and the acoustic feature vector into the assignment sub-model, and obtains a voiceprint sequence corresponding to the speaker tag.

In this scheme, the assignment sub-model determines which speaker tag each piece of voiceprint information in the acoustic feature vector belongs to; that is, the voiceprint information in the acoustic feature vector is matched to and marked with a speaker tag, so that voiceprint sequences corresponding to different speaker tags are output.

Illustratively, the assignment sub-model may employ a Bayesian non-parametric model, such as a Chinese restaurant process. The Chinese restaurant process is as follows: suppose a Chinese restaurant has an unlimited number of tables, and the first customer to arrive sits at the first table; each subsequent customer chooses a table according to the following rule: sit at an already occupied table with a first probability, or sit at an unoccupied table with a second probability.

In an actual conversation, a speaker who has spoken frequently is more likely to speak again in what follows, and thus the Chinese restaurant process is well suited to assigning the voiceprint information in the acoustic feature vectors to different speaker tags.

In some alternative implementations of embodiments of the present disclosure, the assignment sub-model employs a Chinese restaurant process model. Each time the assignment sub-model receives an input, for example a sentence in the voice data or an acoustic feature vector, the input is assigned to an existing speaker tag with the first probability, or to a brand-new speaker tag with the second probability.

Illustratively, the first probability with which the assignment sub-model assigns an input to an existing speaker tag is positively correlated with the number of inputs already assigned to that speaker tag.
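The first and second probabilities can be made concrete with the standard Chinese restaurant process, in which an existing tag with n_k inputs is chosen with probability n_k / (n + alpha) and a new tag with probability alpha / (n + alpha), where n is the total number of inputs so far. The concentration parameter `alpha` and the function names below are assumptions; the disclosure states only that the first probability grows with the number of inputs under a tag. A practical assignment sub-model would presumably also weigh how well the voiceprint matches each tag, which this sketch omits.

```python
import random
from typing import List


def crp_probabilities(counts: List[int], alpha: float) -> List[float]:
    """Probabilities for each existing speaker tag, followed by the probability of a new tag."""
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]


def assign(counts: List[int], alpha: float, rng: random.Random) -> int:
    """Sample a tag index for one input and update the per-tag counts in place."""
    probs = crp_probabilities(counts, alpha)
    k = rng.choices(range(len(probs)), weights=probs)[0]
    if k == len(counts):
        counts.append(1)  # brand-new speaker tag (second probability)
    else:
        counts[k] += 1    # existing speaker tag (first probability)
    return k


counts: List[int] = []
rng = random.Random(0)
labels = [assign(counts, alpha=1.0, rng=rng) for _ in range(10)]
print(labels, counts)
```

With an empty `counts` list the only option is a new tag, which mirrors the first customer sitting at the first table.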

Step S605, the voiceprint sequence corresponding to the speaker tag is input into the conversion sub-model, and the time tag corresponding to the speaker tag is obtained.

In the embodiment of the present disclosure, the execution subject of the speech recognition method, for example, the server 103 shown in fig. 1, inputs the voiceprint sequence corresponding to the speaker tag obtained in step S604 into the conversion sub-model, so as to obtain the time tag corresponding to the speaker tag.

In some optional implementations, the conversion sub-model may use a binomial distribution model to determine the time at which two adjacent speaker tags in the voiceprint sequence alternate, for example the time at which speaker A finishes speaking and speaker B starts speaking; that is, it determines the time tags of the start time and the end time of each speaker tag transition in the voiceprint sequence.

In some alternative implementations, the execution subject may determine only the time tag of the start time of each speaker tag transition in the voiceprint sequence.
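Once each frame of the voiceprint sequence carries a speaker tag, the time tags follow from the positions where the tag changes. The sketch below shows only that bookkeeping and does not model the binomial distribution mentioned above; the per-frame label list, the fixed hop duration and the name `turn_time_tags` are assumptions.

```python
from typing import List, Tuple


def turn_time_tags(frame_labels: List[str], hop_seconds: float) -> List[Tuple[str, float, float]]:
    """Return (speaker tag, start time, end time) for each run of identical labels."""
    tags: List[Tuple[str, float, float]] = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of the sequence or when the speaker tag changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            tags.append((frame_labels[start], start * hop_seconds, i * hop_seconds))
            start = i
    return tags


print(turn_time_tags(["a", "a", "b", "b", "b", "a"], hop_seconds=0.5))
```

Keeping only the second element of each tuple gives the variant in which only the start time of each transition is recorded.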

Step S606, according to the speaker label and the time label, the voice data is identified, and a voice identification result with the speaker label and the time label is generated.

In the embodiment of the present disclosure, the executing subject of the speech recognition method, for example, the server 103 shown in fig. 1, executes the step S606. Step S606 is substantially the same as step S204 in the foregoing embodiment, and the specific implementation manner may refer to the foregoing description of step S204, which is not described herein again.

According to the voice recognition method provided by the embodiment of the present disclosure, the generation sub-model obtains the speaker tag from the acoustic feature vector, the assignment sub-model obtains the voiceprint sequence corresponding to the speaker tag from the speaker tag and the acoustic feature vector, and the conversion sub-model obtains the time tag corresponding to the speaker tag from that voiceprint sequence. The clustering of the acoustic feature vectors is thereby completed, and an accurate speaker tag and corresponding time tag are obtained, so that the voice data can be accurately marked and the recognition accuracy of the voice data is improved.

As an implementation of the methods illustrated in the above figures, FIG. 7 illustrates one embodiment of a speech recognition device according to the present disclosure. The speech recognition apparatus corresponds to the method embodiment shown in fig. 2, and the apparatus can be applied to various electronic devices.

Referring to fig. 7, a speech recognition apparatus 700 provided in an embodiment of the present disclosure includes: an acquisition module 701, an extraction module 702, a clustering module 703, and a recognition module 704. The acquisition module 701 is configured to acquire voice data; the extraction module 702 is configured to extract acoustic feature vectors of the voice data according to at least two frame lengths; the clustering module 703 is configured to cluster the acoustic feature vectors to obtain speaker tags and time tags corresponding to the speaker tags; the recognition module 704 is configured to recognize the voice data according to the speaker tag and the time tag, and generate a voice recognition result with the speaker tag and the time tag.

In the speech recognition apparatus 700 of this embodiment, for the specific processing of the acquisition module 701, the extraction module 702, the clustering module 703 and the recognition module 704 and the technical effects thereof, reference may be made to the related descriptions of steps S201 to S204 in the embodiment corresponding to fig. 2, which are not repeated herein.
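The division of the apparatus into four modules can be pictured as a simple composition of callables, one per module. This is only a structural sketch: the class name, the field names and the type of the intermediate values are assumptions, and each module body is left to the implementer.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Tag = Tuple[str, float, float]  # (speaker tag, start time tag, end time tag)


@dataclass
class SpeechRecognitionApparatus:
    acquire: Callable[[], Sequence[float]]                          # acquisition module 701
    extract: Callable[[Sequence[float]], Any]                       # extraction module 702
    cluster: Callable[[Any], List[Tag]]                             # clustering module 703
    recognize: Callable[[Sequence[float], List[Tag]], List[str]]    # recognition module 704

    def run(self) -> List[str]:
        """Chain the modules in the order of steps S201 to S204."""
        voice = self.acquire()
        features = self.extract(voice)
        tags = self.cluster(features)
        return self.recognize(voice, tags)


# Usage with placeholder modules standing in for real implementations.
apparatus = SpeechRecognitionApparatus(
    acquire=lambda: [0.0] * 8,
    extract=lambda voice: [len(voice)],
    cluster=lambda features: [("speaker_1", 0.0, 1.0)],
    recognize=lambda voice, tags: [f"{s} [{a}-{b}s]: <text>" for s, a, b in tags],
)
print(apparatus.run())
```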

In some optional implementations of embodiments of the present disclosure, the extraction module 702 includes a segmentation unit and an extraction unit. The segmentation unit is configured to segment the voice data according to at least two frame lengths respectively to generate at least two frame data sets, and the frame data sets correspond to the frame lengths; the extraction unit is configured to extract acoustic feature vectors of the speech data from the at least two sets of framed data.

In some optional implementations of embodiments of the present disclosure, the extraction unit is configured to: respectively extracting voiceprint information of each frame data in at least two frame data sets; and aggregating the voiceprint information of each frame data to obtain the acoustic feature vector of the voice data.

In some optional implementations of embodiments of the present disclosure, the clustering module 703 is configured to: cluster the acoustic feature vectors by using a pre-trained clustering model to obtain a speaker tag and a time tag corresponding to the speaker tag.

In some optional implementations of embodiments of the present disclosure, the clustering model includes a generation submodel, an assignment submodel, and a conversion submodel, and the clustering module 703 is configured to: inputting the acoustic feature vector into a generation sub-model to obtain a speaker tag; inputting the speaker tag and the acoustic feature vector into an allocation sub-model to obtain a voiceprint sequence corresponding to the speaker tag; and inputting the voiceprint sequence corresponding to the speaker tag into the conversion sub-model to obtain a time tag corresponding to the speaker tag.

In the speech recognition apparatus 700 of this embodiment, the specific processing of the clustering module 703 and the technical effects thereof can refer to the related descriptions of steps S603-S605 in the embodiment corresponding to fig. 6, and are not described herein again.

In some optional implementations of embodiments of the present disclosure, the recognition module 704 is configured to: divide the voice data according to the speaker tag and the time tag to obtain statement data corresponding to the speaker tag and the time tag; recognize the statement data and generate a statement recognition result; and generate a voice recognition result with the speaker tag and the time tag according to the speaker tag, the time tag and the statement recognition result.

The present disclosure also provides an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product, in accordance with embodiments of the present disclosure.

The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech recognition method.

In some embodiments, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the above-described speech recognition method.

In some embodiments, a computer program product comprises a computer program which, when executed by a processor, implements the above-described speech recognition method.

FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the above-described speech recognition method may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the speech recognition method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A speech recognition method comprising:

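acquiring voice data;

extracting acoustic feature vectors of the voice data according to at least two frame lengths;

clustering the acoustic feature vectors to obtain speaker labels and time labels corresponding to the speaker labels;

and recognizing the voice data according to the speaker tag and the time tag, and generating a voice recognition result with the speaker tag and the time tag.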

2. The speech recognition method of claim 1, wherein the extracting acoustic feature vectors of the speech data in at least two frame lengths comprises:

dividing the voice data according to the at least two frame lengths respectively to generate at least two frame data sets, wherein the frame data sets correspond to the frame lengths;

and extracting acoustic feature vectors of the voice data according to the at least two frame data sets.

3. The speech recognition method of claim 2, wherein the extracting acoustic feature vectors of the speech data from the at least two sets of framed data comprises:

respectively extracting voiceprint information of each frame data in the at least two frame data sets;

and aggregating the voiceprint information of each frame data to obtain the acoustic feature vector of the voice data.

4. The speech recognition method according to any one of claims 1-3, wherein the clustering the acoustic feature vectors to obtain speaker labels and time labels corresponding to the speaker labels comprises:

and clustering the acoustic feature vectors by adopting a pre-trained clustering model to obtain the speaker label and a time label corresponding to the speaker label.

5. The speech recognition method of claim 4, wherein the clustering model comprises a generation submodel, an assignment submodel, and a conversion submodel, and

the clustering of the acoustic feature vectors by adopting a pre-trained clustering model to obtain the speaker label and a time label corresponding to the speaker label comprises the following steps:

inputting the acoustic feature vector into the generation submodel to obtain the speaker label;

inputting the speaker tag and the acoustic feature vector into the assignment submodel to obtain a voiceprint sequence corresponding to the speaker tag;

and inputting the voiceprint sequence corresponding to the speaker tag into the conversion sub-model to obtain a time tag corresponding to the speaker tag.

6. The speech recognition method according to any one of claims 1-5, wherein the recognizing the speech data according to the speaker tag and the time tag, and generating the speech recognition result with the speaker tag and the time tag comprises:

according to the speaker tag and the time tag, segmenting the voice data to obtain statement data corresponding to the speaker tag and the time tag;

recognizing the statement data to generate a statement recognition result;

and generating a voice recognition result with the speaker tag and the time tag according to the speaker tag, the time tag and the sentence recognition result.

7. A speech recognition apparatus comprising:

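an acquisition module configured to acquire voice data;

an extraction module configured to extract acoustic feature vectors of the voice data according to at least two frame lengths;

a clustering module configured to cluster the acoustic feature vectors to obtain speaker labels and time labels corresponding to the speaker labels; and

a recognition module configured to recognize the voice data according to the speaker tag and the time tag, and generate a voice recognition result with the speaker tag and the time tag.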

8. The speech recognition device of claim 7, wherein the extraction module comprises:

a dividing unit configured to divide the voice data according to the at least two frame lengths, respectively, and generate at least two frame data sets, where the frame data sets correspond to the frame lengths;

an extraction unit configured to extract acoustic feature vectors of the speech data according to the at least two sets of framed data.

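9. The speech recognition device according to claim 8, wherein the extraction unit is configured to:

respectively extract voiceprint information of each frame data in the at least two frame data sets; and

aggregate the voiceprint information of each frame data to obtain the acoustic feature vector of the voice data.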

10. The speech recognition device of any of claims 7-9, wherein the clustering module is configured to: cluster the acoustic feature vectors by adopting a pre-trained clustering model to obtain the speaker label and a time label corresponding to the speaker label.

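11. The speech recognition apparatus of claim 10, wherein the clustering model includes a generation submodel, an assignment submodel, and a conversion submodel, and the clustering module is configured to:

input the acoustic feature vector into the generation submodel to obtain the speaker tag;

input the speaker tag and the acoustic feature vector into the assignment submodel to obtain a voiceprint sequence corresponding to the speaker tag;

and input the voiceprint sequence corresponding to the speaker tag into the conversion submodel to obtain a time tag corresponding to the speaker tag.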

12. The speech recognition device of any of claims 7-11, wherein the recognition module is configured to:

divide the voice data according to the speaker tag and the time tag to obtain statement data corresponding to the speaker tag and the time tag;

recognize the statement data to generate a statement recognition result; and

generate a voice recognition result with the speaker tag and the time tag according to the speaker tag, the time tag and the statement recognition result.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.

15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.