CN114220421A - Method and device for generating timestamp at word level, electronic equipment and storage medium - Google Patents

Method and device for generating timestamp at word level, electronic equipment and storage medium

Info

Publication number
CN114220421A
Authority
CN
China
Prior art keywords
word
time corresponding
determining
probability
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111547980.3A
Other languages
Chinese (zh)
Inventor
范红亮
李轶杰
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202111547980.3A priority Critical patent/CN114220421A/en
Publication of CN114220421A publication Critical patent/CN114220421A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/05: Word boundary detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks


Abstract

The application relates to a method and an apparatus for generating word-level timestamps, an electronic device, and a storage medium. The method comprises the following steps: determining a probability peak for each word during a frame-by-frame decoding process; determining the time corresponding to the tail end point of each word according to the probability peak value of that word; determining the time corresponding to the head end point of each word according to the time corresponding to its tail end point; and generating a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word. The method determines the probability peak of each word from the scores output by the deep neural network and from how those scores change as each word is emitted during decoding, and derives the times corresponding to the head end point and the tail end point of each word from that peak. It thus provides a way to obtain word-level timestamps: accurate word-level timestamp information can be output, high-precision boundary information is obtained, and user experience is improved.

Description

Method and device for generating timestamp at word level, electronic equipment and storage medium
Technical Field
The present application relates to the field of timestamp technology, and in particular, to a method and an apparatus for generating a word-level timestamp, an electronic device, and a storage medium.
Background
The conventional Kaldi-based speech recognition system can obtain the boundary information of each word from a lattice. Although the end-to-end speech recognition systems now popular in the industry exceed the traditional systems in recognition rate, many of them provide no timestamp information, or only rough timestamps, for example by judging word boundaries directly from the neural network scores; at present there is no relatively mature algorithm for obtaining the timestamp information of each word.
Disclosure of Invention
Based on the problem that no relatively mature algorithm currently exists for obtaining the timestamp information of each word, the present application provides a method and an apparatus for generating word-level timestamps, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a method for generating a word-level timestamp, including:
determining a probability peak for each word during a frame-by-frame decoding process;
determining the time corresponding to the tail end point of each word according to the probability peak value of each word;
determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word;
and generating a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the tail end point of each word according to the probability peak value of each word includes:
comparing the probability peak value of each word with the current probability value of that word; and
if the difference between the probability peak value and the current probability value is greater than or equal to a preset threshold value, determining the time corresponding to the current probability value as the time corresponding to the tail end point.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the tail end point of each word according to the probability peak value of each word includes:
if the current word ends and is followed by a silence segment, and the difference between the probability peak value and the current probability value is smaller than the preset threshold value, delaying the time corresponding to the probability peak value of each word by a first preset time to determine the time corresponding to the tail end point of each word.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word includes:
moving the time corresponding to the tail end point of each word earlier by a second preset time to determine the time corresponding to the head end point of each word.
Further, the method for generating a word-level timestamp further includes:
and determining the time corresponding to the head point of each word according to the probability peak value of each word.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the head end point of each word according to the probability peak value of each word includes:
moving the time corresponding to the probability peak value of each word earlier by a first preset time to determine the time corresponding to the head end point of each word.
Further, in the above method for generating a word-level timestamp, the probability peak is a log probability.
In a second aspect, an embodiment of the present application provides an apparatus for generating a timestamp at a word level, including:
a first determination module, configured to determine the probability peak value of each word during the frame-by-frame decoding process;
a second determination module, configured to determine the time corresponding to the tail end point of each word according to the probability peak value of each word;
a third determination module, configured to determine the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word;
and a fourth determination module, configured to generate a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to execute the above method for generating a word-level timestamp by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to perform the above method for generating a word-level timestamp.
The embodiments of the application have the following advantages. The application relates to a method and an apparatus for generating word-level timestamps, an electronic device, and a storage medium, the method comprising: determining a probability peak for each word during a frame-by-frame decoding process; determining the time corresponding to the tail end point of each word according to the probability peak value of that word; determining the time corresponding to the head end point of each word according to the time corresponding to its tail end point; and generating a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word. The method determines the probability peak of each word from the scores output by the deep neural network and from how those scores change as each word is emitted during decoding, derives the times corresponding to the head end point and the tail end point of each word from that peak, and thereby provides a way to obtain word-level timestamps: accurate word-level timestamp information can be output, high-precision boundary information is obtained, and user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the conventional technology, the drawings needed in the description of the embodiments or the conventional technology are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first schematic diagram of a method for generating a word-level timestamp according to an embodiment of the present application;
Fig. 2 is a second schematic diagram of a method for generating a word-level timestamp according to an embodiment of the present application;
Fig. 3 is a third schematic diagram of a method for generating a word-level timestamp according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an apparatus for generating a word-level timestamp according to an embodiment of the present application;
Fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The technical background of the present application is first described below:
the neural network model of the end-to-end speech recognition engine typically outputs a matrix of T x M. Where T represents the number of frames of the audio and M represents the size of the dictionary. The element ixj in the matrix represents the probability of the model output j at time i, typically using log probability. Subsequently, on the basis of the matrix, a Decoding algorithm (such as CTC Prefix Beam Search, Time Sync Decoding, Align length Sync Decoding and the like) is used to obtain a final recognition result, and the boundary information timestamp of each word is obtained in the Decoding process.
In frame-by-frame decoding, each frame has an optimal path, and the score of the optimal path is the sum of the log probabilities at every frame the path passes through. Normally a word covers several frames, and from its beginning to its end the probability along the path follows a rough pattern: it grows from small to large, then either stays flat or jumps to the next word. At first little information is available, so the probability of matching the word is not very large; as decoding advances, the frames look more and more "like" the word, i.e. the probability increases. Afterwards the path either transitions smoothly or jumps to the next word.
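As a concrete illustration of this background, the following minimal Python sketch shows what such a T × M log-probability matrix looks like and how the per-frame score of a single word can be tracked to find its probability peak. The toy random scores and the names `log_probs` and `track_word_score` are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

# Toy stand-in for the encoder output: T frames x M dictionary entries.
# Real systems obtain this matrix from the neural network; here it is random.
T, M = 200, 5000
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, M))
# Convert raw scores to log probabilities (log-softmax over the dictionary axis).
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

def track_word_score(log_probs, word_id, start_frame, end_frame):
    """Return the per-frame log probability of `word_id` over a frame span,
    plus the frame where it peaks.  During decoding this score typically
    rises while the word is being spoken; its maximum is the word's
    "probability peak" used by the method described above."""
    span = log_probs[start_frame:end_frame, word_id]
    peak_frame = start_frame + int(np.argmax(span))
    return span, peak_frame

scores, peak_frame = track_word_score(log_probs, word_id=42, start_frame=30, end_frame=60)
print("peak frame:", peak_frame, "peak log prob:", float(scores.max()))
```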
Fig. 1 is a first schematic diagram illustrating a method for generating a word-level timestamp according to an embodiment of the present application.
In a first aspect, an embodiment of the present application provides a method for generating a word-level timestamp, which, with reference to fig. 1, includes four steps S101 to S104:
s101: in the frame-by-frame decoding process, the probability peak for each word is determined.
Specifically, in the embodiment of the present application, in the frame-by-frame decoding process, the probability peak of each word is determined by determining the maximum log probability score when each word appears as the latest word.
S102: and determining the corresponding time of the tail end point of each word according to the probability peak value of each word.
Specifically, in the embodiment of the present application, after the maximum log probability score of each word is determined, the time corresponding to the tail end point of each word is determined according to the maximum log probability score of each word, and the time corresponding to the tail end point of each word is introduced in combination with specific steps below.
S103: and determining the time corresponding to the head point of each word according to the time corresponding to the tail point of each word.
Specifically, in the embodiment of the present application, after the time corresponding to the tail end point of each word is determined, the time corresponding to the tail end point of each word may be shifted forward by approximately one word time, so that the time corresponding to the head end point of each word may be determined, which is described below with reference to a specific example.
S104: and generating a time stamp of the word level according to the time corresponding to the head point of each word and the time corresponding to the tail point of each word.
Specifically, in the embodiment of the present application, the time corresponding to the head point of each word and the time corresponding to the tail point of each word are determined, and the time stamp of each word can be determined according to the time between the head point and the tail point.
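To make step S104 concrete, the short sketch below shows how a word-level timestamp could be assembled once the head and tail end points are known as frame indices. The 40 ms frame shift and the helper name `make_timestamp` are assumptions for illustration, not values fixed by the application.

```python
FRAME_SHIFT_MS = 40  # assumed output frame shift of the acoustic model

def make_timestamp(word, head_frame, tail_frame, frame_shift_ms=FRAME_SHIFT_MS):
    """Map the head/tail end points of a word (frame indices) to a
    word-level timestamp in milliseconds."""
    return {
        "word": word,
        "start_ms": head_frame * frame_shift_ms,
        "end_ms": tail_frame * frame_shift_ms,
    }

# Example: a word whose head end point is frame 30 and tail end point is frame 36.
print(make_timestamp("hello", head_frame=30, tail_frame=36))
# {'word': 'hello', 'start_ms': 1200, 'end_ms': 1440}
```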
Fig. 2 is a second schematic diagram of a method for generating a word-level timestamp according to an embodiment of the present application.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the tail end point of each word according to the probability peak value of each word, with reference to Fig. 2, includes three steps S201 to S203:
S201: comparing the probability peak value of each word with the current probability value of that word;
S202: judging whether the difference between the probability peak value of each word and the current probability value of that word is greater than or equal to a preset threshold value;
S203: if so, determining the time corresponding to the current probability value as the time corresponding to the tail end point.
Specifically, in the embodiment of the present application, this handles the case where the current word lasts for a period of time and decoding then jumps immediately to the next word. The jump point of the current word needs to be found: a relative threshold, for example 0.1%, is set, and the current probability value of the word is compared with its probability peak. If the difference between the probability peak value and the current probability value is greater than or equal to the preset threshold (e.g. 0.1%), the time corresponding to the current probability value is determined to be the time corresponding to the tail end point.
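A minimal sketch of this jump-point test follows. The way the relative threshold is applied (against the magnitude of the peak log probability) and the helper name `tail_point_by_drop` are assumptions, since the description only states that a relative threshold such as 0.1% is compared against.

```python
def tail_point_by_drop(frame_scores, peak_score, rel_threshold=0.001):
    """Scan (frame, log_prob) pairs after the peak and return the first frame
    whose score has fallen from the peak by at least the relative threshold
    (e.g. 0.1%); that frame's time is taken as the word's tail end point.
    Returns None if no such drop is found (the silence case handled below)."""
    for frame, score in frame_scores:
        if peak_score - score >= abs(peak_score) * rel_threshold:
            return frame
    return None
```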
Fig. 3 is a third schematic diagram of a method for generating a word-level timestamp according to an embodiment of the present application.
Further, in the above method for generating a timestamp at a word level, determining a time corresponding to a tail point of each word according to a probability peak of each word, with reference to fig. 3, includes two steps S301 to S302:
S301: if the current word ends and is followed by a silence segment, and the difference between the probability peak value of the word and its current probability value remains smaller than the preset threshold value;
S302: delaying the time corresponding to the probability peak value of each word by a first preset time to determine the time corresponding to the tail end point of each word.
Specifically, in the embodiment of the present application, this handles the case where the current word ends and is followed by a silence segment. Here the score drops from the probability peak by less than the preset threshold (e.g. 0.1%) and stays that way for a long time, so the time of the probability peak is delayed by a first preset time, such as 120 ms (about half the duration of a word), to obtain the time corresponding to the tail end point of each word.
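The silence case reduces to a simple offset rule, sketched below; the 120 ms value comes from the description, while the function name is an assumption.

```python
FIRST_PRESET_MS = 120  # about half the duration of a word, per the description

def tail_point_after_silence(peak_time_ms, first_preset_ms=FIRST_PRESET_MS):
    """When the word ends into silence and the score never drops past the
    threshold, delay the peak time by the first preset time to get the tail
    end point."""
    return peak_time_ms + first_preset_ms
```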
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word includes:
moving the time corresponding to the tail end point of each word earlier by a second preset time to determine the time corresponding to the head end point of each word.
Specifically, in this embodiment of the application, after the time corresponding to the tail end point of each word is determined, it can be moved earlier by a second preset time, such as 240 ms (approximately the duration of one word), to obtain the time corresponding to the head end point of each word.
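A matching sketch for deriving the head end point from the tail end point follows; the 240 ms value comes from the description, and the function name is an assumption.

```python
SECOND_PRESET_MS = 240  # roughly the duration of one word, per the description

def head_point_from_tail(tail_time_ms, second_preset_ms=SECOND_PRESET_MS):
    """Move the tail end point earlier by the second preset time to obtain
    the head end point, clamped at the start of the audio."""
    return max(0, tail_time_ms - second_preset_ms)
```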
Further, the method for generating a word-level timestamp further includes:
and determining the time corresponding to the head point of each word according to the probability peak value of each word.
Specifically, in the embodiment of the present application, in addition to the above-described comparison between the probability peak value of each word and the current probability value of each word, the time corresponding to the head end point of each word may also be determined by the time corresponding to the probability peak value of each word.
Further, in the above method for generating a word-level timestamp, determining the time corresponding to the head end point of each word according to the probability peak value of each word includes:
moving the time corresponding to the probability peak value of each word earlier by a first preset time to determine the time corresponding to the head end point of each word.
Specifically, in the embodiment of the present application, the time corresponding to the probability peak value of each word is moved earlier by the first preset time, such as 120 ms (about half the duration of a word), to obtain the time corresponding to the head end point of each word.
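The alternative route from the probability peak can be sketched the same way; again, 120 ms is the value given in the description and the function name is an assumption.

```python
FIRST_PRESET_MS = 120  # about half the duration of a word, per the description

def head_point_from_peak(peak_time_ms, first_preset_ms=FIRST_PRESET_MS):
    """Move the probability-peak time earlier by the first preset time to
    obtain the head end point, clamped at the start of the audio."""
    return max(0, peak_time_ms - first_preset_ms)
```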
Further, in the above method for generating a word-level timestamp, the probability peak is a log probability.
Specifically, in the embodiment of the present application, the neural network model of the end-to-end speech recognition engine typically outputs a T × M matrix, where T is the number of audio frames and M is the dictionary size. Element (i, j) of the matrix represents the probability that the model outputs entry j at time i, and a log probability is typically used, so the probability peak is a log probability.
Fig. 4 is a schematic diagram of an apparatus for generating a word-level timestamp according to an embodiment of the present application.
In a second aspect, an embodiment of the present application provides an apparatus for generating a word-level timestamp, which, in conjunction with fig. 4, includes:
the first determination module 401: for use in a frame decoding process, the probability peak for each word is determined.
Specifically, in the embodiment of the present application, in the frame-by-frame decoding process, the first determining module 401 determines the probability peak of each word, which is the maximum log probability score when each word appears as the latest word.
The second determination module 402: for determining the time corresponding to the tail point of each word from the probability peak of each word.
Specifically, in the embodiment of the present application, after determining the maximum log probability score of each word, the second determining module 402 determines the time corresponding to the tail end point of each word according to the maximum log probability score of each word, and the time corresponding to the tail end point of each word is introduced in combination with the above specific steps.
The third determination module 403: and the time corresponding to the head point of each word is determined according to the time corresponding to the tail point of each word.
Specifically, in the embodiment of the present application, after the time corresponding to the tail end point of each word is determined, the time corresponding to the tail end point of each word may be shifted forward by approximately one word time, and the third determining module 403 may determine the time corresponding to the head end point of each word, which is described above with reference to the specific example.
The fourth determination module 404: and the time stamp is used for generating the time stamp of the word level according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
Specifically, in the embodiment of the present application, the time corresponding to the head point of each word and the time corresponding to the tail point of each word are determined, and the fourth determining module 404 may determine the timestamp of each word according to the time between the head point and the tail point.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to execute the above method for generating a word-level timestamp by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to perform the above method for generating a word-level timestamp.
Fig. 5 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in Fig. 5, the electronic device includes: at least one processor 501, at least one memory 502, and at least one communication interface 503. The components of the electronic device are coupled together by a bus system 504, and the communication interface 503 is used for information transmission with external devices. It is understood that the bus system 504 is used to enable communication among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus, but for clarity all of these buses are labeled as bus system 504 in Fig. 5.
It will be appreciated that the memory 502 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. A program for implementing any one of the word-level timestamp generation methods provided by the embodiments of the present application may be included in an application program.
In this embodiment of the present application, the processor 501 executes the steps of the method for generating a word-level timestamp provided herein by calling a program or instructions stored in the memory 502, specifically a program or instructions stored in an application program, namely:
determining a probability peak for each word during a frame-by-frame decoding process;
determining the time corresponding to the tail end point of each word according to the probability peak value of each word;
determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word;
and generating a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
Any of the methods for generating a word-level timestamp provided in the embodiments of the present application may be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The steps of any of the methods for generating a word-level timestamp provided in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software units in a decoding processor. The software units may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and, in combination with its hardware, performs the steps of the method for generating a word-level timestamp.
Those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments rather than others, combinations of features of different embodiments are within the scope of the application and form further embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a word-level timestamp, comprising:
determining a probability peak for each word during a frame-by-frame decoding process;
determining the time corresponding to the tail end point of each word according to the probability peak value of each word;
determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word; and
generating a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
2. The method of claim 1, wherein said determining the time corresponding to the tail end point of each word according to the probability peak of each word comprises:
comparing the probability peak value of each word with the current probability value of that word; and
if the difference between the probability peak value of each word and the current probability value of that word is greater than or equal to a preset threshold value, determining the time corresponding to the current probability value as the time corresponding to the tail end point.
3. The method of claim 1, wherein said determining the time corresponding to the tail end point of each word according to the probability peak of each word comprises:
if the current word ends and is followed by a silence segment, and the difference between the probability peak value of each word and the current probability value of that word is smaller than the preset threshold value, delaying the time corresponding to the probability peak value of each word by a first preset time to determine the time corresponding to the tail end point of each word.
4. The method of claim 1, wherein determining the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word comprises:
moving the time corresponding to the tail end point of each word earlier by a second preset time to determine the time corresponding to the head end point of each word.
5. The method of generating a word-level timestamp as claimed in claim 1, further comprising:
and determining the time corresponding to the head point of each word according to the probability peak value of each word.
6. The method of claim 5, wherein said determining the time corresponding to the head end point of each word according to said probability peak of each word comprises:
moving the time corresponding to the probability peak value of each word earlier by a first preset time to determine the time corresponding to the head end point of each word.
7. The method of claim 1, wherein the probability peak is a log probability.
8. An apparatus for generating a word-level time stamp, comprising:
a first determination module, configured to determine the probability peak value of each word during the frame-by-frame decoding process;
a second determination module, configured to determine the time corresponding to the tail end point of each word according to the probability peak value of each word;
a third determination module, configured to determine the time corresponding to the head end point of each word according to the time corresponding to the tail end point of each word; and
a fourth determination module, configured to generate a word-level timestamp according to the time corresponding to the head end point of each word and the time corresponding to the tail end point of each word.
9. An electronic device, comprising: a processor and a memory;
the processor is used for executing a method for generating a word-level time stamp according to any one of claims 1 to 7 by calling a program or instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to execute a method of generating a word-level time stamp according to any one of claims 1 to 7.
CN202111547980.3A 2021-12-16 2021-12-16 Method and device for generating timestamp at word level, electronic equipment and storage medium Pending CN114220421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111547980.3A CN114220421A (en) 2021-12-16 2021-12-16 Method and device for generating timestamp at word level, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114220421A true CN114220421A (en) 2022-03-22

Family

ID=80703487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111547980.3A Pending CN114220421A (en) 2021-12-16 2021-12-16 Method and device for generating timestamp at word level, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114220421A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482809A (en) * 2022-09-19 2022-12-16 北京百度网讯科技有限公司 Keyword search method, keyword search device, electronic equipment and storage medium
CN115482809B (en) * 2022-09-19 2023-08-11 北京百度网讯科技有限公司 Keyword retrieval method, keyword retrieval device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination