CN115662432A

CN115662432A - Punctuation prediction method and device and voice recognition equipment

Info

Publication number: CN115662432A
Application number: CN202211184502.5A
Authority: CN
Inventors: 马志强; 岳文浩; 张宝军
Original assignee: Hisense Visual Technology Co Ltd
Current assignee: Hisense Visual Technology Co Ltd
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2023-01-31

Abstract

The disclosure relates to a punctuation prediction method, a punctuation prediction device and a voice recognition device. The method comprises the following steps: the method comprises the steps of obtaining the probability of punctuation marks appearing after each character in a transcribed text and the probability of non-punctuation marks based on a punctuation prediction model, obtaining first text information and second text information corresponding to an original audio, correcting the probability of non-punctuation marks appearing after each character in the transcribed text according to the first text information and the second text information, and obtaining the corrected prediction probability information of the non-punctuation marks. The transcribed text is a text sequence obtained by carrying out voice recognition processing on the original audio; the first text information is obtained by performing audio truncation processing on the original audio, and comprises voice characteristic information and non-voice characteristic information corresponding to the original audio; the second text information is obtained by decoding the original audio, and comprises transcribed literal characters and transcribed empty characters corresponding to the original audio. By adopting the method, the accuracy of punctuation prediction can be improved.

Description

Punctuation prediction method and device and voice recognition equipment

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a punctuation prediction method, a punctuation prediction device, and a speech recognition device.

Background

In the scene of man-machine interaction, speech recognition plays a crucial role in natural language understanding and natural language generation. Punctuation prediction is performed on a transcribed text of speech recognition, which is an important work for semantic understanding and interaction. The correct punctuation mark plays a great auxiliary role in understanding the semantics. Fuzzy or mismatched punctuation can play the role of misideation and misleading, thereby influencing the whole process of voice interaction.

In the related art, the punctuation prediction in speech recognition is mostly based on the punctuation of the context semantically output whole sentence of the transcribed text. However, when the transcribed text is too short and the available information is insufficient, the probability of the occurrence of a wrong or missed bid is extremely high in the above manner. Therefore, how to improve the accuracy of the landmark prediction in speech recognition is a problem that needs to be solved urgently at present.

Disclosure of Invention

In order to solve the technical problem or at least partially solve the technical problem, the disclosure provides a display device and a punctuation prediction method, which can determine semantic understanding contents corresponding to participles of various food materials when a user speaks a situation of storing multiple food materials, thereby effectively improving the semantic understanding capability of the display device and further improving user experience.

When a user watches multimedia content, target bullet screen content corresponding to the multimedia content is provided for the user, so that the user can send bullet screen content which the user wants to send, and the watching experience and satisfaction of the user are improved.

In a first aspect, the present disclosure provides a punctuation prediction method, including:

acquiring the probability of punctuation labels appearing after each character in the transcribed text and the probability of non-punctuation labels based on a punctuation prediction model; the transcription text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of punctuation labels and the probability of non-punctuation labels after each character in the transcription text appears;

acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio;

and correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and acquiring the corrected prediction probability information of the non-punctuation mark.

As an optional implementation manner of the embodiment of the present disclosure, the correcting, according to the first text information and the second text information, the probability of occurrence of a non-punctuation mark after each character in the transcribed text, and obtaining prediction probability information of the corrected non-punctuation mark includes:

acquiring a first normalized position weight, a second normalized position weight and a third normalized position weight;

the first normalized position weight is the normalized weight of the number of the empty characters behind each character in the first text message, the second normalized position weight is the normalized weight of the number of the empty characters behind each character in the second text message, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;

acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;

and acquiring the prediction probability information of the corrected non-punctuation labels according to the probability of the non-punctuation labels appearing after each corrected character.

As an optional implementation manner of the embodiment of the present disclosure, the method further includes:

the probability of punctuation marks appearing after each character in the transcribed text and the probability of non-punctuation marks appearing after each character in the transcribed text are in negative correlation.

As an optional implementation manner of the embodiment of the present disclosure, the obtaining, according to the probability of occurrence of a non-punctuation mark after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, the probability of occurrence of a non-punctuation mark after each character after correction includes:

calculating the probability of non-punctuation marks appearing after each character in the transcribed text, the third normalization position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of non-punctuation marks appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.

As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the calculating, according to the preset correction manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction includes:

and determining the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text and the sum of the products of the audio intervention adjusting parameter and the third normalized position weight.

As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the calculating, according to the preset modification manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after modification includes:

and determining the probability of the non-punctuation mark appearing after each corrected character according to the product of the probability of the non-punctuation mark appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.

As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the obtaining a first normalized position weight, a second normalized position weight, and a third normalized position weight includes:

acquiring a first normalized position weight according to the sum of the number of the hollow characters in the first text information, the average number of the hollow characters in the first text information and the number of the hollow characters behind each character in the first text information;

acquiring a second normalized position weight according to the sum of the number of the hollow characters in the second text information, the average number of the hollow characters in the second text information and the number of the hollow characters behind each character in the second text information;

and acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.

As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:

acquiring a first normalized position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;

acquiring a second normalized position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and a scaling factor;

acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight;

wherein N is an integer of 1 or more.

In a second aspect, a semantic understanding apparatus is provided, and the method includes:

the punctuation probability acquisition module is used for acquiring the probability of punctuation labels and the probability of non-punctuation labels after each character in the transcribed text is acquired based on the punctuation prediction model; the transcription text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of punctuation labels and the probability of non-punctuation labels after each character in the transcription text;

the audio text acquisition module is used for acquiring first text information corresponding to an original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio;

and the punctuation probability correction module is used for correcting the probability of the occurrence of non-punctuation labels after each character in the transcribed text according to the first text information and the second text information and acquiring the corrected prediction probability information of the non-punctuation labels.

As an optional implementation manner of the embodiment of the present disclosure, the punctuation probability correction module includes:

a weight acquisition unit configured to acquire a first normalized position weight, a second normalized position weight, and a third normalized position weight;

the first normalized position weight is the normalized weight of the number of empty characters behind each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters behind each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;

the character probability correction unit is used for acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;

and the punctuation probability correction unit is used for acquiring the prediction probability information of the corrected non-punctuation label according to the probability of the non-punctuation label appearing after each corrected character.

As an optional implementation manner of the embodiment of the present disclosure, the probability of the punctuation label appearing after each character in the transcribed text and the probability of the non-punctuation label appearing after each character in the transcribed text are in negative correlation.

As an optional implementation manner of the embodiment of the present disclosure, the character probability correction unit is specifically configured to:

calculating the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of the non-punctuation mark appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.

and determining the probability of the non-punctuation label appearing after each corrected character according to the product of the probability of the non-punctuation label appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.

As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a first preset modification manner, the weight obtaining unit is specifically configured to:

As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the weight obtaining unit is specifically configured to:

acquiring a first normalization position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;

wherein N is an integer of 1 or more.

In a third aspect, a speech recognition device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the punctuation prediction method as shown in the first aspect when executing the computer program.

In a fourth aspect, a computer-readable storage medium is provided, comprising: the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the punctuation prediction method as presented in the first aspect.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages: the method comprises the steps of obtaining the probability of a punctuation label appearing after each character in a transcribed text and the probability of a non-punctuation label based on a punctuation prediction model, obtaining first text information corresponding to an original audio and second text information corresponding to the original audio, correcting according to the probability of the non-punctuation label appearing after each character in the transcribed text is obtained according to the first text information corresponding to the original audio and the second text information corresponding to the original audio, and obtaining the corrected prediction probability information of the non-punctuation label. The transcribed text is a text sequence obtained by performing speech recognition processing on an original audio, the punctuation prediction model is used for outputting the probability of a non-punctuation label of the probability of the punctuation label appearing after each character in the transcribed text, the first text information is the text information obtained by performing audio truncation processing on the original audio, the first text information comprises speech characteristic information and non-speech characteristic information corresponding to the original audio, the second text information is the text information obtained by decoding the original audio, and the second text information comprises transcribed word characters and transcribed empty characters corresponding to the original audio. The probability of non-punctuation appearing behind each character in the original audio is corrected through the non-voice characteristic information corresponding to the original audio and the character characters and the null characters corresponding to the original audio, not only is the context semantic meaning of the transcribed text recognized based on voice, but also the probability of non-punctuation appearing behind each character is predicted by combining the two audio information of the first text information and the second text information, and then the prediction probability information of the corrected non-punctuation label is obtained.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1A is a schematic diagram of a prior art tag prediction process;

fig. 1B is a schematic diagram of an application scenario of a tag prediction process in an embodiment of the present disclosure;

fig. 2A is a block diagram of a hardware configuration of a speech recognition device according to one or more embodiments of the present disclosure;

fig. 2B is a block diagram of a software configuration of a speech recognition device according to one or more embodiments of the present disclosure;

FIG. 2C is a schematic illustration of an icon control interface display of an application program included in a speech recognition device according to one or more embodiments of the present disclosure;

fig. 3A is a schematic flow chart of a punctuation prediction method according to an embodiment of the present disclosure;

FIG. 3B is a schematic diagram illustrating an automatic speech recognition transcription result of an audio signal according to an embodiment of the disclosure;

fig. 3C is a schematic diagram illustrating a result obtained by performing audio truncation processing on an original audio according to an embodiment of the disclosure;

fig. 3D is a schematic diagram illustrating a result obtained by decoding an original audio according to an embodiment of the disclosure;

fig. 4 is a second schematic flowchart of a punctuation prediction method provided by the embodiment of the present disclosure;

fig. 5A is a third schematic flowchart of a punctuation prediction method according to an embodiment of the present disclosure;

fig. 5B is a schematic diagram illustrating another result obtained by performing audio truncation processing on an original audio according to an embodiment of the disclosure;

fig. 5C is a schematic diagram illustrating audio fusion modification after audio truncation and decoding are performed on an original audio according to an embodiment of the disclosure;

fig. 6 is a fourth schematic flowchart of a punctuation prediction method provided by the embodiment of the present disclosure;

fig. 7 is a schematic structural diagram of a tag prediction apparatus according to an embodiment of the present disclosure;

fig. 8 is a schematic structural diagram of a speech recognition device according to an embodiment of the present disclosure.

Detailed Description

In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.

The terms "first," "second," "third," and the like in the description and claims of this disclosure and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise noted. It is to be understood that the terms so used are interchangeable under appropriate circumstances.

The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.

With the rapid development of intelligent technology and the increasing popularization of intelligent devices, speech recognition technology plays an increasingly important role in a plurality of fields such as household appliances, automotive electronics, consumer electronics and the like. Speech recognition technology is a technology that allows a machine to convert a speech signal into corresponding text or commands through a process of recognition and understanding. In the scene of man-machine interaction, speech recognition plays a crucial role in natural language understanding and natural language generation. The correctness of the transcribed text is the basis and bottleneck for downstream tasks. There is still much exploratory work to do on how to improve the accuracy of speech recognition. And punctuation prediction based on the transcribed text is another important work for semantic understanding and interaction. The correct punctuation mark plays a great auxiliary role in the semantic understanding. Fuzzy or mismatched punctuation can even be misleading, thereby affecting the overall process of voice interaction.

Most of punctuation predictions in speech recognition currently output punctuation of a whole sentence based on context semantics of transcribed text. Fig. 1A is a schematic diagram of a punctuation prediction method in speech recognition in the prior art. As shown in fig. 1A, the main implementation flow is as follows: and processing the original audio to obtain a transcribed text, inputting the transcribed text into a punctuation prediction model, and outputting the probability of outputting punctuation at each position. However, in this method, when the transcribed text is too short and the available text information is insufficient, the probability of the occurrence of wrong or missed bid is very high by adopting the above method, and therefore, the accuracy of punctuation prediction is not high.

In view of the disadvantages in the foregoing method, in the embodiment of the disclosure, first, a probability of a punctuation label appearing after each character in a transcribed text and a probability of a non-punctuation label appearing after each character in the transcribed text are obtained based on a punctuation prediction model, first text information corresponding to an original audio and second text information corresponding to the original audio are obtained, and correction is performed according to the probabilities of the non-punctuation label appearing after each character in the transcribed text is performed according to the first text information corresponding to the original audio and the second text information corresponding to the original audio, so as to obtain prediction probability information of the corrected non-punctuation label. The transcribed text is a text sequence obtained by performing speech recognition processing on an original audio, the punctuation prediction model is used for outputting the probability of a non-punctuation label of the probability of the punctuation label appearing after each character in the transcribed text, the first text information is the text information obtained by performing audio truncation processing on the original audio, the first text information comprises speech characteristic information and non-speech characteristic information corresponding to the original audio, the second text information is the text information obtained by decoding the original audio, and the second text information comprises transcribed word characters and transcribed empty characters corresponding to the original audio. The method comprises the steps of correcting the probability of non-punctuation after each character in original audio through non-voice characteristic information corresponding to the original audio and character characters and null characters corresponding to the original audio, not only recognizing the context semantics of a transcribed text based on voice, but also predicting the probability of non-punctuation after each character by combining the first text information and the second text information, and further acquiring the prediction probability information of the corrected non-punctuation label.

For example, as shown in fig. 1B, fig. 1B is a schematic view of an application scenario of a semantic recognition punctuation prediction process of an intelligent device provided in the present disclosure, and in fig. 1B, the punctuation prediction process in speech recognition may be used in a speech interaction scenario between a user and an intelligent home, for example, the intelligent device in the scenario may be an intelligent device 100 (an example of fig. 1B is an intelligent refrigerator), an intelligent device 101 (an example of fig. 1B is an intelligent washing machine), an intelligent device 102 (an example of fig. 1B is an intelligent display device) and other intelligent devices having a speech recognition function, when the user wants to control the intelligent device in the scenario, the user needs to issue a speech instruction first, and when the intelligent device receives the speech instruction, perform speech recognition on the speech instruction, perform text punctuation prediction based on a transcribed text of the speech recognition, so that a subsequent intelligent device can conveniently display according to a result of the punctuation prediction, which is beneficial to improve readability of the transcribed text, and further improve efficiency of the speech interaction.

The punctuation prediction method provided by the embodiment of the present disclosure may be implemented based on a computer device, or a functional module or a functional entity in the computer device.

The computer device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not specifically limited in this disclosure.

Fig. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure. As shown in fig. 2A, the computer apparatus includes: at least one of a tuner 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphic processor, a RAM, a ROM, a first interface to an nth interface for input/output, among others. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The tuner demodulator 210 receives a broadcast television signal through wired or wireless reception, and demodulates an audio/video signal, such as an EPG audio/video data signal, from a plurality of wireless or wired broadcast television signals. The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The computer device may establish transmission and reception of control signals and data signals with a server or a local control device through the communicator 220. The detector 230 is used to collect signals of the external environment or interaction with the outside. The controller 250 and the modem 210 may be located in different separate devices, that is, the modem 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box.

In some embodiments, controller 250 controls the operation of the computer device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the computer device. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.

Fig. 2B is a schematic software configuration diagram of a computer device according to one or more embodiments of the present disclosure, and as shown in fig. 2B, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.

Fig. 2C is a schematic diagram illustrating an icon control interface display of an application program included in an intelligent device (mainly, an intelligent playback device, such as an intelligent television, a digital cinema system, or a video server), according to one or more embodiments of the present disclosure, as shown in fig. 2C, an application layer includes at least one application program that can display a corresponding icon control on a display, such as: the system comprises a live television application icon control, a video on demand VOD application icon control, a media center application icon control, an application center icon control, a game application icon control and the like. The live television application program can provide live television through different signal sources. A video on demand VOD application may provide video from different storage sources. Unlike live television applications, video on demand provides a video display from some storage source. The media center application program can provide various applications for playing multimedia contents. The application center can provide and store various applications.

To explain the present solution in more detail, the following will be described with reference to fig. 3A, 4, 5A, and 6 by way of example, and it is understood that, although the steps in the flowcharts of fig. 3A, 4, 5A, and 6 are shown in order as indicated by arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least some of the steps in fig. 3A, fig. 4, fig. 5A, and fig. 6 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least some of the sub-steps or stages of other steps. The punctuation prediction method provided in the embodiments of the present disclosure can be implemented.

As shown in fig. 3A, the method specifically includes the following steps:

s31, acquiring the probability of the punctuation mark after each character in the speech recognition transcription text is recognized based on the punctuation prediction model.

The voice recognition transcribed text is a text sequence obtained by performing voice recognition processing on the original audio. Specifically, the Speech Recognition transcription text is a text sequence obtained by processing an original audio signal based on an ASR (Automatic Speech Recognition) technology. Referring to fig. 3B, the input to speech recognition is typically a speech signal in the time domain, mathematically represented by a series of vectors, and the output is text.

The punctuation prediction model is used for outputting the probability of punctuation labels and the probability of non-punctuation labels after each character in the speech recognition and transcription text is recognized. It should be noted that the punctuation labels can be divided into four categories, which are: comma, period, question mark, exclamation mark. Non-punctuation labels, which can be understood as literal characters.

Specifically, the network structure of the punctuation prediction model is divided into two parts, namely a pre-training language model and a bidirectional LSTM (Long Short-Term Memory) model. For the prediction of text punctuation, the context information of the text is important, so a bidirectional LSTM model is added after a pre-training language model to acquire the global dependency information of the text. The pre-training model of the punctuation prediction model adopts a structure based on a transform, and the attention of a plurality of heads can better code the context information of the text.

The input of the punctuation prediction model is a transcription text of voice recognition, and the output is the probability of punctuation labels and the probability of non-punctuation labels appearing behind each character. For example, the input of the punctuation prediction model may be a transcribed text such as "how much weather today", "your good i is technical support of company a", and so on.

And correcting the punctuation probability by combining the acquired VAD audio information. Combining the audio signals such as pause and speech speed, and the like, adding weight to the positions where the punctuation points are not output in the punctuation prediction model and outputting the punctuation points.

And S32, acquiring first text information corresponding to the original audio and second text information corresponding to the original audio.

The first text information is obtained by performing audio truncation processing on an audio signal of an original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio.

Specifically, the audio interception processing is audio interception processing performed by VAD (Voice Activity Detection) technology. Voice activity detection, also known as voice endpoint detection, is commonly used to identify the presence and absence of speech in an audio signal. Typically, a VAD algorithm will divide an audio signal into voiced parts, unvoiced parts, and silent parts. For example, the output is 1 when speech is detected, and 0 otherwise. In this embodiment, the first text information is speech feature information and non-speech feature information decoded and output from the original audio signal. For example, a schematic diagram of information included in the first text information is shown in fig. 3C.

Wherein, the decoding process is based on the automatic speech recognition technology to perform decoding operation. In the process of decoding processing, if no character is output at the current moment, the character is output

Indicating, which may be silence, or other conditions. A schematic diagram of information included in the second text information is shown in fig. 3D. The voice decoding adopts a prefix-beam-search algorithm to search N paths, and finally obtains an optimal result. For example, the transcribed text is "that is as if you are no work", and the corresponding intermediate transcribed text may include, but is not limited to, the following ways: (1)

That

Then is turned on

Good taste

You are

Lower part

Class

Feeder

(2)

That

Then it is turned into

Good taste

You are

Lower part

Class

Non-woven fabric

(3)

That is

Good taste

You are

Lower part

Class

There is no. Finally, an optimal result is obtained: for example,

that

Then is turned on

Good taste

You are

Lower part

Class

Feeder

S33, correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and obtaining the corrected prediction probability information of the non-punctuation mark.

Specifically, after the probability that a punctuation label appears after each character in the transcribed text and the probability that a non-punctuation label appears after each character in the transcribed text are obtained, the probability that the non-punctuation label appears after each character in the transcribed text is corrected by combining the first text information and the second text information, so that corrected punctuation information is obtained. One way to implement this is to input the transcribed text into a punctuation prediction model, output the probability that five labels appear at each position, combine the obtained VAD audio information and the audio information transcribed into empty characters continuously by ASR, output punctuation in the transcribed text, add weight to the position where the prediction model does not output punctuation, and output punctuation.

Illustratively, raw audio captured in real-time is decoded by ASR into transcribed text, for example, the transcribed text is: "that is as if you are off duty". The transcribed text is input into the punctuation prediction model, and "0000000 question mark" is output. Processing the original audio to obtain first text information: "

That

Then is turned on

Good taste

You are

Lower part

Class

Non-woven fabric

", the second text information is: "11110001110001111000001000110001111000111111110000", and the result of the intervention of the above two audio information is finally obtained: "is that as good, do you go off work? ".

In some embodiments, the probability of punctuation marks occurring after transcription of each character in the text and the probability of non-punctuation marks occurring after transcription of each character in the text are inversely related.

Specifically, when the probability of a punctuation label appearing after a certain character of the text is transcribed is high, the probability of a non-punctuation label appearing after the character is correspondingly low; when the probability of a punctuation label appearing after a certain character of the transcribed text is low, the probability of a non-punctuation label appearing after the character is correspondingly high. Illustratively, assuming the transcribed text is "hello," the probability of a punctuation label appearing after the character "you" is low, while the probability of a non-punctuation label appearing is high.

In the embodiment of the disclosure, the probability of the occurrence of the non-punctuation mark after each character in the original audio is corrected through the non-speech characteristic information corresponding to the original audio and the character and null character corresponding to the original audio, not only the context semantic meaning of the transcribed text is identified based on the speech, but also the probability of the occurrence of the non-punctuation mark after each character is predicted by combining the two audio information, namely the first text information and the second text information, so as to obtain the prediction probability information of the corrected non-punctuation mark, and since the higher the probability of the occurrence of the non-punctuation mark after each character is, the lower the probability of the occurrence of the punctuation mark is, the accuracy and rationality of punctuation prediction can be improved by obtaining the prediction probability information of the corrected non-punctuation mark, meanwhile, the readability of the transcribed text is enhanced, and the efficiency of speech interaction is improved.

Fig. 4 is a schematic flowchart of another punctuation prediction method provided in the embodiment of the present disclosure. This embodiment is further expanded and optimized based on fig. 3A. Optionally, this embodiment mainly describes the process of step S33 (correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and obtaining the corrected prediction probability information of the non-punctuation mark).

And S431, acquiring a first normalized position weight, a second normalized position weight and a third normalized position weight.

The first normalized position weight is the normalized weight of the number of empty characters behind each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters behind each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight.

S432, obtaining the probability of the non-punctuation mark appearing behind each corrected character according to the probability of the non-punctuation mark appearing behind each character in the transcribed text, the third normalization position weight and the audio intervention adjusting parameter.

The audio intervention adjusting parameters can adjust the intervention degree of the audio information according to different scenes so as to realize optimization under a specific scene. For example, the value of the audio intervention adjusting parameter may be 0.5,0.2,0.3, etc., or may be other reasonable values, which are not limited herein.

After determining the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter, obtaining the probability of the non-punctuation mark appearing after each character in the corrected transcribed text includes, but is not limited to, the following ways.

In some embodiments, when the preset correction manner is the first preset correction manner, the probability of the occurrence of the non-punctuation mark after each character in the transcribed text, the weight of the third normalization position, and the audio intervention adjustment parameter are calculated according to the preset correction manner, and the probability of the occurrence of the non-punctuation mark after each character after correction is obtained, which may be implemented by the following manner:

and determining the probability of the non-punctuation label appearing after each corrected character according to the probability of the non-punctuation label appearing after each character in the transcribed text and the sum of the products of the audio intervention adjusting parameter and the third normalized position weight.

Specifically, the probability of the occurrence of the non-punctuation mark after each character after correction can be calculated by the following formula (1):

L _Oi ＝l _Oi +β*δ _i formula (1)

Wherein L is _Oi Indicating that each character is come after correctionProbability of existing non-punctuation labels,/ _Oi Representing the probability of a non-punctuation label appearing after each character output by the punctuation prediction model, beta representing an audio intervention regulation parameter, delta _i Representing a third normalized position weight.

Illustratively, when the probability of the occurrence of the non-punctuation mark after each character output by the punctuation prediction model is 0.2, the audio intervention adjustment parameter is 0.5, and the third normalized position weight is 0.4, the probability of the occurrence of the non-punctuation mark after each character after correction is L _Oi ＝0.2+0.5*0.4＝0.4。

In some embodiments, when the preset correction mode is the second preset correction mode, the probability of the occurrence of the non-punctuation mark after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter are calculated according to the preset correction mode, and the probability of the occurrence of the non-punctuation mark after each character after correction is obtained, which may be implemented in the following manner:

Specifically, the probability of the occurrence of the non-punctuation mark after each character after correction can be calculated by the following formula (2):

L _Oi ＝l _Oi *β*δ _i formula (2)

Wherein L is _Oi Indicates the probability of the occurrence of a non-punctuation mark after each character after correction, l _Oi Representing the probability of a non-punctuation label appearing after each character output by the punctuation prediction model, beta representing an audio intervention regulation parameter, delta _i Representing the third normalized position weight.

Illustratively, when the probability of the occurrence of a non-punctuation mark after each character output by the punctuation prediction model is 0.2, the audio intervention adjustment parameter is 0.5, and the third normalized position weight is 0.4, the probability of the occurrence of a non-punctuation mark after each character after correction is L _Oi ＝0.2*0.5*0.4＝0.04。

And S433, determining the probability of the punctuation mark at each position after correction according to the probability of the non-punctuation mark appearing after each corrected character.

Specifically, since the probability of the non-punctuation mark appearing after each character in the transcribed text is negatively correlated with the probability of the punctuation mark appearing after each character, that is, the probability of the non-punctuation mark appearing after each character in the pre-transcribed texts increases, the probability of the punctuation mark appearing after each character in the transcribed text decreases. If the probability of the non-punctuation mark appearing after interfering with the transcription of each character in the text is reduced, the probability of the punctuation mark appearing after the transcription of each character in the text is increased. Through the intervention of audio information, the third normalization position weight delta is reduced when more empty characters appear _i So that the probability of the punctuation mark appearing at each position after correction is increased. Increasing the third normalization position weight δ when fewer null characters occur _i So that the probability of the punctuation label appearing at each position after correction is reduced, and the intervention of audio information on punctuation output is realized.

Fig. 5A is a schematic flowchart of another punctuation prediction method provided in the embodiments of the present disclosure. The embodiment is further expanded and optimized on the basis of fig. 4. Alternatively, in this embodiment, when the correction method is an additive intervention method, the process of step S431 (obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight) will be described.

When the correction mode is an additive intervention mode, the first normalized position weight, the second normalized position weight and the third normalized position weight are obtained, and the method can be realized by the following modes:

s5311, acquiring a first normalization position weight according to the sum of the number of the hollow characters in the first text information, the average number of the hollow characters in the first text information and the number of the hollow characters behind each character in the first text information.

Specifically, the first normalized position weight is calculated by the following formula (3):

wherein alpha is _i A first normalized position weight is represented and,

representing the sum of the number of empty characters in the first text information,

representing an average number of empty characters in the first text information,

indicating the number of empty characters after each character in the first text information.

To realize transcription results in automatic speech recognition

More positions appear, the probability of punctuation is improved, and the automatic speech recognition transcription results can be fused in the following modes including but not limited to the following modes.

Illustratively, the transcribed text is: "that is as if you are off duty". The transcribed text has 7 Chinese characters and 7 positions, and because the beginning generally has no punctuation and does not need to be intervened, the first Chinese character is preceded by the first Chinese character

Are ignored. The position after the first character is marked as position No. 1, and then the position at each position is counted

Of the number of, e.g. optimal results of speech decoding "

That

Then it is turned into

Good taste

You are

Lower part

Class

Non-woven fabric

In

The position code can be expressed as: p _a =2162211. Transcribing the results

The sum of (1) is denoted as

Each position null character

Is counted as

For each position

And normalizing the number of the weights to obtain corresponding normalized weights. Namely, the above equation (3). In the present embodiment, it is preferred that,

the normalized effect is: eliminating the influence of the speaking speed; the non-punctuation mark probability that is favorable is controlled more accurately, and then the punctuation mark probability is interfered effectively.

S5312, acquiring a second normalization position weight according to the sum of the number of the hollow characters in the second text information, the average number of the hollow characters in the second text information and the number of the hollow characters behind each character in the second text information.

Specifically, the second normalized position weight is calculated by the following formula (4):

wherein epsilon _i Representing a second normalized position weight, θ _N Representing the sum of the number of empty characters, theta, of the second text information _avg Representing an average number of empty characters, θ, in the second text information _i Indicating the number of empty characters after each character in the second text information.

The VAD processing is based on the principle that when a sound occurs, the flag of the frame is set to "1", and when no sound occurs, the flag of the frame is set to "0". Based on the position of the punctuation, the speaker mostly pauses more. Statistics begin after the characters are rolled out from the ASR, and the positions labeled "0" for the VAD output are counted. The text is then aligned according to the ASR transcription.

To achieve the high occurrence probability of the punctuation at the positions where "0" appears in the VAD result, the VAD result can be fused in the following ways, including but not limited to. Referring to fig. 5B, in fig. 5B, there are more positions where "0" appears, and the probability of the punctuation is higher. The position vector obtained after counting 0 for each position is: p _v =3353334. The sum of "0" in the VAD result is 3+5+3+ 4=24, theta _N =24, so θ _avg ＝24/7，θ ₁ ＝3，θ ₂ ＝3，θ ₃ ＝5，θ ₄ ＝3，θ ₅ ＝3，θ ₆ ＝3，θ ₇ ＝4。

Because the VAD output has various interferences, and is not like the ideal result, for this reason, the ASR transcription text is combined to merge the same terms of the VAD tags, so as to realize the alignment of the ASR transcription result and the VAD tag result. Referring to FIG. 5C, the bold label "1" in FIG. 5C may be caused by an error in transcoding or other conditions, and the ASR transcription result is combined, when the position in the ASR transcription result is

Then, it is corrected to "0".

S5313, obtaining a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.

Specifically, the third normalized position weight is calculated by the following formula (5):

δ _i ＝α _i +ε _i formula (5)

Wherein alpha is _i Denotes the first normalized position weight, ε _i Representing a second normalized position weight.

In order to better fuse the transcription result of ASR and the VAD result information, the method of formula (5) is adopted to perform additive fusion on the weight of the audio information transcribed by ASR and the weight of the audio information transcribed by VAD to obtain delta _i 。

Fig. 6 is a schematic flowchart of another punctuation prediction method provided in the embodiment of the present disclosure. The embodiment is further expanded and optimized on the basis of fig. 4. Optionally, in this embodiment, when the correction mode is a multiplicative intervention mode, a process of step S431 (obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight) is described.

S6311, obtaining a first normalization position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and the scaling factor.

Specifically, the first normalized position weight is calculated by the following formula (6):

wherein alpha is _i Denotes a first normalized position weight, max _1N Representing the maximum number of empty characters after each of the N characters of the first text information,

indicating the number of empty characters after each character in the first text information and gamma indicating the scaling factor.

As an example, γ is a scaling factor, and an optimal value thereof may be 1.5, or may be other reasonable values, which is not limited herein.

S6312, obtaining a second normalization position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and the scaling factor.

Specifically, the first normalized position weight is calculated by the following formula (7):

wherein epsilon _i Denotes a second normalized position weight, max _2N Representing the maximum number of empty characters, theta, after each of the N characters of the second text information _i Indicating the number of empty characters after each character in the second text information and gamma indicating the scaling factor.

S6313, obtaining a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.

Wherein N is an integer of 1 or more.

Specifically, the third normalized position weight is calculated by the above formula (5):

δ _i ＝α _i +ε _i formula (5)

In the embodiment of the disclosure, the probability of non-punctuation after each character in the original audio is corrected through the non-speech characteristic information corresponding to the original audio and the character and null character corresponding to the original audio, the context semantics of the transcribed text are recognized based on the speech, and the probability of non-punctuation after each character is predicted by combining the two audio information, namely the first text information and the second text information, so as to obtain the prediction probability information of the corrected non-punctuation label.

Fig. 7 is a schematic structural diagram of a semantic understanding apparatus according to an embodiment of the present disclosure. The device is configured in intelligent equipment, and can realize the punctuation prediction method in any embodiment of the disclosure. The apparatus 700 specifically includes the following:

a punctuation probability obtaining module 710, configured to obtain, based on a punctuation prediction model, a probability that a punctuation label appears after each character in the transcribed text and a probability that a non-punctuation label appears; the transcription text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of punctuation labels and the probability of non-punctuation labels after each character in the transcription text appears;

an audio text obtaining module 720, configured to obtain first text information corresponding to an original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;

and a punctuation probability correction module 730, configured to correct, according to the first text information and the second text information, the probability that a non-punctuation label appears after each character in the transcribed text, and obtain corrected prediction probability information of the non-punctuation label.

As an optional implementation manner of the embodiment of the present disclosure, the punctuation probability correction module 730 includes:

As an optional implementation manner of the embodiment of the present disclosure, when the modification manner is an additive intervention manner, the weight obtaining unit is specifically configured to:

As an optional implementation manner of the embodiment of the present disclosure, when the modification manner is a multiplicative intervention manner, the weight obtaining unit is specifically configured to:

wherein N is an integer of 1 or more.

In the embodiment of the disclosure, the probability of a punctuation label appearing after each character in a transcribed text and the probability of a non-punctuation label are obtained based on a punctuation prediction model, first text information corresponding to an original audio and second text information corresponding to the original audio are obtained, and the probability of the non-punctuation label appearing after each character in the text is transcribed according to the first text information corresponding to the original audio and the second text information corresponding to the original audio, so as to obtain the corrected prediction probability information of the non-punctuation label. The transcribed text is a text sequence obtained by performing speech recognition processing on an original audio, the punctuation prediction model is used for outputting the probability of a non-punctuation label of the probability of the punctuation label appearing after each character in the transcribed text, the first text information is the text information obtained by performing audio truncation processing on the original audio, the first text information comprises speech characteristic information and non-speech characteristic information corresponding to the original audio, the second text information is the text information obtained by decoding the original audio, and the second text information comprises transcribed word characters and transcribed empty characters corresponding to the original audio. The probability of non-punctuation appearing behind each character in the original audio is corrected through the non-voice characteristic information corresponding to the original audio and the character characters and the null characters corresponding to the original audio, not only is the context semantic meaning of the transcribed text recognized based on voice, but also the probability of non-punctuation appearing behind each character is predicted by combining the two audio information of the first text information and the second text information, and then the prediction probability information of the corrected non-punctuation label is obtained.

The semantic understanding apparatus provided in the embodiments of the present disclosure may execute the punctuation prediction method provided in any embodiment of the present disclosure, has the corresponding functional modules and beneficial effects of the execution method, and is not described herein again to avoid repetition.

The disclosed embodiment provides a computer device, including: one or more processors; a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the punctuation prediction method of any one of the embodiments of the present disclosure.

Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 8, the computer apparatus includes a processor 810 and a storage 820; the number of the processors 810 in the computer device may be one or more, and one processor 810 is taken as an example in fig. 8; the processor 810 and the storage 820 in the computer device may be connected by a bus or other means, and fig. 8 illustrates the connection by a bus as an example.

The storage device 820 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the punctuation prediction method in the embodiments of the present disclosure. The processor 810 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the storage 820, that is, the punctuation prediction method provided by the embodiment of the present disclosure is implemented.

The storage device 820 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, storage 820 may include high speed random access storage, and may also include non-volatile storage, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 820 may further include storage remotely located from processor 810, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The computer device provided by the embodiment can be used for executing the punctuation prediction method provided by any embodiment, and has corresponding functions and beneficial effects.

The embodiments of the present disclosure further provide a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, implement each process executed by the method provided in any of the above embodiments, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.

The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the foregoing discussion in some embodiments is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method for punctuation prediction, the method comprising:

acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;

2. The method according to claim 1, wherein the correcting, according to the first text information and the second text information, the probability of occurrence of a non-punctuation mark after each character in the transcribed text to obtain corrected prediction probability information of the non-punctuation mark comprises:

obtaining a first normalized position weight, a second normalized position weight and a third normalized position weight;

and acquiring the prediction probability information of the corrected non-punctuation label according to the probability of the non-punctuation label appearing after each corrected character.

3. The method of claim 1, further comprising:

the probability of punctuation marks appearing after each character in the transcribed text is inversely related to the probability of non-punctuation marks appearing after each character in the transcribed text.

4. The method as claimed in claim 2, wherein said obtaining the modified probability of the occurrence of the non-punctuation mark after each character in the transcribed text according to the probability of the occurrence of the non-punctuation mark after each character, the third normalized position weight, and the audio intervention adjusting parameter comprises:

5. The method according to claim 4, wherein when the preset correction mode is a first preset correction mode, the calculating, according to the preset correction mode, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalization position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction comprises:

6. The method according to claim 4, wherein when the preset correction mode is a second preset correction mode, the calculating, according to the preset correction mode, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalization position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction comprises:

7. The method according to claim 4, wherein when the preset modification manner is a first preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:

8. The method according to claim 4, wherein when the preset modification manner is a second preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:

wherein N is an integer of 1 or more.

9. An apparatus for punctuation prediction, the apparatus comprising:

the audio text acquisition module is used for acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;

10. A speech recognition device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the punctuation prediction method as defined in any one of claims 1 to 8 when executing the computer program.