CN115662432A - Punctuation prediction method and device and voice recognition equipment - Google Patents


Info

Publication number
CN115662432A
Authority
CN
China
Prior art keywords
punctuation
probability
character
text
position weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211184502.5A
Other languages
Chinese (zh)
Inventor
马志强
岳文浩
张宝军
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202211184502.5A priority Critical patent/CN115662432A/en
Publication of CN115662432A publication Critical patent/CN115662432A/en
Pending legal-status Critical Current

Abstract

The disclosure relates to a punctuation prediction method, a punctuation prediction device, and voice recognition equipment. The method comprises: obtaining, based on a punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in a transcribed text; obtaining first text information and second text information corresponding to the original audio; and correcting the probability of a non-punctuation label appearing after each character in the transcribed text according to the first text information and the second text information, to obtain corrected prediction probability information for the non-punctuation labels. The transcribed text is a text sequence obtained by performing speech recognition processing on the original audio. The first text information is obtained by performing audio truncation processing on the original audio and comprises speech feature information and non-speech feature information corresponding to the original audio. The second text information is obtained by decoding the original audio and comprises the transcribed literal characters and transcribed empty characters corresponding to the original audio. By adopting the method, the accuracy of punctuation prediction can be improved.

Description

Punctuation prediction method and device and voice recognition equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a punctuation prediction method, a punctuation prediction device, and a speech recognition device.
Background
In human-machine interaction scenarios, speech recognition plays a crucial role in natural language understanding and natural language generation. Punctuation prediction on the transcribed text of speech recognition is important work for semantic understanding and interaction. Correct punctuation marks greatly assist in understanding the semantics, whereas ambiguous or mismatched punctuation can mislead the reader and thereby affect the whole speech interaction process.
In the related art, punctuation prediction in speech recognition mostly outputs punctuation for the whole sentence based on the context semantics of the transcribed text. However, when the transcribed text is too short and the available information is insufficient, the probability of wrong or missing punctuation is extremely high with this approach. Therefore, how to improve the accuracy of punctuation prediction in speech recognition is a problem that urgently needs to be solved.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present disclosure provides a punctuation prediction method, a punctuation prediction device, and voice recognition equipment, which correct the punctuation probabilities predicted from the transcribed text by using audio-derived information, thereby improving the accuracy of punctuation prediction even when the transcribed text is short and the available text information is insufficient.
In a first aspect, the present disclosure provides a punctuation prediction method, including:
acquiring, based on a punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text; the transcribed text is a text sequence obtained by performing voice recognition processing on the original audio, and the punctuation prediction model is used for outputting the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text;
acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio;
and correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and acquiring the corrected prediction probability information of the non-punctuation mark.
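To make the claimed inputs concrete, the sketch below counts the transcribed empty characters that follow each literal character in a decoded token sequence. The list-of-strings representation and the "-" placeholder for an empty character are illustrative assumptions, not the patent's actual data format.

```python
def count_empty_after(tokens, empty="-"):
    """Count empty characters immediately following each literal
    character. Empty characters before the first literal character
    are ignored. The token representation is assumed for illustration."""
    counts = []
    for token in tokens:
        if token == empty:
            if counts:          # attribute the empty char to the
                counts[-1] += 1 # preceding literal character
        else:
            counts.append(0)
    return counts

# Hypothetical decoder output; '-' marks a transcribed empty character.
second_text_info = ["今", "-", "天", "气", "-", "-"]
print(count_empty_after(second_text_info))  # [1, 0, 2]
```

Long pauses in the original audio show up as runs of empty characters, so these per-character counts carry the acoustic boundary cues that the plain transcribed text lacks.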
As an optional implementation manner of the embodiment of the present disclosure, the correcting, according to the first text information and the second text information, the probability of occurrence of a non-punctuation mark after each character in the transcribed text, and obtaining prediction probability information of the corrected non-punctuation mark includes:
acquiring a first normalized position weight, a second normalized position weight and a third normalized position weight;
the first normalized position weight is the normalized weight of the number of empty characters after each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters after each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;
acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;
and acquiring the prediction probability information of the corrected non-punctuation labels according to the probability of the non-punctuation labels appearing after each corrected character.
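The steps above can be combined into a minimal sketch. The normalization-by-sum scheme and all names are assumptions; the audio intervention adjusting parameter `alpha` is taken negative here so that long pauses reduce the non-punctuation probability, which is one plausible configuration.

```python
def correct_non_punct_probs(p_non_punct, empty_counts_1, empty_counts_2,
                            alpha=-0.4):
    """Correct per-character non-punctuation probabilities using the
    empty-character counts from the first and second text information.
    Normalizing each count list by its sum is an assumed scheme."""
    def normalize(counts):
        total = sum(counts) or 1          # guard against all-zero counts
        return [c / total for c in counts]
    w1 = normalize(empty_counts_1)        # first normalized position weight
    w2 = normalize(empty_counts_2)        # second normalized position weight
    w3 = [a + b for a, b in zip(w1, w2)]  # third = sum of the first two
    # First preset correction mode: p' = p + alpha * w3
    return [p + alpha * w for p, w in zip(p_non_punct, w3)]

corrected = correct_non_punct_probs([0.8, 0.9], [1, 1], [0, 2])
print(corrected)  # approximately [0.6, 0.3]
```

With `alpha < 0`, the character followed by the most silence (here the second one) receives the largest downward correction, i.e. punctuation becomes more likely after it.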
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
the probability of a punctuation label appearing after each character in the transcribed text and the probability of a non-punctuation label appearing after that character are negatively correlated.
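One natural reading of this negative correlation is that the two probabilities are complementary at each position. The complementary form below is an assumption, since the claim only states that the correlation is negative:

```python
def punct_label_prob(non_punct_prob: float) -> float:
    # Assumed complementary reading: the punctuation-label probability
    # rises exactly as the non-punctuation probability falls.
    return 1.0 - non_punct_prob

print(punct_label_prob(0.3))  # 0.7
```

Under this reading, lowering the corrected non-punctuation probability directly raises the probability that a punctuation mark is emitted.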
As an optional implementation manner of the embodiment of the present disclosure, the obtaining, according to the probability of occurrence of a non-punctuation mark after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, the probability of occurrence of a non-punctuation mark after each character after correction includes:
calculating the probability of non-punctuation marks appearing after each character in the transcribed text, the third normalization position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of non-punctuation marks appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.
As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the calculating, according to the preset correction manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction includes:
and determining the corrected probability of the non-punctuation label appearing after each character as the sum of the probability of the non-punctuation label appearing after that character in the transcribed text and the product of the audio intervention adjusting parameter and the third normalized position weight.
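Read literally, the first preset correction mode is an additive adjustment; the sketch below shows it for a single character (variable names are assumptions):

```python
def additive_correction(p_non_punct: float, alpha: float, w3: float) -> float:
    """First preset correction mode: corrected probability is the sum
    of the model's non-punctuation probability and the product of the
    audio intervention adjusting parameter and the third weight."""
    return p_non_punct + alpha * w3

print(additive_correction(0.5, 0.5, 0.2))  # 0.6
```

Depending on the sign of `alpha`, the result may need clipping back into [0, 1] before being used as a probability.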
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the calculating, according to the preset modification manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after modification includes:
and determining the corrected probability of the non-punctuation label appearing after each character as the product of the probability of the non-punctuation label appearing after that character in the transcribed text, the audio intervention adjusting parameter, and the third normalized position weight.
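The second preset correction mode is multiplicative; a single-character sketch (names are assumptions):

```python
def multiplicative_correction(p_non_punct: float, alpha: float,
                              w3: float) -> float:
    """Second preset correction mode: corrected probability is the
    product of the model's non-punctuation probability, the audio
    intervention adjusting parameter, and the third weight."""
    return p_non_punct * alpha * w3

print(multiplicative_correction(0.8, 0.5, 0.25))  # 0.1
```

Unlike the additive mode, the multiplicative mode scales the model probability, so a small third weight (little surrounding silence) suppresses the non-punctuation probability proportionally.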
As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the obtaining a first normalized position weight, a second normalized position weight, and a third normalized position weight includes:
acquiring the first normalized position weight according to the sum of the numbers of empty characters in the first text information, the average number of empty characters in the first text information, and the number of empty characters after each character in the first text information;
acquiring the second normalized position weight according to the sum of the numbers of empty characters in the second text information, the average number of empty characters in the second text information, and the number of empty characters after each character in the second text information;
and acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
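The claim names three quantities — the sum of the empty-character counts, their average, and the per-character count — without giving the exact formula. One plausible normalization combining them is sketched below; the choice of denominator is an assumption:

```python
def normalized_weights_v1(empty_counts):
    """First-mode position weights: each per-character empty-character
    count is divided by the sum plus the average of all counts
    (assumed formula combining the three claimed quantities)."""
    total = sum(empty_counts)
    average = total / len(empty_counts)
    denominator = (total + average) or 1.0  # guard against all zeros
    return [count / denominator for count in empty_counts]

print(normalized_weights_v1([2, 0, 2]))  # approximately [0.375, 0.0, 0.375]
```

Whatever the exact denominator, the intent is that characters followed by more empty characters receive proportionally larger weights, bounded to a comparable scale across utterances.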
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:
acquiring a first normalized position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;
acquiring a second normalized position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and a scaling factor;
acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight;
wherein N is an integer of 1 or more.
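The second weight mode normalizes by the maximum count over the N characters and applies a scaling factor. The exact composition of these quantities is again an assumption:

```python
def normalized_weights_v2(empty_counts, scale=1.0):
    """Second-mode position weights: each per-character empty-character
    count is divided by the maximum count over the N characters and
    multiplied by a scaling factor (assumed formula)."""
    maximum = max(empty_counts) or 1  # guard against all-zero counts
    return [scale * count / maximum for count in empty_counts]

print(normalized_weights_v2([2, 4, 1], scale=0.5))  # [0.25, 0.5, 0.125]
```

Max-normalization keeps the largest weight fixed at `scale` regardless of utterance length, which makes the audio intervention adjusting parameter easier to tune than with sum-based normalization.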
In a second aspect, the present disclosure provides a punctuation prediction apparatus, the apparatus including:
the punctuation probability acquisition module is used for acquiring, based on the punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text; the transcribed text is a text sequence obtained by performing voice recognition processing on the original audio, and the punctuation prediction model is used for outputting the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text;
the audio text acquisition module is used for acquiring first text information corresponding to an original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio;
and the punctuation probability correction module is used for correcting the probability of the occurrence of non-punctuation labels after each character in the transcribed text according to the first text information and the second text information and acquiring the corrected prediction probability information of the non-punctuation labels.
As an optional implementation manner of the embodiment of the present disclosure, the punctuation probability correction module includes:
a weight acquisition unit configured to acquire a first normalized position weight, a second normalized position weight, and a third normalized position weight;
the first normalized position weight is the normalized weight of the number of empty characters behind each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters behind each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;
the character probability correction unit is used for acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;
and the punctuation probability correction unit is used for acquiring the prediction probability information of the corrected non-punctuation label according to the probability of the non-punctuation label appearing after each corrected character.
As an optional implementation manner of the embodiment of the present disclosure, the probability of a punctuation label appearing after each character in the transcribed text and the probability of a non-punctuation label appearing after that character are negatively correlated.
As an optional implementation manner of the embodiment of the present disclosure, the character probability correction unit is specifically configured to:
calculating the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of the non-punctuation mark appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.
As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the calculating, according to the preset correction manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction includes:
and determining the corrected probability of the non-punctuation label appearing after each character as the sum of the probability of the non-punctuation label appearing after that character in the transcribed text and the product of the audio intervention adjusting parameter and the third normalized position weight.
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the calculating, according to the preset modification manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after modification includes:
and determining the probability of the non-punctuation label appearing after each corrected character according to the product of the probability of the non-punctuation label appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a first preset modification manner, the weight obtaining unit is specifically configured to:
acquiring the first normalized position weight according to the sum of the numbers of empty characters in the first text information, the average number of empty characters in the first text information, and the number of empty characters after each character in the first text information;
acquiring the second normalized position weight according to the sum of the numbers of empty characters in the second text information, the average number of empty characters in the second text information, and the number of empty characters after each character in the second text information;
and acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the weight obtaining unit is specifically configured to:
acquiring a first normalization position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;
acquiring a second normalized position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and a scaling factor;
acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight;
wherein N is an integer of 1 or more.
In a third aspect, a speech recognition device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the punctuation prediction method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the punctuation prediction method of the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages: the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text are obtained based on a punctuation prediction model; first text information and second text information corresponding to the original audio are obtained; and the probability of a non-punctuation label appearing after each character in the transcribed text is corrected according to the first text information and the second text information, so as to obtain corrected prediction probability information for the non-punctuation labels. The transcribed text is a text sequence obtained by performing speech recognition processing on the original audio, and the punctuation prediction model is used for outputting the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text. The first text information is obtained by performing audio truncation processing on the original audio and comprises speech feature information and non-speech feature information corresponding to the original audio; the second text information is obtained by decoding the original audio and comprises the transcribed literal characters and transcribed empty characters corresponding to the original audio.
In this way, the probability of a non-punctuation label appearing after each character is corrected using the non-speech feature information corresponding to the original audio and the literal and empty characters corresponding to the original audio. The prediction thus relies not only on the context semantics of the speech-recognized transcribed text, but also on the two kinds of audio information in the first text information and the second text information, yielding corrected prediction probability information for the non-punctuation labels.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1A is a schematic diagram of a punctuation prediction process in the prior art;
fig. 1B is a schematic diagram of an application scenario of a punctuation prediction process in an embodiment of the present disclosure;
fig. 2A is a block diagram of a hardware configuration of a speech recognition device according to one or more embodiments of the present disclosure;
fig. 2B is a block diagram of a software configuration of a speech recognition device according to one or more embodiments of the present disclosure;
FIG. 2C is a schematic illustration of an icon control interface display of an application program included in a speech recognition device according to one or more embodiments of the present disclosure;
fig. 3A is a schematic flow chart of a punctuation prediction method according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram illustrating an automatic speech recognition transcription result of an audio signal according to an embodiment of the disclosure;
fig. 3C is a schematic diagram illustrating a result obtained by performing audio truncation processing on an original audio according to an embodiment of the disclosure;
fig. 3D is a schematic diagram illustrating a result obtained by decoding an original audio according to an embodiment of the disclosure;
fig. 4 is a second schematic flowchart of a punctuation prediction method provided by the embodiment of the present disclosure;
fig. 5A is a third schematic flowchart of a punctuation prediction method according to an embodiment of the present disclosure;
fig. 5B is a schematic diagram illustrating another result obtained by performing audio truncation processing on an original audio according to an embodiment of the disclosure;
fig. 5C is a schematic diagram illustrating audio fusion modification after audio truncation and decoding are performed on an original audio according to an embodiment of the disclosure;
fig. 6 is a fourth schematic flowchart of a punctuation prediction method provided by the embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a punctuation prediction apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a speech recognition device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The terms "first," "second," "third," and the like in the description and claims of this disclosure and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise noted. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
With the rapid development of intelligent technology and the increasing popularization of intelligent devices, speech recognition technology plays an increasingly important role in a plurality of fields such as household appliances, automotive electronics, consumer electronics and the like. Speech recognition technology is a technology that allows a machine to convert a speech signal into corresponding text or commands through a process of recognition and understanding. In the scene of man-machine interaction, speech recognition plays a crucial role in natural language understanding and natural language generation. The correctness of the transcribed text is the basis and bottleneck for downstream tasks. There is still much exploratory work to do on how to improve the accuracy of speech recognition. And punctuation prediction based on the transcribed text is another important work for semantic understanding and interaction. The correct punctuation mark plays a great auxiliary role in the semantic understanding. Fuzzy or mismatched punctuation can even be misleading, thereby affecting the overall process of voice interaction.
At present, most punctuation prediction in speech recognition outputs punctuation for the whole sentence based on the context semantics of the transcribed text. Fig. 1A is a schematic diagram of a punctuation prediction method in speech recognition in the prior art. As shown in fig. 1A, the main flow is as follows: the original audio is processed to obtain a transcribed text, the transcribed text is input into a punctuation prediction model, and the probability of outputting punctuation at each position is obtained. However, when the transcribed text is too short and the available text information is insufficient, the probability of wrong or missing punctuation in this method is very high, and therefore the accuracy of punctuation prediction is not high.
In view of the disadvantages of the foregoing method, in the embodiments of the present disclosure, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text are first obtained based on a punctuation prediction model; first text information and second text information corresponding to the original audio are obtained; and the probability of a non-punctuation label appearing after each character in the transcribed text is corrected according to the first text information and the second text information, so as to obtain corrected prediction probability information for the non-punctuation labels. The transcribed text is a text sequence obtained by performing speech recognition processing on the original audio, and the punctuation prediction model is used for outputting the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text. The first text information is obtained by performing audio truncation processing on the original audio and comprises speech feature information and non-speech feature information corresponding to the original audio; the second text information is obtained by decoding the original audio and comprises the transcribed literal characters and transcribed empty characters corresponding to the original audio.
The probability of a non-punctuation label appearing after each character is corrected through the non-speech feature information corresponding to the original audio and the literal and empty characters corresponding to the original audio. The method therefore not only recognizes the context semantics of the transcribed text based on speech, but also predicts the probability of a non-punctuation label appearing after each character by combining the first text information and the second text information, and then obtains the corrected prediction probability information of the non-punctuation labels.
For example, as shown in fig. 1B, which is a schematic view of an application scenario of the punctuation prediction process of an intelligent device provided in the present disclosure, the punctuation prediction process in speech recognition may be used in a speech interaction scenario between a user and a smart home. The intelligent devices in the scenario may be, for example, an intelligent device 100 (illustrated in fig. 1B as a smart refrigerator), an intelligent device 101 (illustrated in fig. 1B as a smart washing machine), an intelligent device 102 (illustrated in fig. 1B as a smart display device), and other intelligent devices having a speech recognition function. When the user wants to control an intelligent device in the scenario, the user first issues a speech instruction. When the intelligent device receives the speech instruction, it performs speech recognition on the instruction and performs text punctuation prediction based on the transcribed text of the speech recognition, so that the intelligent device can subsequently display the text according to the result of the punctuation prediction. This is beneficial to improving the readability of the transcribed text and further improving the efficiency of the speech interaction.
The punctuation prediction method provided by the embodiment of the present disclosure may be implemented based on a computer device, or a functional module or a functional entity in the computer device.
The computer device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not specifically limited in this disclosure.
Fig. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure. As shown in fig. 2A, the computer device includes at least one of: a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller 250 includes, among others, a central processing unit, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th input/output interfaces. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The tuner demodulator 210 receives broadcast television signals through wired or wireless reception and demodulates audio/video signals, such as EPG audio/video data signals, from a plurality of wireless or wired broadcast television signals. The communicator 220 is a component for communicating with an external device or a server according to various types of communication protocols. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The computer device may establish transmission and reception of control signals and data signals with a server or a local control device through the communicator 220. The detector 230 is used to collect signals of the external environment or signals of interaction with the outside. The controller 250 and the tuner demodulator 210 may be located in different separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, controller 250 controls the operation of the computer device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the computer device. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
Fig. 2B is a schematic software configuration diagram of a computer device according to one or more embodiments of the present disclosure, and as shown in fig. 2B, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.
Fig. 2C is a schematic diagram illustrating an icon control interface display of an application program included in an intelligent device (mainly, an intelligent playback device, such as an intelligent television, a digital cinema system, or a video server), according to one or more embodiments of the present disclosure, as shown in fig. 2C, an application layer includes at least one application program that can display a corresponding icon control on a display, such as: the system comprises a live television application icon control, a video on demand VOD application icon control, a media center application icon control, an application center icon control, a game application icon control and the like. The live television application program can provide live television through different signal sources. A video on demand VOD application may provide video from different storage sources. Unlike live television applications, video on demand provides a video display from some storage source. The media center application program can provide various applications for playing multimedia contents. The application center can provide and store various applications.
To explain the present solution in more detail, the following description is given with reference to fig. 3A, 4, 5A, and 6 by way of example. It should be understood that, although the steps in the flowcharts of fig. 3A, 4, 5A, and 6 are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, the steps may be performed in other orders. Moreover, at least some of the steps in fig. 3A, 4, 5A, and 6 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps. The punctuation prediction method provided in the embodiments of the present disclosure is described below.
As shown in fig. 3A, the method specifically includes the following steps:
S31, acquiring, based on a punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in a speech recognition transcribed text.
The speech recognition transcribed text is a text sequence obtained by performing speech recognition processing on the original audio. Specifically, it is a text sequence obtained by processing the original audio signal based on ASR (Automatic Speech Recognition) technology. Referring to fig. 3B, the input to speech recognition is typically a time-domain speech signal, mathematically represented by a series of vectors, and the output is text.
The punctuation prediction model is used to output the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the speech recognition transcribed text. It should be noted that punctuation labels can be divided into four categories: comma, period, question mark, and exclamation mark. A non-punctuation label can be understood as a literal character.
Specifically, the network structure of the punctuation prediction model is divided into two parts: a pre-trained language model and a bidirectional LSTM (Long Short-Term Memory) model. For the prediction of text punctuation, the context information of the text is important, so a bidirectional LSTM model is added after the pre-trained language model to acquire the global dependency information of the text. The pre-trained model of the punctuation prediction model adopts a Transformer-based structure, whose multi-head attention can better encode the context information of the text.
The input of the punctuation prediction model is a transcription text of voice recognition, and the output is the probability of punctuation labels and the probability of non-punctuation labels appearing behind each character. For example, the input of the punctuation prediction model may be a transcribed text such as "how much weather today", "your good i is technical support of company a", and so on.
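As an illustrative sketch (not the patent's implementation), the five-way per-position output described above can be turned into punctuated text by taking the highest-probability label after each character. The label set follows the four punctuation categories plus the non-punctuation label "O" described above; the example characters and probability values are assumed for illustration.

```python
# Illustrative sketch: converting per-position label probabilities into
# punctuated text. "O" is the non-punctuation label; the other four entries
# are the punctuation categories named in the text.
LABELS = ["O", ",", ".", "?", "!"]

def apply_punctuation(chars, position_probs):
    """chars: list of characters; position_probs: one 5-way distribution
    (ordered as LABELS) for the position after each character."""
    out = []
    for ch, probs in zip(chars, position_probs):
        out.append(ch)
        best = LABELS[max(range(len(LABELS)), key=lambda k: probs[k])]
        if best != "O":  # only emit a symbol for punctuation labels
            out.append(best)
    return "".join(out)

chars = list("hiwork")  # stand-in for a 6-character transcript
probs = [
    [0.9, 0.05, 0.02, 0.02, 0.01],  # after char 1: non-punctuation wins
    [0.2, 0.6, 0.1, 0.05, 0.05],    # after char 2: comma wins
    [0.9, 0.05, 0.02, 0.02, 0.01],
    [0.9, 0.05, 0.02, 0.02, 0.01],
    [0.9, 0.05, 0.02, 0.02, 0.01],
    [0.1, 0.1, 0.1, 0.65, 0.05],    # after last char: question mark wins
]
print(apply_punctuation(chars, probs))  # hi,work?
```

The arg-max decoding shown here is the simplest possible read-out of such a model; the correction described in the following steps adjusts the non-punctuation probability before this kind of decision is made.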
The punctuation probability is then corrected by combining the acquired VAD audio information. By combining audio cues such as pauses and speech rate, weights are applied at the positions where the punctuation prediction model does not output punctuation, so that punctuation can be output there.
And S32, acquiring first text information corresponding to the original audio and second text information corresponding to the original audio.
The first text information is obtained by performing audio truncation processing on an audio signal of an original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed text characters and transcribed empty characters corresponding to the original audio.
Specifically, the audio truncation processing is performed by VAD (Voice Activity Detection) technology. Voice activity detection, also known as speech endpoint detection, is commonly used to identify the presence and absence of speech in an audio signal. Typically, a VAD algorithm divides an audio signal into voiced parts, unvoiced parts, and silent parts. For example, the output is 1 when speech is detected and 0 otherwise. In this embodiment, the first text information is the speech feature information and non-speech feature information decoded and output from the original audio signal. A schematic diagram of the information included in the first text information is shown in fig. 3C.
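A minimal sketch of the 1/0 framing behaviour described above, using a simple frame-energy threshold (real VAD algorithms are considerably more elaborate; the frame length and threshold here are assumed for illustration):

```python
def simple_vad(samples, frame_len=4, threshold=0.01):
    """Label each frame 1 (speech) or 0 (non-speech) by mean frame energy.
    A toy stand-in for a real VAD algorithm."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(1 if energy > threshold else 0)
    return flags

# Silence, then a loud burst, then near-silence.
signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.4, 0.3, -0.5, 0.0, 0.01, -0.01, 0.0]
print(simple_vad(signal))  # [0, 1, 0]
```

The resulting 1/0 sequence is exactly the kind of flag stream that is later aligned with the ASR transcription and counted per position.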
The decoding processing is a decoding operation performed based on the automatic speech recognition technology. During decoding, if no character is output at the current moment, an empty character, denoted ∅ below, is output; this may correspond to silence or other conditions. A schematic diagram of the information included in the second text information is shown in fig. 3D. Speech decoding adopts a prefix beam search algorithm to search N paths and finally obtains an optimal result. For example, if the transcribed text is "那就好你下班了" ("that is good, you are off work"), the corresponding intermediate transcriptions may include, but are not limited to, the following paths: (1) ∅ 那 ∅ 就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 ∅ 了 ∅; (2) ∅ 那 ∅ 就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 ∅ 不 ∅; (3) ∅ 那就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 没. Finally, an optimal result is obtained, for example: ∅ 那 ∅ 就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 ∅ 了 ∅.
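The decoded paths interleave empty characters (∅) with literal characters, in the style of CTC decoding. A sketch of collapsing such a path into the final text (assuming standard CTC-style collapsing — merge consecutive repeats, then drop blanks — which the patent does not spell out):

```python
BLANK = "∅"  # the empty character output when no character is emitted

def collapse_path(path):
    """Collapse a decoded path: keep a symbol only when it differs from the
    previous one and is not the blank (standard CTC-style collapsing,
    assumed here for illustration)."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

path = list("∅那∅就∅好∅你∅下∅班∅了∅")
print(collapse_path(path))  # 那就好你下班了
```

Note that the blank positions discarded here are precisely the information the present method keeps: the number of ∅ symbols between characters carries the pause cues used for the punctuation correction.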
S33, correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and obtaining the corrected prediction probability information of the non-punctuation mark.
Specifically, after the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text are obtained, the probability of a non-punctuation label appearing after each character in the transcribed text is corrected by combining the first text information and the second text information, so as to obtain corrected punctuation information. In one implementation, the transcribed text is input into the punctuation prediction model, which outputs the probability of each of the five labels at each position; then, by combining the acquired VAD audio information and the audio information in which ASR continuously transcribes empty characters, weights are applied at the positions where the prediction model does not output punctuation, and punctuation in the transcribed text is output.
Illustratively, raw audio captured in real time is decoded by ASR into a transcribed text, for example: "那就好你下班了" ("that is good, you are off work"). The transcribed text is input into the punctuation prediction model, which outputs "0000000 question mark". The original audio is processed to obtain the first text information (∅ denotes a transcribed empty character): "∅ 那 ∅ 就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 ∅ 了 ∅"; the second text information is: "11110001110001111000001000110001111000111111110000". The result obtained after the intervention of the above two kinds of audio information is finally: "那就好，你下班了？" ("that is good, are you off work?").
In some embodiments, the probability of a punctuation label appearing after each character in the transcribed text and the probability of a non-punctuation label appearing after that character are negatively correlated.
Specifically, when the probability of a punctuation label appearing after a certain character of the text is transcribed is high, the probability of a non-punctuation label appearing after the character is correspondingly low; when the probability of a punctuation label appearing after a certain character of the transcribed text is low, the probability of a non-punctuation label appearing after the character is correspondingly high. Illustratively, assuming the transcribed text is "hello," the probability of a punctuation label appearing after the character "you" is low, while the probability of a non-punctuation label appearing is high.
In the embodiments of the disclosure, the probability of a non-punctuation label appearing after each character is corrected through the non-speech feature information corresponding to the original audio and the literal characters and empty characters corresponding to the original audio. The context semantics of the transcribed text are not only recognized based on the speech; the probability of a non-punctuation label appearing after each character is also predicted by combining the two kinds of audio information, namely the first text information and the second text information, so as to obtain the corrected prediction probability information of the non-punctuation label. Since a higher probability of a non-punctuation label appearing after a character implies a lower probability of a punctuation label, obtaining the corrected prediction probability information of the non-punctuation label can improve the accuracy and rationality of punctuation prediction, while enhancing the readability of the transcribed text and improving the efficiency of speech interaction.
Fig. 4 is a schematic flowchart of another punctuation prediction method provided in the embodiment of the present disclosure. This embodiment is further expanded and optimized based on fig. 3A. Optionally, this embodiment mainly describes the process of step S33 (correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and obtaining the corrected prediction probability information of the non-punctuation mark).
And S431, acquiring a first normalized position weight, a second normalized position weight and a third normalized position weight.
The first normalized position weight is the normalized weight of the number of empty characters behind each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters behind each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight.
S432, obtaining the probability of the non-punctuation mark appearing behind each corrected character according to the probability of the non-punctuation mark appearing behind each character in the transcribed text, the third normalization position weight and the audio intervention adjusting parameter.
The audio intervention adjustment parameter can adjust the degree of intervention of the audio information according to different scenarios, so as to achieve optimization in a specific scenario. For example, the value of the audio intervention adjustment parameter may be 0.5, 0.2, 0.3, etc., or other reasonable values, which are not limited herein.
After determining the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter, obtaining the probability of the non-punctuation mark appearing after each character in the corrected transcribed text includes, but is not limited to, the following ways.
In some embodiments, when the preset correction manner is the first preset correction manner, the probability of the occurrence of the non-punctuation mark after each character in the transcribed text, the weight of the third normalization position, and the audio intervention adjustment parameter are calculated according to the preset correction manner, and the probability of the occurrence of the non-punctuation mark after each character after correction is obtained, which may be implemented by the following manner:
and determining the probability of the non-punctuation label appearing after each corrected character according to the probability of the non-punctuation label appearing after each character in the transcribed text and the sum of the products of the audio intervention adjusting parameter and the third normalized position weight.
Specifically, the probability of a non-punctuation label appearing after each character after correction can be calculated by the following formula (1):

L_Oi = l_Oi + β * δ_i    formula (1)

wherein L_Oi represents the probability of a non-punctuation label appearing after each character after correction, l_Oi represents the probability of a non-punctuation label appearing after each character output by the punctuation prediction model, β represents the audio intervention adjustment parameter, and δ_i represents the third normalized position weight.

Illustratively, when the probability of a non-punctuation label appearing after a character output by the punctuation prediction model is 0.2, the audio intervention adjustment parameter is 0.5, and the third normalized position weight is 0.4, the corrected probability of a non-punctuation label appearing after the character is L_Oi = 0.2 + 0.5 * 0.4 = 0.4.
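Formula (1) translates directly into code; the numbers below reproduce the worked example (0.2, 0.5, 0.4):

```python
def correct_nonpunct_additive(l_oi, beta, delta_i):
    """Formula (1): corrected non-punctuation probability
    L_Oi = l_Oi + beta * delta_i."""
    return l_oi + beta * delta_i

print(correct_nonpunct_additive(0.2, 0.5, 0.4))  # 0.4
```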
In some embodiments, when the preset correction mode is the second preset correction mode, the probability of the occurrence of the non-punctuation mark after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter are calculated according to the preset correction mode, and the probability of the occurrence of the non-punctuation mark after each character after correction is obtained, which may be implemented in the following manner:
and determining the probability of the non-punctuation mark appearing after each corrected character according to the product of the probability of the non-punctuation mark appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.
Specifically, the probability of the occurrence of the non-punctuation mark after each character after correction can be calculated by the following formula (2):
L_Oi = l_Oi * β * δ_i    formula (2)

wherein L_Oi represents the probability of a non-punctuation label appearing after each character after correction, l_Oi represents the probability of a non-punctuation label appearing after each character output by the punctuation prediction model, β represents the audio intervention adjustment parameter, and δ_i represents the third normalized position weight.

Illustratively, when the probability of a non-punctuation label appearing after a character output by the punctuation prediction model is 0.2, the audio intervention adjustment parameter is 0.5, and the third normalized position weight is 0.4, the corrected probability of a non-punctuation label appearing after the character is L_Oi = 0.2 * 0.5 * 0.4 = 0.04.
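Formula (2) can likewise be sketched in code, reproducing the worked example (0.2, 0.5, 0.4):

```python
def correct_nonpunct_multiplicative(l_oi, beta, delta_i):
    """Formula (2): corrected non-punctuation probability
    L_Oi = l_Oi * beta * delta_i."""
    return l_oi * beta * delta_i

print(round(correct_nonpunct_multiplicative(0.2, 0.5, 0.4), 2))  # 0.04
```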
And S433, determining the probability of the punctuation mark at each position after correction according to the probability of the non-punctuation mark appearing after each corrected character.
Specifically, the probability of a non-punctuation label appearing after each character in the transcribed text is negatively correlated with the probability of a punctuation label appearing after that character: as the probability of the non-punctuation label increases, the probability of the punctuation label decreases. If the intervention reduces the probability of a non-punctuation label appearing after a character in the transcribed text, the probability of a punctuation label appearing after that character increases. Through the intervention of the audio information, the third normalized position weight δ_i is reduced at positions where more empty characters appear, so that the corrected probability of a punctuation label appearing at those positions is increased; the third normalized position weight δ_i is increased at positions where fewer empty characters appear, so that the corrected probability of a punctuation label appearing at those positions is reduced. In this way, the intervention of the audio information in the punctuation output is realized.
Fig. 5A is a schematic flowchart of another punctuation prediction method provided in the embodiments of the present disclosure. The embodiment is further expanded and optimized on the basis of fig. 4. Alternatively, in this embodiment, when the correction method is an additive intervention method, the process of step S431 (obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight) will be described.
When the correction mode is an additive intervention mode, the first normalized position weight, the second normalized position weight and the third normalized position weight are obtained, and the method can be realized by the following modes:
S5311, acquiring the first normalized position weight according to the sum of the numbers of empty characters in the first text information, the average number of empty characters in the first text information, and the number of empty characters after each character in the first text information.
Specifically, the first normalized position weight is calculated by the following formula (3):
α_i = (φ_avg − φ_i) / φ_N    formula (3)

wherein α_i represents the first normalized position weight, φ_N represents the sum of the numbers of empty characters in the first text information, φ_avg represents the average number of empty characters in the first text information, and φ_i represents the number of empty characters after the i-th character in the first text information.
In order to realize that the probability of punctuation is improved at the positions where the empty character ∅ appears more often in the automatic speech recognition transcription result, the transcription result may be fused in ways including, but not limited to, the following.

Illustratively, the transcribed text is: "那就好你下班了" ("that is good, you are off work"). The transcribed text has 7 Chinese characters and thus 7 positions. Because the beginning of a sentence generally has no punctuation and needs no intervention, the ∅ symbols before the first character are ignored. The position after the first character is marked as position No. 1, and the number of ∅ symbols at each position is then counted. For example, for the optimal result of speech decoding "∅ 那 ∅ 就 ∅ 好 ∅ 你 ∅ 下 ∅ 班 ∅ 了 ∅", the position code can be expressed as: P_a = 2162211. The sum of ∅ symbols in the transcription result is denoted as φ_N, and the number of ∅ symbols at each position is denoted as φ_i. The number of ∅ symbols at each position is normalized to obtain the corresponding normalized weight, namely the above formula (3). In this embodiment, φ_N = 15 and φ_avg = 15/7. The effect of the normalization is to eliminate the influence of speaking speed and to control the non-punctuation label probability more accurately, thereby effectively intervening in the punctuation label probability.
S5312, acquiring the second normalized position weight according to the sum of the numbers of empty characters in the second text information, the average number of empty characters in the second text information, and the number of empty characters after each character in the second text information.
Specifically, the second normalized position weight is calculated by the following formula (4):
ε_i = (θ_avg − θ_i) / θ_N    formula (4)

wherein ε_i represents the second normalized position weight, θ_N represents the sum of the numbers of empty characters in the second text information, θ_avg represents the average number of empty characters in the second text information, and θ_i represents the number of empty characters after the i-th character in the second text information.
The VAD processing is based on the principle that when sound occurs, the flag of the frame is set to "1", and when no sound occurs, the flag of the frame is set to "0". At punctuation positions, the speaker mostly pauses longer. Statistics begin after the characters are transcribed by the ASR, and the positions labeled "0" in the VAD output are counted. The text is then aligned according to the ASR transcription.

In order to realize a high occurrence probability of punctuation at the positions where "0" appears in the VAD result, the VAD result can be fused in ways including, but not limited to, the following. Referring to fig. 5B, in fig. 5B, at positions where "0" appears more often, the probability of punctuation is higher. The position vector obtained after counting the "0"s at each position is: P_v = 3353334. The sum of "0"s in the VAD result is 3+3+5+3+3+3+4 = 24, i.e., θ_N = 24, so θ_avg = 24/7, θ_1 = 3, θ_2 = 3, θ_3 = 5, θ_4 = 3, θ_5 = 3, θ_6 = 3, θ_7 = 4.
Because the VAD output has various interferences and is not an ideal result, the same terms of the VAD tags are merged by combining the ASR transcribed text, so as to realize the alignment of the ASR transcription result and the VAD tag result. Referring to fig. 5C, the bold label "1" in fig. 5C may be caused by a transcoding error or other conditions; combining the ASR transcription result, when the corresponding position in the ASR transcription result is the empty character ∅, it is corrected to "0".
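The correction step can be sketched as follows: a VAD tag is forced to 0 wherever the aligned ASR output is the empty character ∅. For illustration, a one-to-one alignment between tags and ASR symbols is assumed (the patent's actual alignment merges runs of identical tags first).

```python
BLANK = "∅"

def correct_vad_tags(vad_tags, asr_symbols):
    """Force the VAD tag to 0 wherever ASR transcribed an empty character,
    assuming one tag per aligned ASR symbol (illustrative simplification)."""
    return [0 if sym == BLANK else tag
            for tag, sym in zip(vad_tags, asr_symbols)]

vad = [1, 1, 1, 0, 1, 1]
asr = ["那", "∅", "就", "∅", "好", "∅"]
print(correct_vad_tags(vad, asr))  # [1, 0, 1, 0, 1, 0]
```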
S5313, obtaining a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
Specifically, the third normalized position weight is calculated by the following formula (5):

δ_i = α_i + ε_i    formula (5)

wherein δ_i represents the third normalized position weight, α_i represents the first normalized position weight, and ε_i represents the second normalized position weight.
In order to better fuse the ASR transcription result and the VAD result information, the weight derived from the ASR-transcribed audio information and the weight derived from the VAD-processed audio information are additively fused by the method of formula (5) to obtain δ_i.
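Formula (5) is a simple element-wise addition of the two per-position weight vectors (the numeric values below are illustrative, not taken from the patent):

```python
def fuse_weights(alpha, epsilon):
    """Formula (5): third normalized position weight,
    delta_i = alpha_i + epsilon_i, applied element-wise per position."""
    return [a + e for a, e in zip(alpha, epsilon)]

alpha   = [0.1, -0.2, 0.05]   # illustrative first normalized position weights
epsilon = [0.2,  0.1, -0.1]   # illustrative second normalized position weights
print(fuse_weights(alpha, epsilon))
```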
Fig. 6 is a schematic flowchart of another punctuation prediction method provided in the embodiment of the present disclosure. The embodiment is further expanded and optimized on the basis of fig. 4. Optionally, in this embodiment, when the correction mode is a multiplicative intervention mode, a process of step S431 (obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight) is described.
S6311, obtaining a first normalization position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and the scaling factor.
Specifically, the first normalized position weight is calculated by the following formula (6):
α_i = γ * (max_1N − φ_i) / max_1N    formula (6)

wherein α_i represents the first normalized position weight, max_1N represents the maximum number of empty characters after any of the N characters of the first text information, φ_i represents the number of empty characters after the i-th character in the first text information, and γ represents the scaling factor.
As an example, γ is a scaling factor, and an optimal value thereof may be 1.5, or may be other reasonable values, which is not limited herein.
S6312, obtaining a second normalized position weight according to the maximum number of empty characters after each character in the N characters of the second text information, the number of empty characters after each character in the second text information, and the scaling factor.
Specifically, the second normalized position weight is calculated by the following formula (7):
ε_i = γ · (θ_i / max_2N)   formula (7)
wherein ε_i denotes the second normalized position weight, max_2N denotes the maximum number of empty characters after any one of the N characters of the second text information, θ_i denotes the number of empty characters after the i-th character in the second text information, and γ denotes the scaling factor.
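Formulas (6) and (7) share the same shape, so a single helper can compute either weight sequence. This is a minimal sketch assuming list inputs; the function name and zero-peak handling are assumptions not taken from the patent:

```python
# Hedged sketch of formulas (6)/(7): in the multiplicative-intervention mode
# each per-character empty-character count is normalized by the maximum count
# over the N characters and stretched by the scaling factor gamma.
def max_normalized_weights(empty_counts, gamma=1.5):
    """Return gamma * count_i / max(counts) for each character position."""
    peak = max(empty_counts)
    if peak == 0:               # no empty characters anywhere: all weights zero
        return [0.0] * len(empty_counts)
    return [gamma * c / peak for c in empty_counts]

alpha = max_normalized_weights([0, 4, 1, 2])   # from the first text information
eps = max_normalized_weights([1, 3, 0, 2])     # from the second text information
```

Calling it once on the counts from the first text information yields α_i, and once on the counts from the second text information yields ε_i.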
S6313, obtaining a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
Wherein N is an integer of 1 or more.
Specifically, the third normalized position weight is calculated by the above formula (5):
δ_i = α_i + ε_i   formula (5)
wherein α_i denotes the first normalized position weight and ε_i denotes the second normalized position weight.
In the embodiment of the disclosure, the probability of non-punctuation after each character is corrected using the non-speech characteristic information corresponding to the original audio and the literal characters and empty characters corresponding to the original audio. The context semantics of the transcribed text are thus recognized not only from the speech itself; the two kinds of audio information, the first text information and the second text information, are also combined to predict the probability of non-punctuation after each character, thereby obtaining the corrected prediction probability information of the non-punctuation label.
Fig. 7 is a schematic structural diagram of a semantic understanding apparatus according to an embodiment of the present disclosure. The apparatus is configured in an intelligent device and can implement the punctuation prediction method of any embodiment of the present disclosure. The apparatus 700 specifically includes the following:
a punctuation probability obtaining module 710, configured to obtain, based on a punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text; the transcribed text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of the punctuation label and the probability of the non-punctuation label appearing after each character in the transcribed text;
an audio text obtaining module 720, configured to obtain first text information corresponding to an original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;
and a punctuation probability correction module 730, configured to correct, according to the first text information and the second text information, the probability that a non-punctuation label appears after each character in the transcribed text, and obtain corrected prediction probability information of the non-punctuation label.
As an optional implementation manner of the embodiment of the present disclosure, the punctuation probability correction module 730 includes:
a weight acquisition unit configured to acquire a first normalized position weight, a second normalized position weight, and a third normalized position weight;
the first normalized position weight is the normalized weight of the number of empty characters behind each character in the first text information, the second normalized position weight is the normalized weight of the number of empty characters behind each character in the second text information, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;
the character probability correction unit is used for acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;
and the punctuation probability correction unit is used for acquiring the prediction probability information of the corrected non-punctuation label according to the probability of the non-punctuation label appearing after each corrected character.
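The module decomposition of apparatus 700 can be sketched as a class whose collaborators stand in for the punctuation prediction model, the VAD truncation step, and the ASR decoder. All names and signatures here are illustrative assumptions, not the patent's API:

```python
# Minimal structural sketch of the apparatus 700 in Fig. 7. The three
# injected callables are placeholders for the real model and audio pipelines.
class PunctuationPredictionApparatus:
    def __init__(self, punct_model, truncate_audio, decode_audio):
        self.punct_model = punct_model        # module 710
        self.truncate_audio = truncate_audio  # part of module 720
        self.decode_audio = decode_audio      # part of module 720

    def predict(self, transcript, audio):
        # module 710: per-character punctuation / non-punctuation probabilities
        punct_p, non_punct_p = self.punct_model(transcript)
        # module 720: first and second text information from the raw audio
        first_text = self.truncate_audio(audio)
        second_text = self.decode_audio(audio)
        # module 730: correction is delegated to a helper (not shown here)
        return self.correct(non_punct_p, first_text, second_text)

    def correct(self, non_punct_p, first_text, second_text):
        raise NotImplementedError("implemented per formulas (5)-(7)")
```

The `correct` hook corresponds to the weight acquisition, character probability correction, and punctuation probability correction units of module 730.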
As an optional implementation manner of the embodiment of the present disclosure, the probability of the punctuation label appearing after each character in the transcribed text and the probability of the non-punctuation label appearing after each character in the transcribed text are in negative correlation.
As an optional implementation manner of the embodiment of the present disclosure, the character probability correction unit is specifically configured to:
calculating the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of the non-punctuation mark appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.
As an optional implementation manner of the embodiment of the present disclosure, when the preset correction manner is a first preset correction manner, the calculating, according to the preset correction manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction includes:
and determining the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text and the sum of the products of the audio intervention adjusting parameter and the third normalized position weight.
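The first preset correction described above, corrected probability equals the original probability plus the product of the audio intervention adjusting parameter and the third normalized position weight, can be sketched as follows. Clipping the result to 1.0 is an added assumption to keep it a valid probability; the parameter name `lam` is also an assumption:

```python
# Hedged sketch of the first preset correction mode:
# corrected_i = p_i + lam * delta_i, clipped to [0, 1] (clipping assumed).
def additive_correction(non_punct_p, delta, lam):
    return [min(1.0, p + lam * d) for p, d in zip(non_punct_p, delta)]

corrected = additive_correction([0.4, 0.7], [1.7, 0.2], lam=0.1)
```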
As an optional implementation manner of the embodiment of the present disclosure, when the preset modification manner is a second preset modification manner, the calculating, according to the preset modification manner, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalized position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after modification includes:
and determining the probability of the non-punctuation label appearing after each corrected character according to the product of the probability of the non-punctuation label appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.
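The second preset correction, corrected probability equals the product of the original probability, the audio intervention adjusting parameter, and the third normalized position weight, can be sketched in the same style; the interface is an assumption:

```python
# Hedged sketch of the second preset correction mode:
# corrected_i = p_i * lam * delta_i.
def multiplicative_correction(non_punct_p, delta, lam):
    return [p * lam * d for p, d in zip(non_punct_p, delta)]
```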
As an optional implementation manner of the embodiment of the present disclosure, when the modification manner is an additive intervention manner, the weight obtaining unit is specifically configured to:
acquiring a first normalized position weight according to the sum of the number of empty characters in the first text information, the average number of empty characters in the first text information, and the number of empty characters after each character in the first text information;
acquiring a second normalized position weight according to the sum of the number of empty characters in the second text information, the average number of empty characters in the second text information, and the number of empty characters after each character in the second text information;
and acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
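The exact additive-intervention formulas are defined earlier in the patent and are not reproduced in this excerpt, so the sketch below only ASSUMES one plausible reading: each per-character count, smoothed by the average count, is divided by a total that makes the weights sum to one. Treat it as an illustration, not the patent's formula:

```python
# Assumed normalization for the additive-intervention weights: per-character
# empty-character counts smoothed by the average and divided by a common
# denominator so the weights form a distribution. This reading is a guess.
def additive_mode_weight(empty_counts):
    total = sum(empty_counts)
    if total == 0:
        return [0.0] * len(empty_counts)
    avg = total / len(empty_counts)
    denom = total + avg * len(empty_counts)
    return [(c + avg) / denom for c in empty_counts]
```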
As an optional implementation manner of the embodiment of the present disclosure, when the modification manner is a multiplicative intervention manner, the weight obtaining unit is specifically configured to:
acquiring a first normalized position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;
acquiring a second normalized position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and a scaling factor;
acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight;
wherein N is an integer of 1 or more.
In the embodiment of the disclosure, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text are obtained based on a punctuation prediction model; first text information and second text information corresponding to the original audio are obtained; and the probability of the non-punctuation label appearing after each character in the transcribed text is corrected according to the first text information and the second text information, so as to obtain the corrected prediction probability information of the non-punctuation label. The transcribed text is a text sequence obtained by performing speech recognition processing on the original audio, and the punctuation prediction model is used for outputting the probability of the punctuation label and the probability of the non-punctuation label appearing after each character in the transcribed text. The first text information is obtained by performing audio truncation processing on the original audio and comprises speech characteristic information and non-speech characteristic information corresponding to the original audio; the second text information is obtained by decoding the original audio and comprises transcribed literal characters and transcribed empty characters corresponding to the original audio.
The probability of non-punctuation appearing after each character is corrected through the non-speech characteristic information corresponding to the original audio and the literal characters and empty characters corresponding to the original audio. The context semantics of the transcribed text are thus recognized not only from the speech itself; the two kinds of audio information, the first text information and the second text information, are also combined to predict the probability of non-punctuation after each character, thereby obtaining the corrected prediction probability information of the non-punctuation label.
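Putting the pieces together, the disclosed method can be sketched end to end under the multiplicative weight mode and the first (additive) correction mode. The decision threshold, the clipping, and all helper names are illustrative assumptions; the patent excerpt fixes neither a threshold nor a final labeling rule:

```python
# End-to-end sketch: per-character non-punctuation probabilities from the
# model are corrected by empty-character counts derived from the VAD (first)
# and ASR (second) text information. Threshold rule is an assumption.
def predict_punctuation(non_punct_p, vad_counts, asr_counts,
                        gamma=1.5, lam=0.1, threshold=0.5):
    peak1 = max(vad_counts) or 1
    peak2 = max(asr_counts) or 1
    alpha = [gamma * c / peak1 for c in vad_counts]   # formula (6)
    eps = [gamma * c / peak2 for c in asr_counts]     # formula (7)
    delta = [a + e for a, e in zip(alpha, eps)]       # formula (5)
    corrected = [min(1.0, p + lam * d)                # first correction mode
                 for p, d in zip(non_punct_p, delta)]
    # punctuation and non-punctuation probabilities are negatively correlated,
    # so (by assumption) a character receives a punctuation mark when the
    # corrected non-punctuation probability falls below the threshold
    return [p < threshold for p in corrected]
```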
The semantic understanding apparatus provided in the embodiments of the present disclosure may execute the punctuation prediction method provided in any embodiment of the present disclosure, has the corresponding functional modules and beneficial effects of the execution method, and is not described herein again to avoid repetition.
The disclosed embodiment provides a computer device, including: one or more processors; a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the punctuation prediction method of any one of the embodiments of the present disclosure.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 8, the computer apparatus includes a processor 810 and a storage 820; the number of the processors 810 in the computer device may be one or more, and one processor 810 is taken as an example in fig. 8; the processor 810 and the storage 820 in the computer device may be connected by a bus or other means, and fig. 8 illustrates the connection by a bus as an example.
The storage device 820 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the punctuation prediction method in the embodiments of the present disclosure. The processor 810 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the storage 820, that is, the punctuation prediction method provided by the embodiment of the present disclosure is implemented.
The storage device 820 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the terminal, and the like. Additionally, storage 820 may include high-speed random access storage, and may also include non-volatile storage, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 820 may further include storage remotely located from processor 810, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The computer device provided by the embodiment can be used for executing the punctuation prediction method provided by any embodiment, and has corresponding functions and beneficial effects.
The embodiments of the present disclosure further provide a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, implement each process executed by the method provided in any of the above embodiments, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the foregoing discussion in some embodiments is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A method for punctuation prediction, the method comprising:
acquiring the probability of punctuation labels appearing after each character in the transcribed text and the probability of non-punctuation labels based on a punctuation prediction model; wherein the transcribed text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of punctuation labels and the probability of non-punctuation labels appearing after each character in the transcribed text;
acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;
and correcting the probability of the non-punctuation mark appearing after each character in the transcribed text according to the first text information and the second text information, and acquiring the corrected prediction probability information of the non-punctuation mark.
2. The method according to claim 1, wherein the correcting, according to the first text information and the second text information, the probability of occurrence of a non-punctuation mark after each character in the transcribed text to obtain corrected prediction probability information of the non-punctuation mark comprises:
obtaining a first normalized position weight, a second normalized position weight and a third normalized position weight;
the first normalized position weight is the normalized weight of the number of the empty characters behind each character in the first text message, the second normalized position weight is the normalized weight of the number of the empty characters behind each character in the second text message, and the third normalized position weight is the sum of the first normalized position weight and the second normalized position weight;
acquiring the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter;
and acquiring the prediction probability information of the corrected non-punctuation label according to the probability of the non-punctuation label appearing after each corrected character.
3. The method according to claim 1, wherein:
the probability of a punctuation label appearing after each character in the transcribed text is negatively correlated with the probability of a non-punctuation label appearing after each character in the transcribed text.
4. The method as claimed in claim 2, wherein said obtaining the modified probability of the occurrence of the non-punctuation mark after each character in the transcribed text according to the probability of the occurrence of the non-punctuation mark after each character, the third normalized position weight, and the audio intervention adjusting parameter comprises:
calculating the probability of the non-punctuation mark appearing after each character in the transcribed text, the third normalized position weight and the audio intervention adjusting parameter according to a preset correction mode, and acquiring the probability of the non-punctuation mark appearing after each corrected character; the preset correction mode comprises the following steps: a first preset correction mode and a second preset correction mode.
5. The method according to claim 4, wherein when the preset correction mode is a first preset correction mode, the calculating, according to the preset correction mode, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalization position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction comprises:
and determining the probability of the non-punctuation mark appearing after each corrected character according to the probability of the non-punctuation mark appearing after each character in the transcribed text and the sum of the products of the audio intervention adjusting parameter and the third normalized position weight.
6. The method according to claim 4, wherein when the preset correction mode is a second preset correction mode, the calculating, according to the preset correction mode, a probability that a non-punctuation mark appears after each character in the transcribed text, the third normalization position weight, and the audio intervention adjustment parameter, and obtaining the probability that the non-punctuation mark appears after each character after correction comprises:
and determining the probability of the non-punctuation label appearing after each corrected character according to the product of the probability of the non-punctuation label appearing after each character in the transcribed text, the audio intervention adjusting parameter and the third normalized position weight.
7. The method according to claim 4, wherein when the preset modification manner is a first preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:
acquiring a first normalized position weight according to the sum of the number of empty characters in the first text information, the average number of empty characters in the first text information, and the number of empty characters after each character in the first text information;
acquiring a second normalized position weight according to the sum of the number of empty characters in the second text information, the average number of empty characters in the second text information, and the number of empty characters after each character in the second text information;
and acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight.
8. The method according to claim 4, wherein when the preset modification manner is a second preset modification manner, the obtaining the first normalized position weight, the second normalized position weight, and the third normalized position weight includes:
acquiring a first normalized position weight according to the maximum number of empty characters behind each character in the N characters of the first text information, the number of empty characters behind each character in the first text information and a scaling factor;
acquiring a second normalized position weight according to the maximum number of empty characters behind each character in the N characters of the second text information, the number of empty characters behind each character in the second text information and a scaling factor;
acquiring a third normalized position weight according to the sum of the first normalized position weight and the second normalized position weight;
wherein N is an integer of 1 or more.
9. An apparatus for punctuation prediction, the apparatus comprising:
the punctuation probability acquisition module is used for acquiring, based on a punctuation prediction model, the probability of a punctuation label and the probability of a non-punctuation label appearing after each character in the transcribed text; the transcribed text is a text sequence obtained by performing voice recognition processing on original audio, and the punctuation prediction model is used for outputting the probability of the punctuation label and the probability of the non-punctuation label appearing after each character in the transcribed text;
the audio text acquisition module is used for acquiring first text information corresponding to original audio and second text information corresponding to the original audio; the first text information is obtained by performing audio truncation processing on the original audio, the first text information comprises voice characteristic information and non-voice characteristic information corresponding to the original audio, the second text information is obtained by decoding the original audio, and the second text information comprises transcribed characters and transcribed empty characters corresponding to the original audio;
and the punctuation probability correction module is used for correcting the probability of the occurrence of non-punctuation labels after each character in the transcribed text according to the first text information and the second text information and acquiring the corrected prediction probability information of the non-punctuation labels.
10. A speech recognition device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the punctuation prediction method as defined in any one of claims 1 to 8 when executing the computer program.
CN202211184502.5A 2022-09-27 2022-09-27 Punctuation prediction method and device and voice recognition equipment Pending CN115662432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211184502.5A CN115662432A (en) 2022-09-27 2022-09-27 Punctuation prediction method and device and voice recognition equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211184502.5A CN115662432A (en) 2022-09-27 2022-09-27 Punctuation prediction method and device and voice recognition equipment

Publications (1)

Publication Number Publication Date
CN115662432A true CN115662432A (en) 2023-01-31

Family

ID=84986269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211184502.5A Pending CN115662432A (en) 2022-09-27 2022-09-27 Punctuation prediction method and device and voice recognition equipment

Country Status (1)

Country Link
CN (1) CN115662432A (en)

Similar Documents

Publication Publication Date Title
US9520133B2 (en) Display apparatus and method for controlling the display apparatus
CN108063969B (en) Display apparatus, method of controlling display apparatus, server, and method of controlling server
US20220147870A1 (en) Method for providing recommended content list and electronic device according thereto
US20140350933A1 (en) Voice recognition apparatus and control method thereof
US20140196092A1 (en) Dialog-type interface apparatus and method for controlling the same
US20130169524A1 (en) Electronic apparatus and method for controlling the same
US20230343345A1 (en) Audio packet loss compensation method and apparatus and electronic device
US11687526B1 (en) Identifying user content
JP2014132465A (en) Display device and control method of the same
KR20130018464A (en) Electronic apparatus and method for controlling electronic apparatus thereof
JP2013041579A (en) Electronic device and method of controlling the same
CN112163086A (en) Multi-intention recognition method and display device
CN112489691A (en) Electronic device and operation method thereof
CN111816172A (en) Voice response method and device
CN110226202B (en) Method and apparatus for transmitting and receiving audio data
CN115662432A (en) Punctuation prediction method and device and voice recognition equipment
KR20130080380A (en) Electronic apparatus and method for controlling electronic apparatus thereof
KR102091006B1 (en) Display apparatus and method for controlling the display apparatus
CN113035194B (en) Voice control method, display device and server
KR20160022326A (en) Display apparatus and method for controlling the display apparatus
CN114822598A (en) Server and speech emotion recognition method
KR20210065308A (en) Electronic apparatus and the method thereof
CN117891517A (en) Display equipment and voice awakening method
CN117809659A (en) Server, terminal equipment and voice interaction method
CN115438625A (en) Text error correction server, terminal device and text error correction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination