CN107305541B - Method and device for segmenting speech recognition text - Google Patents
- Publication number
- CN107305541B (application CN201610256898.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/211 — Handling natural language data; natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30 — Handling natural language data; semantic analysis
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/04 — Speech recognition; segmentation; word boundary detection
Abstract
The invention discloses a method and a device for segmenting speech recognition text. The method comprises: performing endpoint detection on voice data to obtain each speech segment together with its starting and ending frame numbers; performing speech recognition on each speech segment to obtain the recognized text corresponding to it; extracting segmentation features from the recognized text corresponding to each speech segment; performing segmentation detection on the recognized text corresponding to the voice data using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented; and segmenting the recognized text corresponding to the voice data according to the segmentation detection result. The invention segments the recognized text automatically, giving it a clearer chapter structure.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a method and a device for segmenting a speech recognition text.
Background
With the development of voice technology, automatic speech recognition has been widely applied in many areas of life, and converting speech into text greatly facilitates people's daily needs: a conference recording can be converted into text and sent to participants as the meeting minutes, and a reporter's interview recording can be converted into text and then edited into a news article. However, the recognized text produced by speech recognition lacks the clear chapter structure of manually edited text, such as division into paragraphs. As a result, users viewing the recognized text often find it difficult to locate its emphasis or theme; when the text is long and involves multiple topics, it is even harder for users to grasp its chapter structure and accurately find the content of each topic. How to display the recognized text clearly, so as to help users understand its content, is therefore very important for presenting speech recognition results.
In the prior art, the recognized text of voice data is generally displayed to the user directly, without any processing of the recognition result; alternatively, the chapter structure of the recognized text is adjusted manually before display, for example by dividing the text into paragraphs according to its content. When the recognized text is long, manual adjustment is laborious, inefficient, and time-consuming, making it difficult for a recognition system to be practical.
Disclosure of Invention
The invention provides a method and a device for segmenting speech recognition text, so as to solve the prior-art problems of the heavy workload and low efficiency of manually adjusting the chapter structure of recognized text.
Therefore, the invention provides the following technical scheme:
a speech recognition text segmentation method comprising:
carrying out end point detection on the voice data to obtain each voice segment and a starting frame number and an ending frame number of each voice segment;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented;
and segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Preferably, the method further comprises constructing the segment model in the following manner:
collecting voice data;
carrying out end point detection on the collected voice data to obtain each voice section;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
marking the segmentation information of the identification text corresponding to each voice segment, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
and constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
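The patent does not fix a particular classifier for the segmentation model. As an illustration only — not the patented implementation — the training step above can be sketched with a minimal numpy logistic-regression segmenter over hypothetical two-dimensional features (pause length before the segment, and sentence count):

```python
import numpy as np

def train_segmentation_model(features, labels, lr=0.5, epochs=500):
    """Train a minimal logistic-regression segmenter.

    features: (n_samples, n_dims) array of per-segment segmentation
              features (e.g. pause length, sentence count).
    labels:   0/1 array -- 1 means "a paragraph break follows this segment".
    Stands in for the unspecified classifier in the patent.
    """
    X = np.hstack([features, np.ones((len(features), 1))])   # add bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid
        w -= lr * X.T @ (p - labels) / len(X)   # batch gradient step
    return w

def predict_segment_prob(w, feature_vec):
    x = np.append(feature_vec, 1.0)
    return 1.0 / (1.0 + np.exp(-x @ w))

# Toy labeled data: a long pause (first dim) correlates with a new paragraph.
X = np.array([[0.1, 2], [0.2, 3], [2.5, 8], [3.0, 9], [0.3, 1], [2.8, 7]], float)
y = np.array([0, 0, 1, 1, 0, 1])
w = train_segmentation_model(X, y)
print(predict_segment_prob(w, [2.9, 8]) > 0.5)
```

The feature layout and hyperparameters here are assumptions for the sketch; a real system would use the richer acoustic and semantic features described below.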
Preferably, the extracting the segmentation features of the recognized text corresponding to each speech segment includes:
extracting the segmentation characteristics of each voice segment from the acoustics of the voice data, and taking the segmentation characteristics as the first segmentation characteristics of the recognition text corresponding to the voice segment; and/or
And extracting a segmentation feature from the semantics of the recognition text, and using the segmentation feature as a second segmentation feature of the recognition text.
Preferably, the first segmentation feature comprises the duration of the current speech segment, and further comprises: the distance between the current voice section and the previous voice section, and/or the distance between the current voice section and the next voice section;
the acoustically extracting the segmentation features of the speech segments from the speech data comprises:
calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment;
further comprising:
calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or
And calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
Preferably, the first segmentation feature further comprises: whether the speaker of the current voice section is the same as the speaker of the previous voice section, and/or whether the speaker of the current voice section is the same as the speaker of the next voice section;
the acoustically extracting the segmentation features of the speech segments from the speech data further comprises:
carrying out speaker change point detection on the voice data by using a speaker separation technology;
and determining whether the speaker in the current voice section is the same as the speaker in the previous voice section according to the speaker change point detection result and/or determining whether the speaker in the current voice section is the same as the speaker in the next voice section according to the speaker change point detection result.
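Assuming a speaker-separation front end has already produced change-point frame numbers (the patent does not prescribe one), the same-speaker flags above can be derived as in this sketch with hypothetical names:

```python
def speaker_change_features(segment_bounds, change_points):
    """For each speech segment, flag whether the speaker matches the
    previous / next segment, given speaker change points (frame numbers)
    from an assumed speaker-separation front end.

    segment_bounds: list of (start_frame, end_frame) per speech segment.
    A change point falling in the gap between two segments means their
    speakers differ; None marks a missing neighbor.
    """
    feats = []
    for i, (start, end) in enumerate(segment_bounds):
        same_prev = same_next = None
        if i > 0:
            prev_end = segment_bounds[i - 1][1]
            same_prev = not any(prev_end <= cp <= start for cp in change_points)
        if i < len(segment_bounds) - 1:
            next_start = segment_bounds[i + 1][0]
            same_next = not any(end <= cp <= next_start for cp in change_points)
        feats.append((same_prev, same_next))
    return feats

bounds = [(0, 100), (120, 300), (320, 500)]
print(speaker_change_features(bounds, change_points=[310]))
# → [(None, True), (True, False), (False, None)]
```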
Preferably, the second segmentation feature comprises any one or more of:
the forward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmented mark;
the backward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts after the recognized text corresponding to the current speech segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the previous voice section;
and the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
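The patent does not fix a similarity measure for the last two features above; one common realization is cosine similarity between bag-of-words vectors of adjacent recognized texts — a sketch, not the patented computation:

```python
from collections import Counter
import math

def text_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two recognized
    texts; a plausible stand-in for the similarity feature."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(round(text_similarity("the meeting budget plan", "budget plan review"), 3))
```

A low similarity between the current segment's text and its neighbor suggests a topic change, and hence a candidate paragraph break.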
Preferably, the semantically extracting the segmentation features from the recognized text comprises:
and correcting the recognition text corresponding to the voice data, wherein the correction comprises: adding punctuation to the recognition text corresponding to the voice data;
segmentation features are extracted from the semantics of the modified recognized text.
Preferably, the correction further comprises any one or more of:
filtering abnormal words of the recognition text corresponding to the voice data;
performing smooth processing on the recognition text corresponding to the voice data;
carrying out digital normalization on the recognition text corresponding to the voice data;
performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
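A toy sketch of the correction operations above (filler filtering, digit normalization, case conversion, sensitive-word masking) on English text; the word lists are hypothetical, and punctuation restoration, which needs its own model, is omitted:

```python
# Hypothetical filler / sensitive-word lists for illustration only.
FILLERS = {"um", "uh", "er"}
SENSITIVE = {"secret"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def correct_text(text):
    """Sketch of the correction pipeline applied to a recognized text."""
    words = []
    for w in text.split():
        lw = w.lower()
        if lw in FILLERS:
            continue                      # abnormal-word filtering / smoothing
        if lw in SENSITIVE:
            words.append("***")           # sensitive-word replacement
        elif lw in DIGITS:
            words.append(DIGITS[lw])      # digit normalization
        else:
            words.append(w.upper())       # lowercase-to-uppercase conversion
    return " ".join(words)

print(correct_text("um room three is uh secret"))  # → "ROOM 3 IS ***"
```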
Preferably, the step of detecting the segmentation of the recognized text corresponding to the speech data by using the extracted segmentation features and a pre-constructed segmentation model to determine the position to be segmented includes:
and sequentially inputting the segmentation characteristics of the recognition texts corresponding to the speech segments into the segmentation model for segmentation detection by taking the speech segments as units, and determining whether the end positions of the recognition texts corresponding to the speech segments need to be segmented or not.
Preferably, the method further comprises:
displaying the segmented recognition text to a user; or
Extracting the topics of the segmented paragraph recognition texts, and displaying the topics to a user;
when a topic which is interesting to the user is sensed, the identification text of the paragraph corresponding to the topic is displayed to the user.
A speech recognition text segmentation apparatus comprising:
the end point detection module is used for carrying out end point detection on the voice data to obtain each voice section and the starting frame number and the ending frame number of each voice section;
the voice recognition module is used for carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
the feature extraction module is used for extracting the segmented features of the identification texts corresponding to the voice segments;
the segmentation detection module is used for carrying out segmentation detection on the recognition text corresponding to the voice data by utilizing the extracted segmentation characteristics and a pre-constructed segmentation model so as to determine the position to be segmented;
and the segmentation module is used for segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Preferably, the apparatus further comprises a segment model construction module for constructing a segment model; the segmentation model building module comprises:
a data collection unit for collecting voice data;
the end point detection unit is used for carrying out end point detection on the voice data collected by the data collection unit to obtain each voice section;
the voice recognition unit is used for carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
the marking unit is used for marking the segmentation information of the identification text corresponding to each voice segment, and the segmentation information is used for indicating whether the end position of the identification text corresponding to the current voice segment needs to be segmented or not;
the characteristic extraction unit is used for extracting the segmented characteristics of the identification texts corresponding to the voice segments;
and the training unit is used for constructing a segmentation model by taking the segmentation characteristics and the segmentation information as training data.
Preferably, the feature extraction module includes:
the first feature extraction module is used for extracting the segmented features of the voice segments from the acoustics of the voice data and taking the segmented features as the first segmented features of the recognition texts corresponding to the voice segments; and/or
And the second feature extraction module is used for extracting segmentation features from the semantics of the recognition text and taking the segmentation features as second segmentation features of the recognition text.
Preferably, the first feature extraction module includes:
the time length calculating unit is used for calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment;
the distance calculation unit is used for calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
Preferably, the first feature extraction module further comprises:
the speaker change point detection unit is used for detecting the speaker change points of the voice data by using a speaker separation technology;
and the speaker determining unit is used for determining whether the speaker of the current voice section is the same as the speaker of the previous voice section according to the speaker change point detection result and/or determining whether the speaker of the current voice section is the same as the speaker of the next voice section according to the speaker change point detection result.
Preferably, the second segmentation feature comprises any one or more of:
the forward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmented mark;
the backward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts after the recognized text corresponding to the current speech segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the previous voice section;
and the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
Preferably, the second feature extraction module includes:
a correction unit configured to correct a recognition text corresponding to the speech data, the correction unit including: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the voice data;
and the feature extraction unit is used for extracting segmentation features from the semanteme of the corrected recognition text.
Preferably, the correction unit further comprises any one or more of the following sub-units:
the filtering subunit is used for filtering abnormal words of the recognition text corresponding to the voice data;
a smoothing processing subunit, configured to perform smoothing processing on the recognition text corresponding to the voice data;
the normalization subunit is used for carrying out digital normalization on the recognition text corresponding to the voice data;
a text replacement subunit, configured to perform text replacement on the recognition text corresponding to the voice data, where the text replacement includes: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
Preferably, the segmentation detection module is specifically configured to, with a speech segment as a unit, sequentially input the segmentation characteristics of the recognized text corresponding to each speech segment into the segmentation model for segmentation detection, and determine whether the end position of the recognized text corresponding to each speech segment needs to be segmented.
Preferably, the apparatus further comprises:
the first display module is used for displaying the segmented identification texts to a user; or
The topic extraction module is used for extracting the topic of each paragraph identification text after segmentation;
the second display module is used for displaying all the themes to the user;
and the perception module is used for perceiving the topic which is interested by the user and triggering the second display module to display the identification text of the paragraph corresponding to the topic to the user when perceiving the topic which is interested by the user.
The invention provides a method and a device for segmenting speech recognition text. Endpoint detection is performed on voice data to obtain each speech segment, and speech recognition on each segment yields its corresponding recognized text. Segmentation features are then extracted from the recognized text of each speech segment, and segmentation detection is performed on the recognized text of the voice data using these features together with a pre-constructed segmentation model, so as to determine the positions to be segmented; the recognized text is segmented according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that users can quickly understand the content of the recognized text and read it more efficiently.
Further, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Furthermore, all segmented identification texts can be displayed for a user, or the theme of each segment of identification text is extracted, the theme of each segment is displayed for the user, and when the user needs to check an interested segment, the content of the segment is displayed, so that the user can quickly find the interested content.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for speech recognition text segmentation in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of building a segmentation model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a speech recognition text segmentation apparatus;
FIG. 4 is a schematic structural diagram of a segment module building block in an embodiment of the present invention;
FIG. 5 is a schematic diagram of another structure of a speech recognition text segmentation apparatus according to an embodiment of the present invention;
fig. 6 is another structural diagram of the speech recognition text segmenting device according to the embodiment of the invention.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and specific implementations.
As shown in fig. 1, it is a flowchart of a speech recognition text segmentation method according to an embodiment of the present invention, including the following steps:
And 101, carrying out end point detection on the voice data to obtain each voice segment and the starting frame number and the ending frame number of each voice segment.

The voice data can be obtained from recordings in the actual application, such as meeting recordings and interview recordings.

End point detection means finding the starting point and the ending point of each speech segment in a given speech signal. Any existing end point detection method may be used; the embodiment of the present invention is not limited in this respect.
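Since the patent leaves the endpoint-detection method open, the following sketch merely illustrates what it produces — start and end frame numbers per speech segment — by thresholding frame energy; real systems use far more robust voice-activity detection:

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_thresh=0.01):
    """Minimal energy-threshold endpoint detection: returns
    (start_frame, end_frame) pairs for runs of frames whose mean
    energy exceeds the threshold. Illustration only."""
    n_frames = len(signal) // frame_len
    energies = [np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > energy_thresh and start is None:
            start = i                       # speech onset
        elif e <= energy_thresh and start is not None:
            segments.append((start, i))     # speech offset
            start = None
    if start is not None:
        segments.append((start, n_frames))
    return segments

# Synthetic signal: silence, a short "speech" burst, silence.
sig = np.zeros(160 * 30)
sig[160 * 5:160 * 12] = 0.5
print(detect_endpoints(sig))  # → [(5, 12)]
```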
And 102, performing voice recognition on each voice section to obtain a recognition text corresponding to each voice section.
Specifically, feature extraction may be performed on each speech segment, for example extracting MFCC (Mel-Frequency Cepstral Coefficient) features; then a decoding operation is carried out using the extracted features together with pre-trained acoustic and language models; finally, the recognized text corresponding to each speech segment is obtained from the decoding result. The specific speech recognition process is the same as in the prior art and is not detailed here.
And 103, extracting the segmented characteristics of the recognition texts corresponding to the voice segments.
In practical application, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, or, of course, the two segmentation features extracted based on different layers may be integrated, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
And 104, carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented.
Specifically, with the voice segments as units, the segmentation characteristics of the recognized text corresponding to each voice segment are input into the segmentation model for segmentation detection, and whether the ending position of the recognized text corresponding to each voice segment needs to be segmented is determined.
It should be noted that, in practical applications, the output of the segmentation model may be whether the end position of the recognized text corresponding to the current speech segment needs to be segmented, or may be a probability that the end position of the recognized text corresponding to the current speech segment needs to be segmented. Of course, the output of different types of parameters does not affect the training process of the segmented model, and only different input and output parameters need to be set during model training. The specific training process of the segmented model will be described in detail later.
If the output of the segmentation model is a probability, a corresponding threshold may be preset; when the probability exceeds the threshold, it is determined that the end position of the recognized text corresponding to the current speech segment needs to be segmented.
And 105, segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Specifically, a segmentation mark may be added at an end position of the recognition text that needs to be segmented, so that the recognition text corresponding to the voice data may be conveniently displayed in a segmented manner according to the segmentation mark during displaying.
It should be noted that, in practical application, the segmentation detection may be performed on the recognition text every time a segmentation feature of the recognition text corresponding to a speech segment is extracted; or, after the segmented features of the recognized texts corresponding to all the speech segments are extracted, the segmented features of the recognized texts corresponding to each speech segment are sequentially input into the segmentation model for segment detection by taking the speech segment as a unit, which is not limited in the embodiment of the present invention.
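Assuming the model emits per-segment probabilities as described in step 104, turning them into break positions is a one-line thresholding pass; the 0.5 threshold is an assumed value, since the patent only says a threshold may be preset:

```python
def detect_segment_positions(probabilities, threshold=0.5):
    """Return indices of speech segments after whose recognized text a
    paragraph break should be placed, given per-segment break
    probabilities from the segmentation model."""
    return [i for i, p in enumerate(probabilities) if p > threshold]

# Hypothetical model outputs for six speech segments.
print(detect_segment_positions([0.1, 0.8, 0.2, 0.3, 0.9, 0.4]))  # → [1, 4]
```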
In another embodiment of the method of the present invention, the method may further comprise the step of presenting the segmented recognized text to a user. During specific display, the texts belonging to the same segment can be placed in a paragraph according to the segmentation marks in the recognized texts, and the texts in different segments are displayed in a segmented manner.
For example, the recognized text is A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, where Ai denotes the recognized text corresponding to one speech segment. If segmentation detection determines that breaks are required after A2 and A5, the display takes the following form:
A1,A2
A3,A4,A5
A6,A7,A8,A9,A10
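The grouping above can be reproduced with a small helper that folds per-segment texts into paragraphs at the detected segmentation marks (the names here are illustrative, not from the patent):

```python
def render_paragraphs(texts, break_after):
    """Group per-segment recognized texts into display paragraphs.

    texts:       recognized text per speech segment, in order.
    break_after: set of segment indices after which a paragraph ends.
    """
    paragraphs, current = [], []
    for i, text in enumerate(texts):
        current.append(text)
        if i in break_after:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)   # trailing paragraph without a mark
    return paragraphs

texts = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"]
for para in render_paragraphs(texts, break_after={1, 4}):  # breaks after A2, A5
    print(",".join(para))
```

Running this prints the three paragraphs exactly as in the example above.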
in another embodiment of the method of the present invention, the method may further comprise the following steps:
extracting the topics of the segmented paragraph recognition texts, and displaying the topics to a user;
when a topic which is interesting to the user is sensed, the identification text of the paragraph corresponding to the topic is displayed to the user.
There are various ways for the user to select the interested topic, for example, clicking or swiping the corresponding topic, or giving a corresponding serial number to each topic, and the user inputting the corresponding serial number through a keyboard.
As mentioned above, in the embodiment of the present invention the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, or the two kinds of features extracted at these different levels may be combined: the segmentation feature of each speech segment extracted acoustically from the voice data serves as the first segmentation feature of the recognized text corresponding to that speech segment, and the segmentation feature extracted semantically from the recognized text serves as its second segmentation feature. Accordingly, the segmentation model may be trained on the acoustic segmentation features alone, on the semantic segmentation features alone, or on both together.
The segmentation features at these two different levels are described in detail below.
1. Segmentation features extracted acoustically from the voice data, i.e. the first segmentation features described earlier.
In practical applications, the first segmentation features may include the duration of the current speech segment, and may further include the distance between the current speech segment and the previous speech segment and/or the distance between the current speech segment and the subsequent speech segment.
Further, the first segmentation feature may further include: whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
These several segmentation features are described in detail below.
a) Duration of current speech segment
The duration of a speech segment may be represented by the number of frames it contains. The difference between the ending frame number and the starting frame number of the current speech segment is therefore calculated and used as the duration of the current speech segment.
b) Distance between current speech segment and previous speech segment
The distance between the current speech segment and the previous speech segment may be represented using a difference between a starting frame number of the current speech segment and an ending frame number of the previous speech segment. Therefore, the difference between the starting frame number of the current speech segment and the ending frame number of the previous speech segment is calculated and used as the distance between the current speech segment and the previous speech segment.
It should be noted that, when the current speech segment is the first speech segment, the distance between the current speech segment and the previous speech segment is 0.
c) Distance between current speech segment and subsequent speech segment
Similarly, the distance between the current speech segment and the subsequent speech segment may be represented using the difference between the start frame number of the subsequent speech segment and the end frame number of the current speech segment. Therefore, the difference between the starting frame number of the next speech segment and the ending frame number of the current speech segment is calculated and used as the distance between the current speech segment and the next speech segment.
It should be noted that, when the current speech segment is the last speech segment, the distance between the current speech segment and the next speech segment is 0.
d) Whether the speaker of the current speech segment is the same as the speaker of the previous speech segment
e) Whether the speaker of the current speech segment is the same as the speaker of the next speech segment
Whether the speakers of adjacent speech segments are the same can be detected by using speaker-separation technology to find the speaker change points in the voice data; according to the change-point detection result, it is determined whether the speaker of the current speech segment is the same as the speaker of the previous speech segment and/or the next speech segment.
The specific method for detecting a speaker change point, i.e. the position where one speaker finishes speaking and another begins, is the same as in the prior art and is not described in detail here.
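Features a) through e) can be sketched together as follows. The data layout (one dict per speech segment with start/end frame numbers from endpoint detection and a speaker label from speaker separation) and all names are assumed for illustration only:

```python
def acoustic_features(segments, i):
    """First segmentation features of segment i, per features a)-e)."""
    seg = segments[i]
    duration = seg["end"] - seg["start"]           # a) duration in frames
    # b) distance to previous segment (0 for the first segment)
    dist_prev = seg["start"] - segments[i - 1]["end"] if i > 0 else 0
    # c) distance to next segment (0 for the last segment)
    dist_next = segments[i + 1]["start"] - seg["end"] if i + 1 < len(segments) else 0
    # d)/e) same-speaker flags derived from change-point detection results
    same_prev = int(i > 0 and seg["spk"] == segments[i - 1]["spk"])
    same_next = int(i + 1 < len(segments) and seg["spk"] == segments[i + 1]["spk"])
    return [duration, dist_prev, dist_next, same_prev, same_next]

segs = [{"start": 0, "end": 120, "spk": "S1"},
        {"start": 150, "end": 400, "spk": "S1"},
        {"start": 430, "end": 500, "spk": "S2"}]
print(acoustic_features(segs, 1))   # [250, 30, 30, 1, 0]
```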
2. Segmentation features extracted from the semantics of the recognized text, i.e. the second segmentation features described earlier.
In practical applications, the second segmentation features may include any one or more of the following:
a) the number of the forward non-segmented sentences refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmentation mark.
The position of the last segmentation mark is obtained from the segmentation marks already present in the recognized text preceding the text of the current speech segment; those marks, in turn, come from the segmentation detection results of the recognized texts corresponding to earlier speech segments.
It should be noted that, in practical applications, if the second segmentation features include the number of forward non-segmented sentences, segmentation detection must be performed segment by segment: the segmentation features of the recognized text corresponding to one speech segment are extracted and detected before moving to the next, since this feature depends on the segmentation marks already placed.
In addition, it should be noted that, if the recognized text corresponding to the current speech segment is the beginning of the recognized text, the number of forward non-segmented sentences is 0.
b) The number of backward non-segmented sentences refers to the total number of sentences contained in all recognized text after the recognized text corresponding to the current speech segment. It can be obtained by counting the sentences that follow that recognized text.
It should be noted that, if the recognized text corresponding to the current speech segment is the end of all recognized texts, the number of backward non-segmented sentences is 0.
c) The number of sentences contained in the recognized text corresponding to the current speech segment.
Specifically, punctuations in the recognized text corresponding to the current speech segment can be directly analyzed to obtain the corresponding sentence number.
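Features a) through c) can be sketched as below. The sentence-final punctuation set, the whitespace-free counting, and the flag convention (seg_flags[j] = 1 if a segmentation mark follows segment j) are simplifying assumptions for illustration:

```python
import re

SENTENCE_END = "。！？.!?"   # assumed sentence-final punctuation marks

def count_sentences(text):
    """c) number of sentences, found by splitting on terminal punctuation."""
    parts = re.split("[" + re.escape(SENTENCE_END) + "]", text)
    return len([p for p in parts if p.strip()])

def forward_backward_counts(texts, seg_flags, i):
    """a)/b) forward and backward non-segmented sentence counts for segment i."""
    start = 0                            # position just after the last mark
    for j in range(i - 1, -1, -1):
        if seg_flags[j]:
            start = j + 1
            break
    forward = sum(count_sentences(t) for t in texts[start:i])
    backward = sum(count_sentences(t) for t in texts[i + 1:])
    return forward, backward

texts = ["Hello there. How are you?", "Fine.", "Great. Bye."]
flags = [0, 0, 0]
print(forward_backward_counts(texts, flags, 1))   # (2, 2)
```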
d) And the similarity between the recognition text corresponding to the current speech section and the recognition text corresponding to the previous speech section.
Similarity is generally measured by the distance or angle between vectors; for example, the cosine of the angle between the two recognized-text vectors can be calculated, and the smaller the angle, the higher their similarity. Word vectorization is prior art and is not described in detail here.
In order to eliminate the interference of stop words in the text-similarity computation, the stop words contained in the recognized texts of the current and previous speech segments may first be deleted. Stop words are words, symbols, or garbled characters that appear frequently in recognized text but carry no practical meaning, such as "this", "and", "is"; they can be located by looking them up in a pre-constructed stop-word list. The words remaining after deletion are then vectorized, and the word vectors within each text are combined to obtain a recognized-text vector for the current speech segment and one for the previous speech segment, whose similarity is then calculated.
It should be noted that, when the recognized text corresponding to the current speech segment is the beginning of all recognized texts, this similarity is 0.
e) And the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
Similar to the previous similarity calculation method, i.e. after deleting the stop word in the recognized text, the recognized text is vectorized and then the similarity is calculated.
It should be noted that, when the recognized text corresponding to the current speech segment is the end of all recognized texts, this similarity is 0.
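Features d) and e) can be sketched as a bag-of-words cosine similarity after stop-word removal. The stop-word list and the whitespace tokenization are simplifying assumptions (the patent's word vectorization is left open as prior art):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "this"}   # assumed stop-word list

def text_vector(text):
    """Bag-of-words vector of a recognized text with stop words removed."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def cosine_similarity(v1, v2):
    common = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in common)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

cur = text_vector("the meeting budget is approved")
prev = text_vector("budget approved for this meeting")
print(round(cosine_similarity(cur, prev), 3))   # 0.866
```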
It should be noted that before extracting the second segmentation feature, the recognition text corresponding to the voice data needs to be modified, and then the second segmentation feature is extracted from the semantic of the modified recognition text.
The correction of the recognized text mainly comprises adding punctuation to it, that is, inserting the appropriate punctuation marks, for example based on a conditional random field model. In order to make the added punctuation more accurate, different thresholds can be set for adding punctuation between speech segments and within a speech segment: setting a smaller threshold between segments and a larger threshold within a segment increases the probability of adding punctuation at segment boundaries and reduces it inside segments. The punctuated text then consists of sentences separated by punctuation marks such as commas, periods, and question marks.
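The two-threshold idea can be sketched as follows. The threshold values and the notion of a per-position punctuation probability (as a punctuation model such as a CRF would emit) are assumptions for illustration:

```python
BETWEEN_SEGMENT_THRESHOLD = 0.3   # assumed: smaller, so punctuation is added more readily
WITHIN_SEGMENT_THRESHOLD = 0.6    # assumed: larger, so punctuation is added less readily

def add_punct(prob, at_segment_boundary):
    """Decide whether to add punctuation at a candidate position.

    prob: model probability that punctuation belongs here;
    at_segment_boundary: True if the position lies between two speech segments.
    """
    threshold = BETWEEN_SEGMENT_THRESHOLD if at_segment_boundary else WITHIN_SEGMENT_THRESHOLD
    return prob >= threshold

# the same probability passes the between-segment threshold but not the within-segment one
print(add_punct(0.4, at_segment_boundary=True))    # True
print(add_punct(0.4, at_segment_boundary=False))   # False
```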
Secondly, the modification may further include any one or more of the following:
(1) and filtering abnormal words of the recognition text corresponding to the voice data.
Text filtering mainly removes abnormal words that were recognized in error; words can be filtered according to word confidence and the result of syntactic analysis.
(2) And performing smooth processing on the recognition text corresponding to the voice data.
Text smoothing mainly consists of smoothing disfluent sentences. Of a repeated word or phrase that adds no meaning, such as "very good, very good", only one instance is kept; and filler words with no practical meaning, such as "uh" or "um", can be ignored rather than transcribed.
(3) And carrying out digital normalization on the recognition text corresponding to the voice data.
All numbers in recognized text produced by speech recognition are written out as Chinese numerals, but some should be shown as Arabic numerals to match readers' habits; for example, "twenty-one point five yuan" should be rendered as "21.5 yuan". Number normalization converts such Chinese numerals into Arabic numerals, for example using an ABNF-grammar-based method.
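A minimal sketch of this conversion is shown below. A production system would use a full grammar (e.g. the ABNF-based method mentioned above); this illustration handles only integers up to the thousands plus a decimal part:

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def chinese_to_arabic(text):
    """Convert a simple Chinese numeral string to an Arabic number."""
    if "点" in text:                        # "点" marks the decimal point
        whole, frac = text.split("点", 1)
        frac_digits = "".join(str(DIGITS[c]) for c in frac)
        return float(str(chinese_to_arabic(whole)) + "." + frac_digits)
    total, num = 0, 0
    for ch in text:
        if ch in DIGITS:
            num = DIGITS[ch]
        elif ch in UNITS:
            total += (num if num else 1) * UNITS[ch]   # bare "十" means 10
            num = 0
    return total + num

print(chinese_to_arabic("二十一点五"))   # 21.5, i.e. "twenty-one point five"
```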
(4) Performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises two conditions:
One case is replacement between upper and lower case in English, namely converting lowercase English letters in the recognized text corresponding to the voice data into uppercase letters, or vice versa, such as replacing "nba" with "NBA";
In the other case, sensitive words in the recognized text corresponding to the voice data are replaced with special symbols so that they are hidden. For the replacement, a sensitive-word list can be built and traversed to check whether any sensitive word appears in the recognized text; if one does, it is replaced with a special symbol. For example, if a word with violent connotations such as "robbery" is on the list, each occurrence of "robbery" in the text is replaced with a symbol such as "*".
It should be noted that either or both of the above text-replacement cases may be applied according to the needs of the actual application; the embodiment of the present invention is not limited in this respect.
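Both replacement cases can be sketched together. The case table and sensitive-word list below are hypothetical examples, not part of the patent:

```python
CASE_TABLE = {"nba": "NBA", "usb": "USB"}    # assumed known-term case table
SENSITIVE_WORDS = {"robbery"}                # assumed sensitive-word list

def replace_text(text):
    """Apply case replacement, then mask sensitive words with '*'."""
    out = []
    for w in text.split():
        w = CASE_TABLE.get(w.lower(), w)     # case 1: upper/lower-case replacement
        if w.lower() in SENSITIVE_WORDS:     # case 2: sensitive-word masking
            w = "*" * len(w)
        out.append(w)
    return " ".join(out)

print(replace_text("the nba game and the robbery report"))
# prints: the NBA game and the ******* report
```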
As shown in fig. 2, it is a flowchart of constructing a segmentation model in the embodiment of the present invention, and includes the following steps:
And 203, performing voice recognition on each voice section to obtain a recognition text corresponding to each voice section.
And 204, marking the segmentation information of the identification texts corresponding to the voice segments, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not.
For example, if segmented, it is labeled 1, otherwise it is labeled 0. Of course, other symbols may be used, and the embodiments of the present invention are not limited.
And step 206, constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
The segmentation model may adopt a model commonly used in pattern recognition, such as a Bayesian model or a support vector machine. During training, the segmentation features of the recognized text serve as the model input and the annotated segmentation information as the model output; model training then yields the segmentation model. The specific training process is the same as in the prior art and is not described in detail here. It should be noted that the segmentation model can be obtained by offline training.
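Training can be sketched as follows. The patent leaves the model family open (Bayesian, SVM, etc.); for a self-contained illustration this uses a tiny hand-rolled logistic regression, and the feature vectors and 0/1 labels are toy stand-ins for the extracted segmentation features and the annotated segmentation information:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit a logistic-regression segmenter by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = place a segmentation mark after this segment, 0 = do not."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# toy features: [normalized duration, dist_prev, dist_next, same_speaker_prev]
X = [[0.5, 0.1, 0.1, 1], [0.2, 0.9, 0.0, 0], [0.6, 0.05, 0.2, 1], [0.3, 0.8, 0.1, 0]]
y = [0, 1, 0, 1]        # annotated segmentation information
w, b = train_logreg(X, y)
print(predict(w, b, [0.25, 0.85, 0.05, 0]))   # 1: far from previous segment, new speaker
```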
It should be noted that, when the segmentation model training is performed, the segmentation model may be trained based on the acoustic segmentation features alone, or based on the semantic segmentation features alone, or based on the acoustic segmentation features and the semantic segmentation features. Accordingly, in the step 205, when the segmentation features of the recognized text corresponding to each speech segment are extracted, only the segmentation features based on acoustics or the segmentation features based on semantics may be extracted, or the segmentation features based on acoustics and the segmentation features based on semantics may be extracted at the same time, which is not limited in the embodiment of the present invention.
In addition, it should be noted that, when the segmentation model trained based on the different types of segmentation features is used to perform segmentation detection on the text to be displayed and recognized, the corresponding type of segmentation features of the text to be displayed and recognized need to be extracted, and the extracted segmentation features are input into the segmentation model to determine the position of the text to be displayed and recognized, which needs to be segmented.
According to the speech recognition text segmentation method provided by the invention, endpoint detection is performed on voice data to obtain speech segments, and speech recognition is performed on each segment to obtain its recognized text. Segmentation features of the recognized text corresponding to each segment are then extracted, segmentation detection is performed on the recognized text of the voice data using the extracted features and a pre-constructed segmentation model, and the recognized text is segmented according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that the user can rapidly understand the content of the recognized text and read more efficiently.
In practical application, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may also be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Furthermore, the speech recognition text segmentation method provided by the invention can display the segmented recognized text to the user, or extract a topic for each segment and display the topics to the user, showing a paragraph's content when the user wishes to view a paragraph of interest. This helps the user quickly find the content of interest.
Correspondingly, an embodiment of the present invention further provides a speech recognition text segmentation apparatus, as shown in fig. 3, which is a schematic structural diagram of the apparatus.
In this embodiment, the apparatus comprises:
an endpoint detection module 301, configured to perform endpoint detection on the voice data to obtain each voice segment and a start frame number and an end frame number of each voice segment;
the voice recognition module 302 is configured to perform voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the feature extraction module 303 is configured to extract a segmentation feature of the recognition text corresponding to each speech segment;
a segmentation detection module 304, configured to perform segmentation detection on the recognition text corresponding to the speech data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine a position to be segmented; specifically, with the voice segments as units, sequentially inputting the segmentation characteristics of the recognition texts corresponding to the voice segments into the segmentation model for segmentation detection, and determining whether the end positions of the recognition texts corresponding to the voice segments need to be segmented;
and a segmenting module 305, configured to segment the recognition text corresponding to the voice data according to a segmentation detection result.
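The detection loop inside module 304 can be sketched as below: feature vectors are fed into the segmentation model one speech segment at a time, yielding a 0/1 decision for the end position of each segment's recognized text. The function name, the toy features, and the stand-in model are all hypothetical:

```python
def detect_segments(feature_vectors, model_predict):
    """Sequentially run segmentation detection, one speech segment at a time.

    model_predict: callable mapping one feature vector to 0 or 1.
    """
    return [model_predict(fv) for fv in feature_vectors]

# stand-in model: segment whenever the distance to the previous segment
# (here, feature index 1) exceeds a threshold
flags = detect_segments(
    [[0.5, 0.1], [0.2, 0.9], [0.6, 0.05]],
    lambda fv: int(fv[1] > 0.5),
)
print(flags)   # [0, 1, 0]
```

The resulting flags are exactly what the segmenting module 305 consumes when splitting the recognized text into paragraphs.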
It should be noted that, in practical applications, the feature extraction module 303 may extract the segmentation features from the acoustics of the speech data or the semantics of the recognized text, or may extract the segmentation features based on different layers by combining the two types of features. Accordingly, the feature extraction module 303 may include: a first feature extraction module and/or a second feature extraction module. Wherein:
the first feature extraction module is used for extracting the segmented features of the voice segments from the acoustics of the voice data and taking the segmented features as the first segmented features of the recognition texts corresponding to the voice segments;
and the second feature extraction module is used for extracting segmentation features from the semantics of the recognition text and taking the segmentation features as second segmentation features of the recognition text.
Wherein an embodiment of the first feature extraction module comprises: a duration calculation unit and a distance calculation unit; another embodiment of the first feature extraction module may further comprise: a speaker change point detection unit and a speaker determination unit. These units will be described separately below.
The time length calculating unit is used for calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment.
The distance calculating unit is used for calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
The speaker change point detection unit is used for detecting the speaker change point of the voice data by using a speaker separation technology.
The speaker determining unit is used for determining whether the speaker of the current voice section is the same as the speaker of the previous voice section according to the detection result of the speaker change point and/or determining whether the speaker of the current voice section is the same as the speaker of the next voice section according to the detection result of the speaker change point.
One embodiment of the second feature extraction module comprises:
a correction unit configured to correct a recognition text corresponding to the speech data, the correction unit including: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the speech data, for example, add punctuation to the recognition text corresponding to the speech data based on a conditional random field model;
and the feature extraction unit is used for extracting segmentation features from the semanteme of the corrected recognition text.
The second segmentation features extracted by the feature extraction unit may include any one or more of the following: the number of forward non-segmented sentences, the number of backward non-segmented sentences, the number of sentences contained in the identification text corresponding to the current voice segment, the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the previous voice segment, and the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the next voice segment.
In practical applications, the modification unit may further include any one or more of the following sub-units:
the filtering subunit is used for filtering abnormal words of the recognition text corresponding to the voice data;
a smoothing processing subunit, configured to perform smoothing processing on the recognition text corresponding to the voice data;
the normalization subunit is used for carrying out digital normalization on the recognition text corresponding to the voice data;
a text replacement subunit, configured to perform text replacement on the recognition text corresponding to the voice data, where the text replacement includes: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
In the embodiment of the present invention, the segmentation model may be constructed offline by a corresponding segmentation model construction module, which may be independent from the speech recognition text segmentation apparatus of the present invention or integrated with it; the embodiment of the present invention is not limited in this respect.
Fig. 4 is a schematic structural diagram of a segmentation model building module in the embodiment of the present invention, including:
a data collection unit 401 for collecting voice data;
an endpoint detection unit 402, configured to perform endpoint detection on the voice data collected by the data collection unit to obtain each voice segment;
a voice recognition unit 403, configured to perform voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
a labeling unit 404, configured to label segmentation information of the identification text corresponding to each speech segment, where the segmentation information is used to indicate whether an end position of the identification text corresponding to a current speech segment needs to be segmented;
a feature extraction unit 405, configured to extract a segmentation feature of the recognition text corresponding to each speech segment;
a training unit 406, configured to use the segmentation features and the segmentation information as training data to construct a segmentation model.
It should be noted that, in the training of the segmentation model, the segmentation model may be trained based on the acoustic segmentation features (i.e., the first segmentation features mentioned above) alone, or based on the semantic segmentation features (i.e., the second segmentation features mentioned above) alone, or based on both the acoustic segmentation features and the semantic segmentation features. Accordingly, when the feature extraction unit 405 extracts the segmentation features of the recognized text corresponding to each speech segment, only the segmentation features based on acoustics or the segmentation features based on semantics may be extracted, or the segmentation features based on acoustics and the segmentation features based on semantics may be extracted at the same time, which is not limited in the embodiment of the present invention.
In addition, the output of the segmentation model may be whether the ending position of the recognized text corresponding to the current speech segment needs to be segmented, or may be the probability that the ending position of the recognized text corresponding to the current speech segment needs to be segmented. Of course, the output of different types of parameters does not affect the training process of the segmented model, and only different input and output parameters need to be set during model training.
The speech recognition text segmentation apparatus provided by the invention obtains speech segments by performing endpoint detection on voice data and obtains each segment's recognized text by speech recognition. It then extracts the segmentation features of the recognized text corresponding to each segment, performs segmentation detection on the recognized text of the voice data using the extracted features and a pre-constructed segmentation model, and segments the recognized text according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that the user can rapidly understand the content of the recognized text and read more efficiently.
Further, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Fig. 5 is a schematic diagram of another structure of the speech recognition text segmenting device according to the embodiment of the present invention.
Unlike fig. 3, in this embodiment, the apparatus further includes:
a first display module 501, configured to display the segmented recognition text to a user.
Fig. 6 is a schematic diagram of another structure of the speech recognition text segmenting device according to the embodiment of the present invention.
Unlike fig. 3, in this embodiment, the apparatus further includes:
a topic extraction module 601, configured to extract a topic of each segmented paragraph identification text;
a second presentation module 602, configured to present each topic to a user;
the perceiving module 603 is configured to sense a topic the user is interested in and, when such a topic is sensed, trigger the second display module 602 to display to the user the recognition text of the paragraph corresponding to that topic.
The voice recognition text segmentation device provided by the invention can display the segmented recognition text to the user in various ways, not only can display the recognition text with clear chapter structures to the user, but also can help the user to quickly find the content of interest of the user, and further improves the reading efficiency.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention has been described herein with reference to particular embodiments, but the above embodiments are merely intended to facilitate an understanding of the methods and apparatuses of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (20)
1. A method for speech recognition text segmentation, comprising:
carrying out end point detection on the voice data to obtain each voice segment and a starting frame number and an ending frame number of each voice segment;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented, wherein the segmentation features are used for determining whether the end position of the recognition text corresponding to each voice segment is a segmentation boundary, and the segmentation refers to the division of a paragraph structure;
and segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
2. The method of claim 1, further comprising constructing a segmentation model by:
collecting voice data;
carrying out end point detection on the collected voice data to obtain each voice section;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
marking the segmentation information of the identification text corresponding to each voice segment, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
and constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
3. The method according to claim 1, wherein the extracting segmentation features of the recognition text corresponding to each voice segment comprises:
extracting segmentation features of each voice segment from the acoustics of the voice data, and using them as first segmentation features of the recognition text corresponding to the voice segment; and/or
extracting segmentation features from the semantics of the recognition text, and using them as second segmentation features of the recognition text.
4. The method of claim 3, wherein the first segmentation features comprise the duration of the current voice segment, and further comprise the distance between the current voice segment and the previous voice segment and/or the distance between the current voice segment and the next voice segment;
the extracting segmentation features of each voice segment from the acoustics of the voice data comprises:
calculating the difference between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference as the duration of the current voice segment;
and further comprises:
calculating the difference between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference as the distance between the current voice segment and the previous voice segment; and/or
calculating the difference between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference as the distance between the current voice segment and the next voice segment.
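The frame-number arithmetic of claim 4 is simple enough to state directly. A minimal sketch, assuming each voice segment is represented as a `(start_frame, end_frame)` pair produced by endpoint detection; the segment values in the usage example are hypothetical:

```python
# Acoustic features of claim 4: a segment's duration is its ending frame
# number minus its starting frame number; the distance to a neighbouring
# segment is the gap in frames between them.

def acoustic_features(segments, index):
    """Duration of segment `index` and frame gaps to its neighbours."""
    start, end = segments[index]
    feats = {"duration": end - start}
    if index > 0:                        # distance to the previous segment
        feats["gap_prev"] = start - segments[index - 1][1]
    if index < len(segments) - 1:        # distance to the next segment
        feats["gap_next"] = segments[index + 1][0] - end
    return feats
```

For example, with `segments = [(0, 120), (150, 400), (420, 500)]`, `acoustic_features(segments, 1)` gives a duration of 250 frames, a 30-frame gap to the previous segment, and a 20-frame gap to the next.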
5. The method of claim 4, wherein the first segmentation features further comprise: whether the speaker of the current voice segment is the same as the speaker of the previous voice segment, and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment;
the extracting segmentation features of each voice segment from the acoustics of the voice data further comprises:
performing speaker change point detection on the voice data by using a speaker separation technology;
and determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
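One plausible reading of how a change-point list yields the same-speaker flags of claim 5: two adjacent segments share a speaker exactly when no detected change point falls in or between them. This interpretation, and the frame values in the test, are assumptions for illustration; the claim does not fix the mapping.

```python
# Sketch of claim 5's speaker features, assuming speaker separation has
# produced a list of change-point frame indices. Two segments are taken
# to share a speaker iff no change point lies between the end of the
# earlier segment and the start of the later one (inclusive).

def same_speaker(seg_a, seg_b, change_points):
    """seg_a/seg_b: (start_frame, end_frame) pairs, seg_a before seg_b."""
    lo, hi = seg_a[1], seg_b[0]
    return not any(lo <= cp <= hi for cp in change_points)
</antml_garbage>```

With segments `(0, 120)`, `(150, 400)`, `(420, 500)` and a single change point at frame 410, the first two segments are attributed to the same speaker while the last two are not.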
6. The method of claim 3, wherein the second segmentation features comprise any one or more of:
the number of forward unsegmented sentences, i.e., the total number of sentences contained in the recognition texts between the previous segmentation mark and the start position of the recognition text corresponding to the current voice segment;
the number of backward unsegmented sentences, i.e., the total number of sentences contained in all recognition texts after the recognition text corresponding to the current voice segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the previous voice segment;
and the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the next voice segment.
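Two of these semantic features lend themselves to a short sketch: counting sentences in a segment's recognition text, and a similarity score between adjacent segments. The claims do not specify the similarity measure; a bag-of-words cosine similarity is used here purely as an illustration, and splitting sentences on terminal punctuation is a simplification that presumes punctuation has already been added (claim 7).

```python
# Illustrative versions of claim 6's sentence-count and similarity
# features. Sentence boundaries are approximated by terminal punctuation;
# similarity is cosine similarity over whitespace-token counts.
import math
import re
from collections import Counter

def sentence_count(text):
    """Number of sentences, split on Western or CJK terminal punctuation."""
    return len([s for s in re.split(r"[.!?\u3002\uff01\uff1f]+", text) if s.strip()])

def similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

A high similarity between neighbouring segments suggests they discuss the same topic and argues against placing a paragraph boundary between them, which is presumably why the feature is useful to the model.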
7. The method of claim 3, wherein the extracting segmentation features from the semantics of the recognition text comprises:
correcting the recognition text corresponding to the voice data, wherein the correction comprises: adding punctuation to the recognition text corresponding to the voice data;
and extracting segmentation features from the semantics of the corrected recognition text.
8. The method of claim 7, wherein the correction further comprises any one or more of:
filtering abnormal words from the recognition text corresponding to the voice data;
performing smoothing on the recognition text corresponding to the voice data;
performing number normalization on the recognition text corresponding to the voice data;
and performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognition text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognition text corresponding to the voice data with special symbols.
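The text-replacement step of claim 8 is concrete enough to sketch. The sensitive-word list, the `***` masking symbol, and the whitespace tokenization are all illustrative assumptions; the claim only requires that case be converted and sensitive words be replaced with special symbols.

```python
# Sketch of claim 8's text replacement: convert English letter case and
# mask sensitive words with a special symbol. SENSITIVE and "***" are
# hypothetical example choices, not part of the claim.

SENSITIVE = {"password", "secret"}

def replace_text(text, to_upper=True):
    """Case-convert `text`, then mask any sensitive word with '***'."""
    text = text.upper() if to_upper else text.lower()
    words = text.split()
    masked = ["***" if w.lower() in SENSITIVE else w for w in words]
    return " ".join(masked)
```

For example, `replace_text("my secret plan")` yields `"MY *** PLAN"`.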
9. The method according to any one of claims 1 to 8, wherein the performing segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented, comprises:
sequentially inputting, in units of voice segments, the segmentation features of the recognition text corresponding to each voice segment into the segmentation model for segmentation detection, and determining whether the end position of the recognition text corresponding to each voice segment needs to be segmented.
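The sequential detection of claim 9 reduces to a loop over segments. The `predict(features) -> bool` interface of the model object is a hypothetical assumption made for the sketch; the claims do not name the model's API or type.

```python
# Sketch of claim 9: feed each voice segment's features into the trained
# segmentation model in order, collecting one boundary decision per
# segment. `model.predict` is an assumed interface returning a truthy
# value when the segment's end position should start a new paragraph.

def detect_boundaries(segment_features, model):
    """Return, per segment, whether its end position is a paragraph boundary."""
    return [bool(model.predict(f)) for f in segment_features]
```

A trivial stub model that fires on a large gap to the next segment shows the shape of the call: `detect_boundaries([{"gap_next": 30}, {"gap_next": 200}], stub)` gives `[False, True]` when the stub's threshold is 100 frames.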
10. The method according to any one of claims 1 to 8, further comprising:
displaying the segmented recognition text to a user; or
extracting a topic for each segmented paragraph of the recognition text, and displaying the topics to the user;
and when a topic of interest to the user is detected, displaying the recognition text of the paragraph corresponding to that topic to the user.
11. A speech recognition text segmentation apparatus, comprising:
the endpoint detection module is used for performing endpoint detection on the voice data to obtain voice segments and the starting frame number and ending frame number of each voice segment;
the voice recognition module is used for performing voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the feature extraction module is used for extracting segmentation features of the recognition text corresponding to each voice segment;
the segmentation detection module is used for performing segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented, wherein the segmentation features are used for determining whether the end position of the recognition text corresponding to each voice segment is a segmentation boundary, and segmentation refers to the division of paragraph structure;
and the segmentation module is used for segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
12. The apparatus of claim 11, further comprising a segmentation model construction module for constructing the segmentation model; the segmentation model construction module comprises:
a data collection unit for collecting voice data;
the endpoint detection unit is used for performing endpoint detection on the voice data collected by the data collection unit to obtain voice segments;
the voice recognition unit is used for performing voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the marking unit is used for marking segmentation information for the recognition text corresponding to each voice segment, wherein the segmentation information indicates whether the end position of the recognition text corresponding to the current voice segment needs to be segmented;
the feature extraction unit is used for extracting segmentation features of the recognition text corresponding to each voice segment;
and the training unit is used for constructing the segmentation model by using the segmentation features and the segmentation information as training data.
13. The apparatus of claim 11, wherein the feature extraction module comprises:
a first feature extraction module, used for extracting segmentation features of each voice segment from the acoustics of the voice data and using them as first segmentation features of the recognition text corresponding to the voice segment; and/or
a second feature extraction module, used for extracting segmentation features from the semantics of the recognition text and using them as second segmentation features of the recognition text.
14. The apparatus of claim 13, wherein the first feature extraction module comprises:
a duration calculation unit, used for calculating the difference between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference as the duration of the current voice segment;
and a distance calculation unit, used for calculating the difference between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference as the distance between the current voice segment and the previous voice segment; and/or calculating the difference between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference as the distance between the current voice segment and the next voice segment.
15. The apparatus of claim 14, wherein the first feature extraction module further comprises:
a speaker change point detection unit, used for performing speaker change point detection on the voice data by using a speaker separation technology;
and a speaker determination unit, used for determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
16. The apparatus of claim 13, wherein the second segmentation features comprise any one or more of:
the number of forward unsegmented sentences, i.e., the total number of sentences contained in the recognition texts between the previous segmentation mark and the start position of the recognition text corresponding to the current voice segment;
the number of backward unsegmented sentences, i.e., the total number of sentences contained in all recognition texts after the recognition text corresponding to the current voice segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the previous voice segment;
and the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the next voice segment.
17. The apparatus of claim 13, wherein the second feature extraction module comprises:
a correction unit, used for correcting the recognition text corresponding to the voice data, the correction unit comprising: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the voice data;
and a feature extraction unit, used for extracting segmentation features from the semantics of the corrected recognition text.
18. The apparatus according to claim 17, wherein the correction unit further comprises any one or more of the following sub-units:
a filtering subunit, used for filtering abnormal words from the recognition text corresponding to the voice data;
a smoothing subunit, used for performing smoothing on the recognition text corresponding to the voice data;
a normalization subunit, used for performing number normalization on the recognition text corresponding to the voice data;
and a text replacement subunit, used for performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognition text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognition text corresponding to the voice data with special symbols.
19. The apparatus of any one of claims 11 to 18,
the segmentation detection module is specifically configured to sequentially input, in units of voice segments, the segmentation features of the recognition text corresponding to each voice segment into the segmentation model for segmentation detection, and to determine whether the end position of the recognition text corresponding to each voice segment needs to be segmented.
20. The apparatus of any one of claims 11 to 18, further comprising:
a first display module, used for displaying the segmented recognition text to a user; or
a topic extraction module, used for extracting a topic for each segmented paragraph of the recognition text;
a second display module, used for displaying the topics to the user;
and a perception module, used for, upon detecting a topic of interest to the user, triggering the second display module to display the recognition text of the paragraph corresponding to that topic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610256898.8A CN107305541B (en) | 2016-04-20 | 2016-04-20 | Method and device for segmenting speech recognition text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107305541A CN107305541A (en) | 2017-10-31 |
CN107305541B true CN107305541B (en) | 2021-05-04 |
Family
ID=60150228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610256898.8A Active CN107305541B (en) | 2016-04-20 | 2016-04-20 | Method and device for segmenting speech recognition text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107305541B (en) |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090051A (en) * | 2017-12-20 | 2018-05-29 | 深圳市沃特沃德股份有限公司 | The interpretation method and translator of continuous long voice document |
CN108363765B (en) * | 2018-02-06 | 2020-12-08 | 深圳市鹰硕技术有限公司 | Audio paragraph identification method and device |
CN108446389B (en) * | 2018-03-22 | 2021-12-24 | 平安科技(深圳)有限公司 | Voice message search display method and device, computer equipment and storage medium |
CN108364650B (en) * | 2018-04-18 | 2024-01-19 | 北京声智科技有限公司 | Device and method for adjusting voice recognition result |
CN108830639B (en) * | 2018-05-17 | 2022-04-26 | 科大讯飞股份有限公司 | Content data processing method and device, and computer readable storage medium |
CN110503943B (en) * | 2018-05-17 | 2023-09-19 | 蔚来(安徽)控股有限公司 | Voice interaction method and voice interaction system |
CN109344411A (en) * | 2018-09-19 | 2019-02-15 | 深圳市合言信息科技有限公司 | A kind of interpretation method for listening to formula simultaneous interpretation automatically |
CN109361823A (en) * | 2018-11-01 | 2019-02-19 | 深圳市号互联科技有限公司 | A kind of intelligent interaction mode that voice is mutually converted with text |
CN109743589B (en) * | 2018-12-26 | 2021-12-14 | 百度在线网络技术(北京)有限公司 | Article generation method and device |
CN110083645A (en) | 2019-05-06 | 2019-08-02 | 浙江核新同花顺网络信息股份有限公司 | A kind of system and method for report generation |
CN110264997A (en) * | 2019-05-30 | 2019-09-20 | 北京百度网讯科技有限公司 | The method, apparatus and storage medium of voice punctuate |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110399489B (en) * | 2019-07-08 | 2022-06-17 | 厦门市美亚柏科信息股份有限公司 | Chat data segmentation method, device and storage medium |
CN110502631B (en) * | 2019-07-17 | 2022-11-04 | 招联消费金融有限公司 | Input information response method and device, computer equipment and storage medium |
CN110619897A (en) * | 2019-08-02 | 2019-12-27 | 精电有限公司 | Conference summary generation method and vehicle-mounted recording system |
CN110588524B (en) * | 2019-08-02 | 2021-01-01 | 精电有限公司 | Information display method and vehicle-mounted auxiliary display system |
CN110827825A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Punctuation prediction method, system, terminal and storage medium for speech recognition text |
CN111079384B (en) * | 2019-11-18 | 2023-05-02 | 佰聆数据股份有限公司 | Identification method and system for forbidden language of intelligent quality inspection service |
WO2021109000A1 (en) * | 2019-12-03 | 2021-06-10 | 深圳市欢太科技有限公司 | Data processing method and apparatus, electronic device, and storage medium |
CN113041623B (en) * | 2019-12-26 | 2023-04-07 | 波克科技股份有限公司 | Game parameter configuration method and device and computer readable storage medium |
CN111862980A (en) * | 2020-08-07 | 2020-10-30 | 斑马网络技术有限公司 | Incremental semantic processing method |
CN112036128A (en) * | 2020-08-21 | 2020-12-04 | 百度在线网络技术(北京)有限公司 | Text content processing method, device, equipment and storage medium |
CN111931482B (en) * | 2020-09-22 | 2021-09-24 | 思必驰科技股份有限公司 | Text segmentation method and device |
CN112712794A (en) * | 2020-12-25 | 2021-04-27 | 苏州思必驰信息科技有限公司 | Speech recognition marking training combined system and device |
CN112818077B (en) * | 2020-12-31 | 2023-05-30 | 科大讯飞股份有限公司 | Text processing method, device, equipment and storage medium |
CN112733660B (en) * | 2020-12-31 | 2022-05-27 | 蚂蚁胜信(上海)信息技术有限公司 | Method and device for splitting video strip |
CN112699687A (en) * | 2021-01-07 | 2021-04-23 | 北京声智科技有限公司 | Content cataloging method and device and electronic equipment |
CN113076720B (en) * | 2021-04-29 | 2022-01-28 | 新声科技(深圳)有限公司 | Long text segmentation method and device, storage medium and electronic device |
CN114841171B (en) * | 2022-04-29 | 2023-04-28 | 北京思源智通科技有限责任公司 | Text segmentation theme extraction method, system, readable medium and equipment |
CN117113974B (en) * | 2023-04-26 | 2024-05-24 | 荣耀终端有限公司 | Text segmentation method, device, chip, electronic equipment and medium |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6279017B1 (en) * | 1996-08-07 | 2001-08-21 | Randall C. Walker | Method and apparatus for displaying text based upon attributes found within the text |
US20040006748A1 (en) * | 2002-07-03 | 2004-01-08 | Amit Srivastava | Systems and methods for providing online event tracking |
US8849648B1 (en) * | 2002-12-24 | 2014-09-30 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
ATE518193T1 (en) * | 2003-05-28 | 2011-08-15 | Loquendo Spa | AUTOMATIC SEGMENTATION OF TEXT WITH UNITS WITHOUT SEPARATORS |
JP2007512609A (en) * | 2003-11-21 | 2007-05-17 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Text segmentation and topic annotation for document structuring |
US8577684B2 (en) * | 2005-07-13 | 2013-11-05 | Intellisist, Inc. | Selective security masking within recorded speech utilizing speech recognition techniques |
US20100169318A1 (en) * | 2008-12-30 | 2010-07-01 | Microsoft Corporation | Contextual representations from data streams |
CN103150294A (en) * | 2011-12-06 | 2013-06-12 | 盛乐信息技术(上海)有限公司 | Method and system for correcting based on voice identification results |
CN103164399A (en) * | 2013-02-26 | 2013-06-19 | 北京捷通华声语音技术有限公司 | Punctuation addition method and device in speech recognition |
CN103345922B (en) * | 2013-07-05 | 2016-07-06 | 张巍 | A kind of large-length voice full-automatic segmentation method |
CN103488723B (en) * | 2013-09-13 | 2016-11-09 | 复旦大学 | A kind of method and system of electronic reading semantic coverage interested self-navigation |
CN105244029B (en) * | 2015-08-28 | 2019-02-26 | 安徽科大讯飞医疗信息技术有限公司 | Voice recognition post-processing method and system |
CN105427858B (en) * | 2015-11-06 | 2019-09-03 | 科大讯飞股份有限公司 | Realize the method and system that voice is classified automatically |
- 2016-04-20: CN application CN201610256898.8A filed; granted as CN107305541B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107305541B (en) | Method and device for segmenting speech recognition text | |
CN110717031B (en) | Intelligent conference summary generation method and system | |
CN106878632B (en) | Video data processing method and device | |
US9230547B2 (en) | Metadata extraction of non-transcribed video and audio streams | |
US10114809B2 (en) | Method and apparatus for phonetically annotating text | |
JP5343861B2 (en) | Text segmentation apparatus, text segmentation method and program | |
CN107562760B (en) | Voice data processing method and device | |
CN104598644B (en) | Favorite label mining method and device | |
CN111723791A (en) | Character error correction method, device, equipment and storage medium | |
CN112784696B (en) | Lip language identification method, device, equipment and storage medium based on image identification | |
CN109801628B (en) | Corpus collection method, apparatus and system | |
JP2006190006A5 (en) | ||
CN111128223A (en) | Text information-based auxiliary speaker separation method and related device | |
US20150019206A1 (en) | Metadata extraction of non-transcribed video and audio streams | |
CN108305618B (en) | Voice acquisition and search method, intelligent pen, search terminal and storage medium | |
CN111341305A (en) | Audio data labeling method, device and system | |
CN109033060B (en) | Information alignment method, device, equipment and readable storage medium | |
CN112818680B (en) | Corpus processing method and device, electronic equipment and computer readable storage medium | |
CN111951825A (en) | Pronunciation evaluation method, medium, device and computing equipment | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
US20240064383A1 (en) | Method and Apparatus for Generating Video Corpus, and Related Device | |
JP2012194245A (en) | Speech recognition device, speech recognition method and speech recognition program | |
CN113283327A (en) | Video text generation method, device, equipment and storage medium | |
CN111881297A (en) | Method and device for correcting voice recognition text | |
CN113838460A (en) | Video voice recognition method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||