CN107305541B - Method and device for segmenting speech recognition text - Google Patents
- Publication number
- CN107305541B (application CN201610256898.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/211 — Handling natural language data; natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30 — Handling natural language data; semantic analysis
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/04 — Speech recognition; segmentation; word boundary detection
Abstract
The invention discloses a method and a device for segmenting speech recognition text. The method comprises: performing endpoint detection on voice data to obtain each speech segment together with its starting and ending frame numbers; performing speech recognition on each speech segment to obtain the recognized text corresponding to it; extracting segmentation features from the recognized text corresponding to each speech segment; performing segmentation detection on the recognized text corresponding to the voice data using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented; and segmenting the recognized text corresponding to the voice data according to the segmentation detection result. The invention segments the recognized text automatically, giving it a clearer chapter structure.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a method and a device for segmenting a speech recognition text.
Background
With the development of voice technology, automatic speech recognition has been widely applied in many areas of life, and converting speech into text greatly facilitates people's daily needs: a conference recording can be converted into text and sent to participants as the meeting minutes, and a reporter's interview recording can be converted into text and then edited into a news article. However, the recognized text produced by speech recognition lacks the clear chapter structure of manually edited text, such as division into paragraphs. As a result, users viewing the recognized text often find it difficult to locate its emphasis or theme; when the text is long and involves multiple topics, it is even harder for users to grasp its chapter structure and accurately find the content of each topic. How to display the recognized text clearly, so as to help users understand its content, is therefore very important for presenting speech recognition results.
In the prior art, the recognized text of voice data is generally displayed to the user directly, without any processing of the recognition result; alternatively, the chapter structure of the recognized text is adjusted manually before display, for example by dividing the text into paragraphs according to its content. When the recognized text is long, manual adjustment is laborious, inefficient, and time-consuming, making it difficult for a recognition system to be practical.
Disclosure of Invention
The invention provides a method and a device for segmenting speech recognition text, so as to solve the prior-art problems of the heavy workload and low efficiency of manually adjusting the chapter structure of recognized text.
Therefore, the invention provides the following technical scheme:
a speech recognition text segmentation method comprising:
carrying out end point detection on the voice data to obtain each voice segment and a starting frame number and an ending frame number of each voice segment;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented;
and segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Preferably, the method further comprises constructing the segment model in the following manner:
collecting voice data;
carrying out end point detection on the collected voice data to obtain each voice section;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
marking the segmentation information of the identification text corresponding to each voice segment, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
and constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
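The patent does not fix a particular classifier for the segmentation model. As an illustration only — not the patented implementation — the training step above can be sketched with a minimal numpy logistic-regression segmenter over hypothetical two-dimensional features (pause length before the segment, and sentence count):

```python
import numpy as np

def train_segmentation_model(features, labels, lr=0.5, epochs=500):
    """Train a minimal logistic-regression segmenter.

    features: (n_samples, n_dims) array of per-segment segmentation
              features (e.g. pause length, sentence count).
    labels:   0/1 array -- 1 means "a paragraph break follows this segment".
    Stands in for the unspecified classifier in the patent.
    """
    X = np.hstack([features, np.ones((len(features), 1))])   # add bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid
        w -= lr * X.T @ (p - labels) / len(X)   # batch gradient step
    return w

def predict_segment_prob(w, feature_vec):
    x = np.append(feature_vec, 1.0)
    return 1.0 / (1.0 + np.exp(-x @ w))

# Toy labeled data: a long pause (first dim) correlates with a new paragraph.
X = np.array([[0.1, 2], [0.2, 3], [2.5, 8], [3.0, 9], [0.3, 1], [2.8, 7]], float)
y = np.array([0, 0, 1, 1, 0, 1])
w = train_segmentation_model(X, y)
print(predict_segment_prob(w, [2.9, 8]) > 0.5)
```

The feature layout and hyperparameters here are assumptions for the sketch; a real system would use the richer acoustic and semantic features described below.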
Preferably, the extracting the segmentation features of the recognized text corresponding to each speech segment includes:
extracting the segmentation characteristics of each voice segment from the acoustics of the voice data, and taking the segmentation characteristics as the first segmentation characteristics of the recognition text corresponding to the voice segment; and/or
And extracting a segmentation feature from the semantics of the recognition text, and using the segmentation feature as a second segmentation feature of the recognition text.
Preferably, the first segmentation feature comprises the duration of the current speech segment, and further comprises: the distance between the current voice section and the previous voice section, and/or the distance between the current voice section and the next voice section;
the acoustically extracting the segmentation features of the speech segments from the speech data comprises:
calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment;
further comprising:
calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or
And calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
Preferably, the first segmentation feature further comprises: whether the speaker of the current voice section is the same as the speaker of the previous voice section, and/or whether the speaker of the current voice section is the same as the speaker of the next voice section;
the acoustically extracting the segmentation features of the speech segments from the speech data further comprises:
carrying out speaker change point detection on the voice data by using a speaker separation technology;
and determining whether the speaker in the current voice section is the same as the speaker in the previous voice section according to the speaker change point detection result and/or determining whether the speaker in the current voice section is the same as the speaker in the next voice section according to the speaker change point detection result.
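Assuming a speaker-separation front end has already produced change-point frame numbers (the patent does not prescribe one), the same-speaker flags above can be derived as in this sketch with hypothetical names:

```python
def speaker_change_features(segment_bounds, change_points):
    """For each speech segment, flag whether the speaker matches the
    previous / next segment, given speaker change points (frame numbers)
    from an assumed speaker-separation front end.

    segment_bounds: list of (start_frame, end_frame) per speech segment.
    A change point falling in the gap between two segments means their
    speakers differ; None marks a missing neighbor.
    """
    feats = []
    for i, (start, end) in enumerate(segment_bounds):
        same_prev = same_next = None
        if i > 0:
            prev_end = segment_bounds[i - 1][1]
            same_prev = not any(prev_end <= cp <= start for cp in change_points)
        if i < len(segment_bounds) - 1:
            next_start = segment_bounds[i + 1][0]
            same_next = not any(end <= cp <= next_start for cp in change_points)
        feats.append((same_prev, same_next))
    return feats

bounds = [(0, 100), (120, 300), (320, 500)]
print(speaker_change_features(bounds, change_points=[310]))
# → [(None, True), (True, False), (False, None)]
```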
Preferably, the second segmentation feature comprises any one or more of:
the forward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmented mark;
the backward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts after the recognized text corresponding to the current speech segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the previous voice section;
and the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
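The patent does not fix a similarity measure for the last two features above; one common realization is cosine similarity between bag-of-words vectors of adjacent recognized texts — a sketch, not the patented computation:

```python
from collections import Counter
import math

def text_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two recognized
    texts; a plausible stand-in for the similarity feature."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(round(text_similarity("the meeting budget plan", "budget plan review"), 3))
```

A low similarity between the current segment's text and its neighbor suggests a topic change, and hence a candidate paragraph break.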
Preferably, the semantically extracting the segmentation features from the recognized text comprises:
and correcting the recognition text corresponding to the voice data, wherein the correction comprises: adding punctuation to the recognition text corresponding to the voice data;
segmentation features are extracted from the semantics of the modified recognized text.
Preferably, the correction further comprises any one or more of:
filtering abnormal words of the recognition text corresponding to the voice data;
performing smooth processing on the recognition text corresponding to the voice data;
carrying out digital normalization on the recognition text corresponding to the voice data;
performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
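A toy sketch of the correction operations above (filler filtering, digit normalization, case conversion, sensitive-word masking) on English text; the word lists are hypothetical, and punctuation restoration, which needs its own model, is omitted:

```python
# Hypothetical filler / sensitive-word lists for illustration only.
FILLERS = {"um", "uh", "er"}
SENSITIVE = {"secret"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def correct_text(text):
    """Sketch of the correction pipeline applied to a recognized text."""
    words = []
    for w in text.split():
        lw = w.lower()
        if lw in FILLERS:
            continue                      # abnormal-word filtering / smoothing
        if lw in SENSITIVE:
            words.append("***")           # sensitive-word replacement
        elif lw in DIGITS:
            words.append(DIGITS[lw])      # digit normalization
        else:
            words.append(w.upper())       # lowercase-to-uppercase conversion
    return " ".join(words)

print(correct_text("um room three is uh secret"))  # → "ROOM 3 IS ***"
```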
Preferably, the step of detecting the segmentation of the recognized text corresponding to the speech data by using the extracted segmentation features and a pre-constructed segmentation model to determine the position to be segmented includes:
and sequentially inputting the segmentation characteristics of the recognition texts corresponding to the speech segments into the segmentation model for segmentation detection by taking the speech segments as units, and determining whether the end positions of the recognition texts corresponding to the speech segments need to be segmented or not.
Preferably, the method further comprises:
displaying the segmented recognition text to a user; or
Extracting the topics of the segmented paragraph recognition texts, and displaying the topics to a user;
when a topic which is interesting to the user is sensed, the identification text of the paragraph corresponding to the topic is displayed to the user.
A speech recognition text segmentation apparatus comprising:
the end point detection module is used for carrying out end point detection on the voice data to obtain each voice section and the starting frame number and the ending frame number of each voice section;
the voice recognition module is used for carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
the feature extraction module is used for extracting the segmented features of the identification texts corresponding to the voice segments;
the segmentation detection module is used for carrying out segmentation detection on the recognition text corresponding to the voice data by utilizing the extracted segmentation characteristics and a pre-constructed segmentation model so as to determine the position to be segmented;
and the segmentation module is used for segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Preferably, the apparatus further comprises a segment model construction module for constructing a segment model; the segmentation model building module comprises:
a data collection unit for collecting voice data;
the end point detection unit is used for carrying out end point detection on the voice data collected by the data collection unit to obtain each voice section;
the voice recognition unit is used for carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
the marking unit is used for marking the segmentation information of the identification text corresponding to each voice segment, and the segmentation information is used for indicating whether the end position of the identification text corresponding to the current voice segment needs to be segmented or not;
the characteristic extraction unit is used for extracting the segmented characteristics of the identification texts corresponding to the voice segments;
and the training unit is used for constructing a segmentation model by taking the segmentation characteristics and the segmentation information as training data.
Preferably, the feature extraction module includes:
the first feature extraction module is used for extracting the segmented features of the voice segments from the acoustics of the voice data and taking the segmented features as the first segmented features of the recognition texts corresponding to the voice segments; and/or
And the second feature extraction module is used for extracting segmentation features from the semantics of the recognition text and taking the segmentation features as second segmentation features of the recognition text.
Preferably, the first feature extraction module includes:
the time length calculating unit is used for calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment;
the distance calculation unit is used for calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
Preferably, the first feature extraction module further comprises:
the speaker change point detection unit is used for detecting the speaker change points of the voice data by using a speaker separation technology;
and the speaker determining unit is used for determining whether the speaker of the current voice section is the same as the speaker of the previous voice section according to the speaker change point detection result and/or determining whether the speaker of the current voice section is the same as the speaker of the next voice section according to the speaker change point detection result.
Preferably, the second segmentation feature comprises any one or more of:
the forward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmented mark;
the backward non-segmented sentence number refers to the total number of sentences contained in all the recognized texts after the recognized text corresponding to the current speech segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the previous voice section;
and the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
Preferably, the second feature extraction module includes:
a correction unit configured to correct a recognition text corresponding to the speech data, the correction unit including: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the voice data;
and the feature extraction unit is used for extracting segmentation features from the semanteme of the corrected recognition text.
Preferably, the correction unit further comprises any one or more of the following sub-units:
the filtering subunit is used for filtering abnormal words of the recognition text corresponding to the voice data;
a smoothing processing subunit, configured to perform smoothing processing on the recognition text corresponding to the voice data;
the normalization subunit is used for carrying out digital normalization on the recognition text corresponding to the voice data;
a text replacement subunit, configured to perform text replacement on the recognition text corresponding to the voice data, where the text replacement includes: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
Preferably, the segmentation detection module is specifically configured to, with a speech segment as a unit, sequentially input the segmentation characteristics of the recognized text corresponding to each speech segment into the segmentation model for segmentation detection, and determine whether the end position of the recognized text corresponding to each speech segment needs to be segmented.
Preferably, the apparatus further comprises:
the first display module is used for displaying the segmented identification texts to a user; or
The topic extraction module is used for extracting the topic of each paragraph identification text after segmentation;
the second display module is used for displaying all the themes to the user;
and the perception module is used for perceiving the topic which is interested by the user and triggering the second display module to display the identification text of the paragraph corresponding to the topic to the user when perceiving the topic which is interested by the user.
The invention provides a method and a device for segmenting speech recognition text. Endpoint detection is performed on voice data to obtain each speech segment, and speech recognition on each segment yields its corresponding recognized text. Segmentation features are then extracted from the recognized text of each speech segment, and segmentation detection is performed on the recognized text of the voice data using these features together with a pre-constructed segmentation model, so as to determine the positions to be segmented; the recognized text is segmented according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that users can quickly understand the content of the recognized text and read it more efficiently.
Further, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Furthermore, all segmented identification texts can be displayed for a user, or the theme of each segment of identification text is extracted, the theme of each segment is displayed for the user, and when the user needs to check an interested segment, the content of the segment is displayed, so that the user can quickly find the interested content.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for speech recognition text segmentation in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of building a segmentation model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a speech recognition text segmentation apparatus;
FIG. 4 is a schematic structural diagram of a segment module building block in an embodiment of the present invention;
FIG. 5 is a schematic diagram of another structure of a speech recognition text segmentation apparatus according to an embodiment of the present invention;
fig. 6 is another structural diagram of the speech recognition text segmenting device according to the embodiment of the invention.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and specific implementations.
As shown in fig. 1, it is a flowchart of a speech recognition text segmentation method according to an embodiment of the present invention, including the following steps:
And 101, carrying out end point detection on the voice data to obtain each voice segment and the starting frame number and the ending frame number of each voice segment.

The voice data can be obtained from recordings in the actual application, such as meeting recordings and interview recordings.

End point detection means finding the starting point and the ending point of each speech segment in a given speech signal. Any existing end point detection method may be used; the embodiment of the present invention is not limited in this respect.
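Since the patent leaves the endpoint-detection method open, the following sketch merely illustrates what it produces — start and end frame numbers per speech segment — by thresholding frame energy; real systems use far more robust voice-activity detection:

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_thresh=0.01):
    """Minimal energy-threshold endpoint detection: returns
    (start_frame, end_frame) pairs for runs of frames whose mean
    energy exceeds the threshold. Illustration only."""
    n_frames = len(signal) // frame_len
    energies = [np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > energy_thresh and start is None:
            start = i                       # speech onset
        elif e <= energy_thresh and start is not None:
            segments.append((start, i))     # speech offset
            start = None
    if start is not None:
        segments.append((start, n_frames))
    return segments

# Synthetic signal: silence, a short "speech" burst, silence.
sig = np.zeros(160 * 30)
sig[160 * 5:160 * 12] = 0.5
print(detect_endpoints(sig))  # → [(5, 12)]
```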
And 102, performing voice recognition on each voice section to obtain a recognition text corresponding to each voice section.
Specifically, feature extraction may be performed on each speech segment, for example extracting MFCC (Mel-Frequency Cepstral Coefficient) features; then a decoding operation is carried out using the extracted features together with pre-trained acoustic and language models; finally, the recognized text corresponding to each speech segment is obtained from the decoding result. The specific speech recognition process is the same as in the prior art and is not detailed here.
And 103, extracting the segmented characteristics of the recognition texts corresponding to the voice segments.
In practical application, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, or, of course, the two segmentation features extracted based on different layers may be integrated, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
And 104, carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented.
Specifically, with the voice segments as units, the segmentation characteristics of the recognized text corresponding to each voice segment are input into the segmentation model for segmentation detection, and whether the ending position of the recognized text corresponding to each voice segment needs to be segmented is determined.
It should be noted that, in practical applications, the output of the segmentation model may be whether the end position of the recognized text corresponding to the current speech segment needs to be segmented, or may be a probability that the end position of the recognized text corresponding to the current speech segment needs to be segmented. Of course, the output of different types of parameters does not affect the training process of the segmented model, and only different input and output parameters need to be set during model training. The specific training process of the segmented model will be described in detail later.
If the output of the segmentation model is a probability, a corresponding threshold may be preset; when the probability exceeds the threshold, it is determined that the end position of the recognized text corresponding to the current speech segment needs to be segmented.
And 105, segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
Specifically, a segmentation mark may be added at an end position of the recognition text that needs to be segmented, so that the recognition text corresponding to the voice data may be conveniently displayed in a segmented manner according to the segmentation mark during displaying.
It should be noted that, in practical application, the segmentation detection may be performed on the recognition text every time a segmentation feature of the recognition text corresponding to a speech segment is extracted; or, after the segmented features of the recognized texts corresponding to all the speech segments are extracted, the segmented features of the recognized texts corresponding to each speech segment are sequentially input into the segmentation model for segment detection by taking the speech segment as a unit, which is not limited in the embodiment of the present invention.
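Assuming the model emits per-segment probabilities as described in step 104, turning them into break positions is a one-line thresholding pass; the 0.5 threshold is an assumed value, since the patent only says a threshold may be preset:

```python
def detect_segment_positions(probabilities, threshold=0.5):
    """Return indices of speech segments after whose recognized text a
    paragraph break should be placed, given per-segment break
    probabilities from the segmentation model."""
    return [i for i, p in enumerate(probabilities) if p > threshold]

# Hypothetical model outputs for six speech segments.
print(detect_segment_positions([0.1, 0.8, 0.2, 0.3, 0.9, 0.4]))  # → [1, 4]
```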
In another embodiment of the method of the present invention, the method may further comprise the step of presenting the segmented recognized text to a user. During specific display, the texts belonging to the same segment can be placed in a paragraph according to the segmentation marks in the recognized texts, and the texts in different segments are displayed in a segmented manner.
For example, the recognized text is A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, where Ai denotes the recognized text corresponding to one speech segment. If segmentation detection determines that breaks are required after A2 and A5, the display takes the following form:
A1,A2
A3,A4,A5
A6,A7,A8,A9,A10
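The grouping above can be reproduced with a small helper that folds per-segment texts into paragraphs at the detected segmentation marks (the names here are illustrative, not from the patent):

```python
def render_paragraphs(texts, break_after):
    """Group per-segment recognized texts into display paragraphs.

    texts:       recognized text per speech segment, in order.
    break_after: set of segment indices after which a paragraph ends.
    """
    paragraphs, current = [], []
    for i, text in enumerate(texts):
        current.append(text)
        if i in break_after:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)   # trailing paragraph without a mark
    return paragraphs

texts = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"]
for para in render_paragraphs(texts, break_after={1, 4}):  # breaks after A2, A5
    print(",".join(para))
```

Running this prints the three paragraphs exactly as in the example above.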
in another embodiment of the method of the present invention, the method may further comprise the following steps:
extracting the topics of the segmented paragraph recognition texts, and displaying the topics to a user;
when a topic which is interesting to the user is sensed, the identification text of the paragraph corresponding to the topic is displayed to the user.
There are various ways for the user to select the interested topic, for example, clicking or swiping the corresponding topic, or giving a corresponding serial number to each topic, and the user inputting the corresponding serial number through a keyboard.
As mentioned above, in the embodiment of the present invention the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, or the two kinds of features extracted at these different levels may be combined: the segmentation feature of each speech segment extracted acoustically from the voice data serves as the first segmentation feature of the recognized text corresponding to that speech segment, and the segmentation feature extracted semantically from the recognized text serves as its second segmentation feature. Accordingly, the segmentation model may be trained on the acoustic segmentation features alone, on the semantic segmentation features alone, or on both together.
The segmentation features at these two different levels are described in detail below.
1. Segmentation features extracted acoustically from the voice data, i.e. the first segmentation features described earlier.
In practical applications, the first segmentation features may include the duration of the current speech segment, and may further include the distance between the current speech segment and the previous speech segment and/or the distance between the current speech segment and the subsequent speech segment.
Further, the first segmentation feature may further include: whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
These several segmentation features are described in detail below.
a) Duration of current speech segment
The duration of a speech segment may be represented by the number of frames it contains. The difference between the ending frame number and the starting frame number of the current speech segment is therefore calculated and used as the duration of the current speech segment.
b) Distance between current speech segment and previous speech segment
The distance between the current speech segment and the previous speech segment may be represented using a difference between a starting frame number of the current speech segment and an ending frame number of the previous speech segment. Therefore, the difference between the starting frame number of the current speech segment and the ending frame number of the previous speech segment is calculated and used as the distance between the current speech segment and the previous speech segment.
It should be noted that, when the current speech segment is the first speech segment, the distance between the current speech segment and the previous speech segment is 0.
c) Distance between current speech segment and subsequent speech segment
Similarly, the distance between the current speech segment and the subsequent speech segment may be represented using the difference between the start frame number of the subsequent speech segment and the end frame number of the current speech segment. Therefore, the difference between the starting frame number of the next speech segment and the ending frame number of the current speech segment is calculated and used as the distance between the current speech segment and the next speech segment.
It should be noted that, when the current speech segment is the last speech segment, the distance between the current speech segment and the next speech segment is 0.
d) Whether the speaker of the current speech segment is the same as the speaker of the previous speech segment
e) Whether the speaker of the current speech segment is the same as the speaker of the next speech segment
Whether the speakers of adjacent speech segments are the same can be detected by using speaker-separation technology to find the speaker change points in the voice data; according to the change-point detection result, it is determined whether the speaker of the current speech segment is the same as the speaker of the previous speech segment and/or the next speech segment.
The specific method for detecting a speaker change point, i.e. the position where one speaker finishes speaking and another begins, is the same as in the prior art and is not described in detail here.
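Features a) through e) can be sketched together as follows. The data layout (one dict per speech segment with start/end frame numbers from endpoint detection and a speaker label from speaker separation) and all names are assumed for illustration only:

```python
def acoustic_features(segments, i):
    """First segmentation features of segment i, per features a)-e)."""
    seg = segments[i]
    duration = seg["end"] - seg["start"]           # a) duration in frames
    # b) distance to previous segment (0 for the first segment)
    dist_prev = seg["start"] - segments[i - 1]["end"] if i > 0 else 0
    # c) distance to next segment (0 for the last segment)
    dist_next = segments[i + 1]["start"] - seg["end"] if i + 1 < len(segments) else 0
    # d)/e) same-speaker flags derived from change-point detection results
    same_prev = int(i > 0 and seg["spk"] == segments[i - 1]["spk"])
    same_next = int(i + 1 < len(segments) and seg["spk"] == segments[i + 1]["spk"])
    return [duration, dist_prev, dist_next, same_prev, same_next]

segs = [{"start": 0, "end": 120, "spk": "S1"},
        {"start": 150, "end": 400, "spk": "S1"},
        {"start": 430, "end": 500, "spk": "S2"}]
print(acoustic_features(segs, 1))   # [250, 30, 30, 1, 0]
```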
2. Segmentation features extracted from the semantics of the recognized text, i.e. the second segmentation features described earlier.
In practical applications, the second segmentation features may include any one or more of the following:
a) the number of the forward non-segmented sentences refers to the total number of sentences contained in all the recognized texts from the starting position of the recognized text corresponding to the current speech segment to the last segmentation mark.
The position of the last segmentation mark is obtained from the segmentation marks already present in the recognized text preceding the text of the current speech segment; those marks, in turn, come from the segmentation detection results of the recognized texts corresponding to earlier speech segments.
It should be noted that, in practical applications, if the second segmentation features include the number of forward non-segmented sentences, segmentation detection must be performed segment by segment: the segmentation features of the recognized text corresponding to one speech segment are extracted and detected before moving to the next, since this feature depends on the segmentation marks already placed.
In addition, it should be noted that, if the recognized text corresponding to the current speech segment is the beginning of the recognized text, the number of forward non-segmented sentences is 0.
b) The number of backward non-segmented sentences refers to the total number of sentences contained in all recognized text after the recognized text corresponding to the current speech segment. It can be obtained by counting the sentences that follow that recognized text.
It should be noted that, if the recognized text corresponding to the current speech segment is the end of all recognized texts, the number of backward non-segmented sentences is 0.
c) The number of sentences contained in the recognized text corresponding to the current speech segment.
Specifically, punctuations in the recognized text corresponding to the current speech segment can be directly analyzed to obtain the corresponding sentence number.
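Features a) through c) can be sketched as below. The sentence-final punctuation set, the whitespace-free counting, and the flag convention (seg_flags[j] = 1 if a segmentation mark follows segment j) are simplifying assumptions for illustration:

```python
import re

SENTENCE_END = "。！？.!?"   # assumed sentence-final punctuation marks

def count_sentences(text):
    """c) number of sentences, found by splitting on terminal punctuation."""
    parts = re.split("[" + re.escape(SENTENCE_END) + "]", text)
    return len([p for p in parts if p.strip()])

def forward_backward_counts(texts, seg_flags, i):
    """a)/b) forward and backward non-segmented sentence counts for segment i."""
    start = 0                            # position just after the last mark
    for j in range(i - 1, -1, -1):
        if seg_flags[j]:
            start = j + 1
            break
    forward = sum(count_sentences(t) for t in texts[start:i])
    backward = sum(count_sentences(t) for t in texts[i + 1:])
    return forward, backward

texts = ["Hello there. How are you?", "Fine.", "Great. Bye."]
flags = [0, 0, 0]
print(forward_backward_counts(texts, flags, 1))   # (2, 2)
```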
d) And the similarity between the recognition text corresponding to the current speech section and the recognition text corresponding to the previous speech section.
Similarity is generally measured by the distance or angle between vectors; for example, the cosine of the angle between the two recognized-text vectors can be calculated, and the smaller the angle, the higher their similarity. Word vectorization is prior art and is not described in detail here.
In order to eliminate the interference of stop words in the text-similarity computation, the stop words contained in the recognized texts of the current and previous speech segments may first be deleted. Stop words are words, symbols, or garbled characters that appear frequently in recognized text but carry no practical meaning, such as "this", "and", "is"; they can be located by looking them up in a pre-constructed stop-word list. The words remaining after deletion are then vectorized, and the word vectors within each text are combined to obtain a recognized-text vector for the current speech segment and one for the previous speech segment, whose similarity is then calculated.
It should be noted that, when the recognized text corresponding to the current speech segment is the beginning of all recognized texts, this similarity is 0.
e) And the similarity between the recognition text corresponding to the current voice section and the recognition text corresponding to the next voice section.
Similar to the previous similarity calculation method, i.e. after deleting the stop word in the recognized text, the recognized text is vectorized and then the similarity is calculated.
It should be noted that, when the recognized text corresponding to the current speech segment is the end of all recognized texts, this similarity is 0.
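Features d) and e) can be sketched as a bag-of-words cosine similarity after stop-word removal. The stop-word list and the whitespace tokenization are simplifying assumptions (the patent's word vectorization is left open as prior art):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "this"}   # assumed stop-word list

def text_vector(text):
    """Bag-of-words vector of a recognized text with stop words removed."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def cosine_similarity(v1, v2):
    common = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in common)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

cur = text_vector("the meeting budget is approved")
prev = text_vector("budget approved for this meeting")
print(round(cosine_similarity(cur, prev), 3))   # 0.866
```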
It should be noted that before extracting the second segmentation feature, the recognition text corresponding to the voice data needs to be modified, and then the second segmentation feature is extracted from the semantic of the modified recognition text.
The correction of the recognized text mainly comprises adding punctuation to it, that is, inserting the appropriate punctuation marks, for example based on a conditional random field model. In order to make the added punctuation more accurate, different thresholds can be set for adding punctuation between speech segments and within a speech segment: setting a smaller threshold between segments and a larger threshold within a segment increases the probability of adding punctuation at segment boundaries and reduces it inside segments. The punctuated text then consists of sentences separated by punctuation marks such as commas, periods, and question marks.
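The two-threshold idea can be sketched as follows. The threshold values and the notion of a per-position punctuation probability (as a punctuation model such as a CRF would emit) are assumptions for illustration:

```python
BETWEEN_SEGMENT_THRESHOLD = 0.3   # assumed: smaller, so punctuation is added more readily
WITHIN_SEGMENT_THRESHOLD = 0.6    # assumed: larger, so punctuation is added less readily

def add_punct(prob, at_segment_boundary):
    """Decide whether to add punctuation at a candidate position.

    prob: model probability that punctuation belongs here;
    at_segment_boundary: True if the position lies between two speech segments.
    """
    threshold = BETWEEN_SEGMENT_THRESHOLD if at_segment_boundary else WITHIN_SEGMENT_THRESHOLD
    return prob >= threshold

# the same probability passes the between-segment threshold but not the within-segment one
print(add_punct(0.4, at_segment_boundary=True))    # True
print(add_punct(0.4, at_segment_boundary=False))   # False
```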
Secondly, the modification may further include any one or more of the following:
(1) and filtering abnormal words of the recognition text corresponding to the voice data.
Text filtering mainly removes abnormal words that were recognized in error; words can be filtered according to word confidence and the result of syntactic analysis.
(2) And performing smooth processing on the recognition text corresponding to the voice data.
Text smoothing mainly consists of smoothing disfluent sentences. Of a repeated word or phrase that adds no meaning, such as "very good, very good", only one instance is kept; and filler words with no practical meaning, such as "uh" or "um", can be ignored rather than transcribed.
(3) And carrying out digital normalization on the recognition text corresponding to the voice data.
All numbers in recognized text produced by speech recognition are written out as Chinese numerals, but some should be shown as Arabic numerals to match readers' habits; for example, "twenty-one point five yuan" should be rendered as "21.5 yuan". Number normalization converts such Chinese numerals into Arabic numerals, for example using an ABNF-grammar-based method.
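A minimal sketch of this conversion is shown below. A production system would use a full grammar (e.g. the ABNF-based method mentioned above); this illustration handles only integers up to the thousands plus a decimal part:

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def chinese_to_arabic(text):
    """Convert a simple Chinese numeral string to an Arabic number."""
    if "点" in text:                        # "点" marks the decimal point
        whole, frac = text.split("点", 1)
        frac_digits = "".join(str(DIGITS[c]) for c in frac)
        return float(str(chinese_to_arabic(whole)) + "." + frac_digits)
    total, num = 0, 0
    for ch in text:
        if ch in DIGITS:
            num = DIGITS[ch]
        elif ch in UNITS:
            total += (num if num else 1) * UNITS[ch]   # bare "十" means 10
            num = 0
    return total + num

print(chinese_to_arabic("二十一点五"))   # 21.5, i.e. "twenty-one point five"
```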
(4) Performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises two conditions:
One case is replacement between upper and lower case in English, namely converting lowercase English letters in the recognized text corresponding to the voice data into uppercase letters, or vice versa, such as replacing "nba" with "NBA";
In the other case, sensitive words in the recognized text corresponding to the voice data are replaced with special symbols so that they are hidden. For the replacement, a sensitive-word list can be built and traversed to check whether any sensitive word appears in the recognized text; if one does, it is replaced with a special symbol. For example, if a word with violent connotations such as "robbery" is on the list, each occurrence of "robbery" in the text is replaced with a symbol such as "*".
It should be noted that either or both of the above text-replacement cases may be applied according to the needs of the actual application; the embodiment of the present invention is not limited in this respect.
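Both replacement cases can be sketched together. The case table and sensitive-word list below are hypothetical examples, not part of the patent:

```python
CASE_TABLE = {"nba": "NBA", "usb": "USB"}    # assumed known-term case table
SENSITIVE_WORDS = {"robbery"}                # assumed sensitive-word list

def replace_text(text):
    """Apply case replacement, then mask sensitive words with '*'."""
    out = []
    for w in text.split():
        w = CASE_TABLE.get(w.lower(), w)     # case 1: upper/lower-case replacement
        if w.lower() in SENSITIVE_WORDS:     # case 2: sensitive-word masking
            w = "*" * len(w)
        out.append(w)
    return " ".join(out)

print(replace_text("the nba game and the robbery report"))
# prints: the NBA game and the ******* report
```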
As shown in fig. 2, it is a flowchart of constructing a segmentation model in the embodiment of the present invention, and includes the following steps:
And 203, performing voice recognition on each voice section to obtain a recognition text corresponding to each voice section.
And 204, marking the segmentation information of the identification texts corresponding to the voice segments, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not.
For example, if segmented, it is labeled 1, otherwise it is labeled 0. Of course, other symbols may be used, and the embodiments of the present invention are not limited.
And step 206, constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
The segmentation model may adopt a model commonly used in pattern recognition, such as a Bayesian model or a support vector machine. During training, the segmentation features of the recognized text serve as the model input and the annotated segmentation information as the model output; model training then yields the segmentation model. The specific training process is the same as in the prior art and is not described in detail here. It should be noted that the segmentation model can be obtained by offline training.
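Training can be sketched as follows. The patent leaves the model family open (Bayesian, SVM, etc.); for a self-contained illustration this uses a tiny hand-rolled logistic regression, and the feature vectors and 0/1 labels are toy stand-ins for the extracted segmentation features and the annotated segmentation information:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit a logistic-regression segmenter by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = place a segmentation mark after this segment, 0 = do not."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# toy features: [normalized duration, dist_prev, dist_next, same_speaker_prev]
X = [[0.5, 0.1, 0.1, 1], [0.2, 0.9, 0.0, 0], [0.6, 0.05, 0.2, 1], [0.3, 0.8, 0.1, 0]]
y = [0, 1, 0, 1]        # annotated segmentation information
w, b = train_logreg(X, y)
print(predict(w, b, [0.25, 0.85, 0.05, 0]))   # 1: far from previous segment, new speaker
```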
It should be noted that, when the segmentation model training is performed, the segmentation model may be trained based on the acoustic segmentation features alone, or based on the semantic segmentation features alone, or based on the acoustic segmentation features and the semantic segmentation features. Accordingly, in the step 205, when the segmentation features of the recognized text corresponding to each speech segment are extracted, only the segmentation features based on acoustics or the segmentation features based on semantics may be extracted, or the segmentation features based on acoustics and the segmentation features based on semantics may be extracted at the same time, which is not limited in the embodiment of the present invention.
In addition, it should be noted that, when the segmentation model trained based on the different types of segmentation features is used to perform segmentation detection on the text to be displayed and recognized, the corresponding type of segmentation features of the text to be displayed and recognized need to be extracted, and the extracted segmentation features are input into the segmentation model to determine the position of the text to be displayed and recognized, which needs to be segmented.
According to the speech recognition text segmentation method provided by the invention, endpoint detection is performed on voice data to obtain speech segments, and speech recognition is performed on each segment to obtain its recognized text. Segmentation features of the recognized text corresponding to each segment are then extracted, segmentation detection is performed on the recognized text of the voice data using the extracted features and a pre-constructed segmentation model, and the recognized text is segmented according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that the user can rapidly understand the content of the recognized text and read more efficiently.
In practical application, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may also be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Furthermore, the speech recognition text segmentation method provided by the invention can display the segmented recognized text to the user, or extract a topic for each segment and display the topics to the user, showing a paragraph's content when the user wishes to view a paragraph of interest. This helps the user quickly find the content of interest.
Correspondingly, an embodiment of the present invention further provides a speech recognition text segmentation apparatus, as shown in fig. 3, which is a schematic structural diagram of the apparatus.
In this embodiment, the apparatus comprises:
an endpoint detection module 301, configured to perform endpoint detection on the voice data to obtain each voice segment and a start frame number and an end frame number of each voice segment;
the voice recognition module 302 is configured to perform voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the feature extraction module 303 is configured to extract a segmentation feature of the recognition text corresponding to each speech segment;
a segmentation detection module 304, configured to perform segmentation detection on the recognition text corresponding to the speech data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine a position to be segmented; specifically, with the voice segments as units, sequentially inputting the segmentation characteristics of the recognition texts corresponding to the voice segments into the segmentation model for segmentation detection, and determining whether the end positions of the recognition texts corresponding to the voice segments need to be segmented;
and a segmenting module 305, configured to segment the recognition text corresponding to the voice data according to a segmentation detection result.
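The detection loop inside module 304 can be sketched as below: feature vectors are fed into the segmentation model one speech segment at a time, yielding a 0/1 decision for the end position of each segment's recognized text. The function name, the toy features, and the stand-in model are all hypothetical:

```python
def detect_segments(feature_vectors, model_predict):
    """Sequentially run segmentation detection, one speech segment at a time.

    model_predict: callable mapping one feature vector to 0 or 1.
    """
    return [model_predict(fv) for fv in feature_vectors]

# stand-in model: segment whenever the distance to the previous segment
# (here, feature index 1) exceeds a threshold
flags = detect_segments(
    [[0.5, 0.1], [0.2, 0.9], [0.6, 0.05]],
    lambda fv: int(fv[1] > 0.5),
)
print(flags)   # [0, 1, 0]
```

The resulting flags are exactly what the segmenting module 305 consumes when splitting the recognized text into paragraphs.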
It should be noted that, in practical applications, the feature extraction module 303 may extract the segmentation features from the acoustics of the speech data or the semantics of the recognized text, or may extract the segmentation features based on different layers by combining the two types of features. Accordingly, the feature extraction module 303 may include: a first feature extraction module and/or a second feature extraction module. Wherein:
the first feature extraction module is used for extracting the segmented features of the voice segments from the acoustics of the voice data and taking the segmented features as the first segmented features of the recognition texts corresponding to the voice segments;
and the second feature extraction module is used for extracting segmentation features from the semantics of the recognition text and taking the segmentation features as second segmentation features of the recognition text.
Wherein an embodiment of the first feature extraction module comprises: a duration calculation unit and a distance calculation unit; another embodiment of the first feature extraction module may further comprise: a speaker change point detection unit and a speaker determination unit. These units will be described separately below.
The time length calculating unit is used for calculating the difference value between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference value as the time length of the current voice segment.
The distance calculating unit is used for calculating the difference value between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference value as the distance between the current voice segment and the previous voice segment; and/or calculating the difference value between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference value as the distance between the current voice segment and the next voice segment.
The speaker change point detection unit is used for detecting the speaker change point of the voice data by using a speaker separation technology.
The speaker determining unit is used for determining whether the speaker of the current voice section is the same as the speaker of the previous voice section according to the detection result of the speaker change point and/or determining whether the speaker of the current voice section is the same as the speaker of the next voice section according to the detection result of the speaker change point.
One embodiment of the second feature extraction module comprises:
a correction unit configured to correct a recognition text corresponding to the speech data, the correction unit including: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the speech data, for example, add punctuation to the recognition text corresponding to the speech data based on a conditional random field model;
and the feature extraction unit is used for extracting segmentation features from the semanteme of the corrected recognition text.
The second segmentation features extracted by the feature extraction unit may include any one or more of the following: the number of forward non-segmented sentences, the number of backward non-segmented sentences, the number of sentences contained in the identification text corresponding to the current voice segment, the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the previous voice segment, and the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the next voice segment.
In practical applications, the modification unit may further include any one or more of the following sub-units:
the filtering subunit is used for filtering abnormal words of the recognition text corresponding to the voice data;
a smoothing processing subunit, configured to perform smoothing processing on the recognition text corresponding to the voice data;
the normalization subunit is used for carrying out digital normalization on the recognition text corresponding to the voice data;
a text replacement subunit, configured to perform text replacement on the recognition text corresponding to the voice data, where the text replacement includes: converting English lowercase letters in the recognized text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognized text corresponding to the voice data with special symbols.
In the embodiment of the present invention, the segmentation model may be constructed offline by a corresponding segmentation model construction module, which may be independent from the speech recognition text segmentation apparatus of the present invention or integrated with it; the embodiment of the present invention is not limited in this respect.
Fig. 4 is a schematic structural diagram of a segmentation model building module in the embodiment of the present invention, including:
a data collection unit 401 for collecting voice data;
an endpoint detection unit 402, configured to perform endpoint detection on the voice data collected by the data collection unit to obtain each voice segment;
a voice recognition unit 403, configured to perform voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
a labeling unit 404, configured to label segmentation information of the identification text corresponding to each speech segment, where the segmentation information is used to indicate whether an end position of the identification text corresponding to a current speech segment needs to be segmented;
a feature extraction unit 405, configured to extract a segmentation feature of the recognition text corresponding to each speech segment;
a training unit 406, configured to use the segmentation features and the segmentation information as training data to construct a segmentation model.
It should be noted that, in the training of the segmentation model, the segmentation model may be trained based on the acoustic segmentation features (i.e., the first segmentation features mentioned above) alone, or based on the semantic segmentation features (i.e., the second segmentation features mentioned above) alone, or based on both the acoustic segmentation features and the semantic segmentation features. Accordingly, when the feature extraction unit 405 extracts the segmentation features of the recognized text corresponding to each speech segment, only the segmentation features based on acoustics or the segmentation features based on semantics may be extracted, or the segmentation features based on acoustics and the segmentation features based on semantics may be extracted at the same time, which is not limited in the embodiment of the present invention.
In addition, the output of the segmentation model may be whether the ending position of the recognized text corresponding to the current speech segment needs to be segmented, or may be the probability that the ending position of the recognized text corresponding to the current speech segment needs to be segmented. Of course, the output of different types of parameters does not affect the training process of the segmented model, and only different input and output parameters need to be set during model training.
The speech recognition text segmentation apparatus provided by the invention obtains speech segments by performing endpoint detection on voice data and obtains each segment's recognized text by speech recognition. It then extracts the segmentation features of the recognized text corresponding to each segment, performs segmentation detection on the recognized text of the voice data using the extracted features and a pre-constructed segmentation model, and segments the recognized text according to the detection result. The chapter structure of the recognized text can thus be adjusted automatically and made clearer, so that the user can rapidly understand the content of the recognized text and read more efficiently.
Further, the segmentation features may be extracted acoustically from the voice data or semantically from the recognized text, and of course, the two segmentation features extracted based on different layers may be combined, and the corresponding segmentation model is used to perform segmentation detection on the recognized text corresponding to the voice data, so as to determine the position to be segmented, thereby further improving the accuracy of segmentation.
Fig. 5 is a schematic diagram of another structure of the speech recognition text segmenting device according to the embodiment of the present invention.
Unlike fig. 3, in this embodiment, the apparatus further includes:
a first display module 501, configured to display the segmented recognition text to a user.
Fig. 6 is a schematic diagram of another structure of the speech recognition text segmenting device according to the embodiment of the present invention.
Unlike fig. 3, in this embodiment, the apparatus further includes:
a topic extraction module 601, configured to extract a topic of each segmented paragraph identification text;
a second presentation module 602, configured to present each topic to a user;
the perceiving module 603 is configured to sense a topic the user is interested in and, when such a topic is sensed, trigger the second display module 602 to display to the user the recognition text of the paragraph corresponding to that topic.
The voice recognition text segmentation device provided by the invention can display the segmented recognition text to the user in various ways, not only can display the recognition text with clear chapter structures to the user, but also can help the user to quickly find the content of interest of the user, and further improves the reading efficiency.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention has been described herein with reference to particular embodiments, but the above embodiments are merely intended to facilitate an understanding of the methods and apparatuses of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (20)
1. A method for speech recognition text segmentation, comprising:
carrying out end point detection on the voice data to obtain each voice segment and a starting frame number and an ending frame number of each voice segment;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
carrying out segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model so as to determine the position to be segmented, wherein the segmentation features are used for determining whether the end position of the recognition text corresponding to each voice segment is a segmentation boundary, and the segmentation refers to the division of a paragraph structure;
and segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
2. The method of claim 1, further comprising constructing a segmentation model by:
collecting voice data;
carrying out end point detection on the collected voice data to obtain each voice section;
carrying out voice recognition on each voice section to obtain a recognition text corresponding to each voice section;
marking the segmentation information of the identification text corresponding to each voice segment, wherein the segmentation information is used for indicating whether the ending position of the identification text corresponding to the current voice segment needs to be segmented or not;
extracting the segmented characteristics of the recognition texts corresponding to the voice segments;
and constructing a segmentation model by using the segmentation characteristics and the segmentation information as training data.
3. The method according to claim 1, wherein the extracting segmentation features of the recognition text corresponding to each voice segment comprises:
extracting segmentation features of each voice segment from the acoustics of the voice data, and using them as first segmentation features of the recognition text corresponding to the voice segment; and/or
extracting segmentation features from the semantics of the recognition text, and using them as second segmentation features of the recognition text.
4. The method of claim 3, wherein the first segmentation features comprise the duration of the current voice segment, and further comprise the distance between the current voice segment and the previous voice segment and/or the distance between the current voice segment and the next voice segment;
the extracting segmentation features of each voice segment from the acoustics of the voice data comprises:
calculating the difference between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference as the duration of the current voice segment;
and further comprises:
calculating the difference between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference as the distance between the current voice segment and the previous voice segment; and/or
calculating the difference between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference as the distance between the current voice segment and the next voice segment.
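The frame-number arithmetic of claim 4 is simple enough to state directly. A minimal sketch, assuming each voice segment is represented as a `(start_frame, end_frame)` pair produced by endpoint detection; the segment values in the usage example are hypothetical:

```python
# Acoustic features of claim 4: a segment's duration is its ending frame
# number minus its starting frame number; the distance to a neighbouring
# segment is the gap in frames between them.

def acoustic_features(segments, index):
    """Duration of segment `index` and frame gaps to its neighbours."""
    start, end = segments[index]
    feats = {"duration": end - start}
    if index > 0:                        # distance to the previous segment
        feats["gap_prev"] = start - segments[index - 1][1]
    if index < len(segments) - 1:        # distance to the next segment
        feats["gap_next"] = segments[index + 1][0] - end
    return feats
```

For example, with `segments = [(0, 120), (150, 400), (420, 500)]`, `acoustic_features(segments, 1)` gives a duration of 250 frames, a 30-frame gap to the previous segment, and a 20-frame gap to the next.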
5. The method of claim 4, wherein the first segmentation features further comprise: whether the speaker of the current voice segment is the same as the speaker of the previous voice segment, and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment;
the extracting segmentation features of each voice segment from the acoustics of the voice data further comprises:
performing speaker change point detection on the voice data by using a speaker separation technology;
and determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
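One plausible reading of how a change-point list yields the same-speaker flags of claim 5: two adjacent segments share a speaker exactly when no detected change point falls in or between them. This interpretation, and the frame values in the test, are assumptions for illustration; the claim does not fix the mapping.

```python
# Sketch of claim 5's speaker features, assuming speaker separation has
# produced a list of change-point frame indices. Two segments are taken
# to share a speaker iff no change point lies between the end of the
# earlier segment and the start of the later one (inclusive).

def same_speaker(seg_a, seg_b, change_points):
    """seg_a/seg_b: (start_frame, end_frame) pairs, seg_a before seg_b."""
    lo, hi = seg_a[1], seg_b[0]
    return not any(lo <= cp <= hi for cp in change_points)
</antml_garbage>```

With segments `(0, 120)`, `(150, 400)`, `(420, 500)` and a single change point at frame 410, the first two segments are attributed to the same speaker while the last two are not.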
6. The method of claim 3, wherein the second segmentation features comprise any one or more of:
the number of forward unsegmented sentences, i.e., the total number of sentences contained in the recognition texts between the previous segmentation mark and the start position of the recognition text corresponding to the current voice segment;
the number of backward unsegmented sentences, i.e., the total number of sentences contained in all recognition texts after the recognition text corresponding to the current voice segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the previous voice segment;
and the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the next voice segment.
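Two of these semantic features lend themselves to a short sketch: counting sentences in a segment's recognition text, and a similarity score between adjacent segments. The claims do not specify the similarity measure; a bag-of-words cosine similarity is used here purely as an illustration, and splitting sentences on terminal punctuation is a simplification that presumes punctuation has already been added (claim 7).

```python
# Illustrative versions of claim 6's sentence-count and similarity
# features. Sentence boundaries are approximated by terminal punctuation;
# similarity is cosine similarity over whitespace-token counts.
import math
import re
from collections import Counter

def sentence_count(text):
    """Number of sentences, split on Western or CJK terminal punctuation."""
    return len([s for s in re.split(r"[.!?\u3002\uff01\uff1f]+", text) if s.strip()])

def similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

A high similarity between neighbouring segments suggests they discuss the same topic and argues against placing a paragraph boundary between them, which is presumably why the feature is useful to the model.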
7. The method of claim 3, wherein the extracting segmentation features from the semantics of the recognition text comprises:
correcting the recognition text corresponding to the voice data, wherein the correction comprises: adding punctuation to the recognition text corresponding to the voice data;
and extracting segmentation features from the semantics of the corrected recognition text.
8. The method of claim 7, wherein the correction further comprises any one or more of:
filtering abnormal words from the recognition text corresponding to the voice data;
performing smoothing on the recognition text corresponding to the voice data;
performing number normalization on the recognition text corresponding to the voice data;
and performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognition text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognition text corresponding to the voice data with special symbols.
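The text-replacement step of claim 8 is concrete enough to sketch. The sensitive-word list, the `***` masking symbol, and the whitespace tokenization are all illustrative assumptions; the claim only requires that case be converted and sensitive words be replaced with special symbols.

```python
# Sketch of claim 8's text replacement: convert English letter case and
# mask sensitive words with a special symbol. SENSITIVE and "***" are
# hypothetical example choices, not part of the claim.

SENSITIVE = {"password", "secret"}

def replace_text(text, to_upper=True):
    """Case-convert `text`, then mask any sensitive word with '***'."""
    text = text.upper() if to_upper else text.lower()
    words = text.split()
    masked = ["***" if w.lower() in SENSITIVE else w for w in words]
    return " ".join(masked)
```

For example, `replace_text("my secret plan")` yields `"MY *** PLAN"`.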
9. The method according to any one of claims 1 to 8, wherein the performing segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented, comprises:
sequentially inputting, in units of voice segments, the segmentation features of the recognition text corresponding to each voice segment into the segmentation model for segmentation detection, and determining whether the end position of the recognition text corresponding to each voice segment needs to be segmented.
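The sequential detection of claim 9 reduces to a loop over segments. The `predict(features) -> bool` interface of the model object is a hypothetical assumption made for the sketch; the claims do not name the model's API or type.

```python
# Sketch of claim 9: feed each voice segment's features into the trained
# segmentation model in order, collecting one boundary decision per
# segment. `model.predict` is an assumed interface returning a truthy
# value when the segment's end position should start a new paragraph.

def detect_boundaries(segment_features, model):
    """Return, per segment, whether its end position is a paragraph boundary."""
    return [bool(model.predict(f)) for f in segment_features]
```

A trivial stub model that fires on a large gap to the next segment shows the shape of the call: `detect_boundaries([{"gap_next": 30}, {"gap_next": 200}], stub)` gives `[False, True]` when the stub's threshold is 100 frames.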
10. The method according to any one of claims 1 to 8, further comprising:
displaying the segmented recognition text to a user; or
extracting a topic for each segmented paragraph of the recognition text, and displaying the topics to the user;
and when a topic of interest to the user is detected, displaying the recognition text of the paragraph corresponding to that topic to the user.
11. A speech recognition text segmentation apparatus, comprising:
the endpoint detection module is used for performing endpoint detection on the voice data to obtain voice segments and the starting frame number and ending frame number of each voice segment;
the voice recognition module is used for performing voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the feature extraction module is used for extracting segmentation features of the recognition text corresponding to each voice segment;
the segmentation detection module is used for performing segmentation detection on the recognition text corresponding to the voice data by using the extracted segmentation features and a pre-constructed segmentation model, so as to determine the positions to be segmented, wherein the segmentation features are used for determining whether the end position of the recognition text corresponding to each voice segment is a segmentation boundary, and segmentation refers to the division of paragraph structure;
and the segmentation module is used for segmenting the recognition text corresponding to the voice data according to the segmentation detection result.
12. The apparatus of claim 11, further comprising a segmentation model construction module for constructing the segmentation model; the segmentation model construction module comprises:
a data collection unit for collecting voice data;
the endpoint detection unit is used for performing endpoint detection on the voice data collected by the data collection unit to obtain voice segments;
the voice recognition unit is used for performing voice recognition on each voice segment to obtain a recognition text corresponding to each voice segment;
the marking unit is used for marking segmentation information for the recognition text corresponding to each voice segment, wherein the segmentation information indicates whether the end position of the recognition text corresponding to the current voice segment needs to be segmented;
the feature extraction unit is used for extracting segmentation features of the recognition text corresponding to each voice segment;
and the training unit is used for constructing the segmentation model by using the segmentation features and the segmentation information as training data.
13. The apparatus of claim 11, wherein the feature extraction module comprises:
a first feature extraction module, used for extracting segmentation features of each voice segment from the acoustics of the voice data and using them as first segmentation features of the recognition text corresponding to the voice segment; and/or
a second feature extraction module, used for extracting segmentation features from the semantics of the recognition text and using them as second segmentation features of the recognition text.
14. The apparatus of claim 13, wherein the first feature extraction module comprises:
a duration calculation unit, used for calculating the difference between the ending frame number of the current voice segment and the starting frame number of the current voice segment, and taking the difference as the duration of the current voice segment;
and a distance calculation unit, used for calculating the difference between the starting frame number of the current voice segment and the ending frame number of the previous voice segment, and taking the difference as the distance between the current voice segment and the previous voice segment; and/or calculating the difference between the starting frame number of the next voice segment and the ending frame number of the current voice segment, and taking the difference as the distance between the current voice segment and the next voice segment.
15. The apparatus of claim 14, wherein the first feature extraction module further comprises:
a speaker change point detection unit, used for performing speaker change point detection on the voice data by using a speaker separation technology;
and a speaker determination unit, used for determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
16. The apparatus of claim 13, wherein the second segmentation features comprise any one or more of:
the number of forward unsegmented sentences, i.e., the total number of sentences contained in the recognition texts between the previous segmentation mark and the start position of the recognition text corresponding to the current voice segment;
the number of backward unsegmented sentences, i.e., the total number of sentences contained in all recognition texts after the recognition text corresponding to the current voice segment;
the number of sentences contained in the recognition text corresponding to the current voice segment;
the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the previous voice segment;
and the similarity between the recognition text corresponding to the current voice segment and the recognition text corresponding to the next voice segment.
17. The apparatus of claim 13, wherein the second feature extraction module comprises:
a correction unit, used for correcting the recognition text corresponding to the voice data, the correction unit comprising: a punctuation adding subunit, configured to add punctuation to the recognition text corresponding to the voice data;
and a feature extraction unit, used for extracting segmentation features from the semantics of the corrected recognition text.
18. The apparatus according to claim 17, wherein the correction unit further comprises any one or more of the following sub-units:
a filtering subunit, used for filtering abnormal words from the recognition text corresponding to the voice data;
a smoothing subunit, used for performing smoothing on the recognition text corresponding to the voice data;
a normalization subunit, used for performing number normalization on the recognition text corresponding to the voice data;
and a text replacement subunit, used for performing text replacement on the recognition text corresponding to the voice data, wherein the text replacement comprises: converting English lowercase letters in the recognition text corresponding to the voice data into uppercase letters or vice versa; and/or replacing sensitive words in the recognition text corresponding to the voice data with special symbols.
19. The apparatus of any one of claims 11 to 18,
the segmentation detection module is specifically configured to sequentially input, in units of voice segments, the segmentation features of the recognition text corresponding to each voice segment into the segmentation model for segmentation detection, and to determine whether the end position of the recognition text corresponding to each voice segment needs to be segmented.
20. The apparatus of any one of claims 11 to 18, further comprising:
a first display module, used for displaying the segmented recognition text to a user; or
a topic extraction module, used for extracting a topic for each segmented paragraph of the recognition text;
a second display module, used for displaying the topics to the user;
and a perception module, used for, upon detecting a topic of interest to the user, triggering the second display module to display the recognition text of the paragraph corresponding to that topic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610256898.8A CN107305541B (en) | 2016-04-20 | 2016-04-20 | Method and device for segmenting speech recognition text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107305541A CN107305541A (en) | 2017-10-31 |
CN107305541B true CN107305541B (en) | 2021-05-04 |
Family
ID=60150228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610256898.8A Active CN107305541B (en) | 2016-04-20 | 2016-04-20 | Method and device for segmenting speech recognition text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107305541B (en) |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090051A (en) * | 2017-12-20 | 2018-05-29 | 深圳市沃特沃德股份有限公司 | The interpretation method and translator of continuous long voice document |
CN108363765B (en) * | 2018-02-06 | 2020-12-08 | 深圳市鹰硕技术有限公司 | Audio paragraph identification method and device |
CN108446389B (en) * | 2018-03-22 | 2021-12-24 | 平安科技(深圳)有限公司 | Voice message search display method and device, computer equipment and storage medium |
CN108364650B (en) * | 2018-04-18 | 2024-01-19 | 北京声智科技有限公司 | Device and method for adjusting voice recognition result |
CN108830639B (en) * | 2018-05-17 | 2022-04-26 | 科大讯飞股份有限公司 | Content data processing method and device, and computer readable storage medium |
CN110503943B (en) * | 2018-05-17 | 2023-09-19 | 蔚来(安徽)控股有限公司 | Voice interaction method and voice interaction system |
CN109344411A (en) * | 2018-09-19 | 2019-02-15 | 深圳市合言信息科技有限公司 | A kind of interpretation method for listening to formula simultaneous interpretation automatically |
CN109361823A (en) * | 2018-11-01 | 2019-02-19 | 深圳市号互联科技有限公司 | A kind of intelligent interaction mode that voice is mutually converted with text |
CN109743589B (en) * | 2018-12-26 | 2021-12-14 | 百度在线网络技术(北京)有限公司 | Article generation method and device |
CN110083645A (en) | 2019-05-06 | 2019-08-02 | 浙江核新同花顺网络信息股份有限公司 | A kind of system and method for report generation |
CN110264997A (en) * | 2019-05-30 | 2019-09-20 | 北京百度网讯科技有限公司 | The method, apparatus and storage medium of voice punctuate |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110399489B (en) * | 2019-07-08 | 2022-06-17 | 厦门市美亚柏科信息股份有限公司 | Chat data segmentation method, device and storage medium |
CN110502631B (en) * | 2019-07-17 | 2022-11-04 | 招联消费金融有限公司 | Input information response method and device, computer equipment and storage medium |
CN110619897A (en) * | 2019-08-02 | 2019-12-27 | 精电有限公司 | Conference summary generation method and vehicle-mounted recording system |
CN110588524B (en) * | 2019-08-02 | 2021-01-01 | 精电有限公司 | Information display method and vehicle-mounted auxiliary display system |
CN110827825A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Punctuation prediction method, system, terminal and storage medium for speech recognition text |
CN111079384B (en) * | 2019-11-18 | 2023-05-02 | 佰聆数据股份有限公司 | Identification method and system for forbidden language of intelligent quality inspection service |
WO2021109000A1 (en) * | 2019-12-03 | 2021-06-10 | 深圳市欢太科技有限公司 | Data processing method and apparatus, electronic device, and storage medium |
CN113041623B (en) * | 2019-12-26 | 2023-04-07 | 波克科技股份有限公司 | Game parameter configuration method and device and computer readable storage medium |
CN111862980A (en) * | 2020-08-07 | 2020-10-30 | 斑马网络技术有限公司 | Incremental semantic processing method |
CN112036128A (en) * | 2020-08-21 | 2020-12-04 | 百度在线网络技术(北京)有限公司 | Text content processing method, device, equipment and storage medium |
CN111931482B (en) * | 2020-09-22 | 2021-09-24 | 思必驰科技股份有限公司 | Text segmentation method and device |
CN112712794A (en) * | 2020-12-25 | 2021-04-27 | 苏州思必驰信息科技有限公司 | Speech recognition marking training combined system and device |
CN112818077B (en) * | 2020-12-31 | 2023-05-30 | 科大讯飞股份有限公司 | Text processing method, device, equipment and storage medium |
CN112733660B (en) * | 2020-12-31 | 2022-05-27 | 蚂蚁胜信(上海)信息技术有限公司 | Method and device for splitting video strip |
CN112699687A (en) * | 2021-01-07 | 2021-04-23 | 北京声智科技有限公司 | Content cataloging method and device and electronic equipment |
CN113076720B (en) * | 2021-04-29 | 2022-01-28 | 新声科技(深圳)有限公司 | Long text segmentation method and device, storage medium and electronic device |
CN114841171B (en) * | 2022-04-29 | 2023-04-28 | 北京思源智通科技有限责任公司 | Text segmentation theme extraction method, system, readable medium and equipment |
CN117113974B (en) * | 2023-04-26 | 2024-05-24 | 荣耀终端有限公司 | Text segmentation method, device, chip, electronic equipment and medium |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6279017B1 (en) * | 1996-08-07 | 2001-08-21 | Randall C. Walker | Method and apparatus for displaying text based upon attributes found within the text |
US20040006748A1 (en) * | 2002-07-03 | 2004-01-08 | Amit Srivastava | Systems and methods for providing online event tracking |
US8849648B1 (en) * | 2002-12-24 | 2014-09-30 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
ATE518193T1 (en) * | 2003-05-28 | 2011-08-15 | Loquendo Spa | AUTOMATIC SEGMENTATION OF TEXT WITH UNITS WITHOUT SEPARATORS |
JP2007512609A (en) * | 2003-11-21 | 2007-05-17 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Text segmentation and topic annotation for document structuring |
US8577684B2 (en) * | 2005-07-13 | 2013-11-05 | Intellisist, Inc. | Selective security masking within recorded speech utilizing speech recognition techniques |
US20100169318A1 (en) * | 2008-12-30 | 2010-07-01 | Microsoft Corporation | Contextual representations from data streams |
CN103150294A (en) * | 2011-12-06 | 2013-06-12 | 盛乐信息技术(上海)有限公司 | Method and system for correcting based on voice identification results |
CN103164399A (en) * | 2013-02-26 | 2013-06-19 | 北京捷通华声语音技术有限公司 | Punctuation addition method and device in speech recognition |
CN103345922B (en) * | 2013-07-05 | 2016-07-06 | 张巍 | A kind of large-length voice full-automatic segmentation method |
CN103488723B (en) * | 2013-09-13 | 2016-11-09 | 复旦大学 | A kind of method and system of electronic reading semantic coverage interested self-navigation |
CN105244029B (en) * | 2015-08-28 | 2019-02-26 | 安徽科大讯飞医疗信息技术有限公司 | Voice recognition post-processing method and system |
CN105427858B (en) * | 2015-11-06 | 2019-09-03 | 科大讯飞股份有限公司 | Realize the method and system that voice is classified automatically |
- 2016-04-20: CN application CN201610256898.8A filed; granted as CN107305541B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107305541B (en) | Method and device for segmenting speech recognition text | |
CN110717031B (en) | Intelligent conference summary generation method and system | |
CN106878632B (en) | Video data processing method and device | |
US9230547B2 (en) | Metadata extraction of non-transcribed video and audio streams | |
US10114809B2 (en) | Method and apparatus for phonetically annotating text | |
JP5343861B2 (en) | Text segmentation apparatus, text segmentation method and program | |
CN107562760B (en) | Voice data processing method and device | |
CN104598644B (en) | Favorite label mining method and device | |
CN111723791A (en) | Character error correction method, device, equipment and storage medium | |
CN112784696B (en) | Lip language identification method, device, equipment and storage medium based on image identification | |
CN109801628B (en) | Corpus collection method, apparatus and system | |
JP2006190006A5 (en) | ||
CN111128223A (en) | Text information-based auxiliary speaker separation method and related device | |
US20150019206A1 (en) | Metadata extraction of non-transcribed video and audio streams | |
CN108305618B (en) | Voice acquisition and search method, intelligent pen, search terminal and storage medium | |
CN111341305A (en) | Audio data labeling method, device and system | |
CN109033060B (en) | Information alignment method, device, equipment and readable storage medium | |
CN112818680B (en) | Corpus processing method and device, electronic equipment and computer readable storage medium | |
CN111951825A (en) | Pronunciation evaluation method, medium, device and computing equipment | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
US20240064383A1 (en) | Method and Apparatus for Generating Video Corpus, and Related Device | |
JP2012194245A (en) | Speech recognition device, speech recognition method and speech recognition program | |
CN113283327A (en) | Video text generation method, device, equipment and storage medium | |
CN111881297A (en) | Method and device for correcting voice recognition text | |
CN113838460A (en) | Video voice recognition method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||