CN111091834B - Text and audio alignment method and related product - Google Patents

Text and audio alignment method and related product

Info

Publication number
CN111091834B
CN111091834B (application CN201911342808.7A)
Authority
CN
China
Prior art keywords
text
corpus
segment
matching
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911342808.7A
Other languages
Chinese (zh)
Other versions
CN111091834A (en)
Inventor
王庆然
高建清
万根顺
黄佑银
崔芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911342808.7A priority Critical patent/CN111091834B/en
Publication of CN111091834A publication Critical patent/CN111091834A/en
Application granted granted Critical
Publication of CN111091834B publication Critical patent/CN111091834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the present application discloses a text and audio alignment method and a related product. The method comprises the following steps: performing speech recognition on collected audio data to obtain a recognition text, and acquiring a corpus text of the audio data; matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text; repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text; and acquiring the time boundary of the small-segment recognition text, and extracting the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text. The technical solution has the advantage of low cost.

Description

Text and audio alignment method and related product
Technical Field
The present application relates to the field of speech and text processing technologies, and in particular to a text and audio alignment method and a related product.
Background
Unlike training data such as text and images, speech training data is difficult to collect and expensive to annotate, which makes it harder to obtain. Speech resources also contain a large amount of private information, such as personal privacy and business secrets, which makes the acquisition of speech data increasingly difficult. Speech resources obtained from public channels such as the Internet may suffer from poor sound quality or unsuitable recording scenes. To achieve good results, research institutions and companies can only manually record audio that meets the required specifications and scenes and then annotate it manually, at great cost. The usable training audio data produced in this way is expensive, so the cost of existing speech training data is high.
Disclosure of Invention
The embodiment of the present application provides a text and audio alignment method and a related product, so that speech training data can be acquired at low cost, which has the advantage of reducing the cost of speech training.
In a first aspect, a method for aligning text and audio is provided, the method comprising the steps of:
performing speech recognition on the collected audio data to obtain a recognition text, and acquiring a corpus text of the audio data;
matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text; repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
and acquiring the time boundary of the small-segment recognition text, and extracting the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
In a second aspect, there is provided a text-to-audio alignment apparatus, the apparatus comprising:
the speech recognition unit is used for performing speech recognition on the collected audio data to obtain a recognition text;
the matching and segmenting unit is used for matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text;
the repairing unit is used for repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
and the processing unit is used for acquiring the time boundary of the small-segment recognition text and extracting the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that the technical solution provided by the present application processes the collected audio data and the corpus text, corrects the corpus text by comparing it with the recognition text, and segments the audio data, so that each segmented audio corresponds to a small segment of text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio, and thus the accuracy of the small-segment text.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from them without creative effort.
Fig. 1 is a flowchart illustrating a text-to-audio alignment method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a method for acquiring a short text according to the second embodiment of the present application.
Fig. 2-1 is a schematic diagram between an anchor point and a character string provided in the second embodiment of the present application.
Fig. 2-2 is a schematic comparison diagram of two strings to be matched, a1 and B1, provided in the second embodiment of the present application.
Fig. 3 is a schematic flowchart of a method for repairing a text according to a third embodiment of the present application.
Fig. 3-1 is a schematic diagram of repairing a corpus text and a recognition text provided in the third embodiment of the present application.
Fig. 4 is a schematic structural diagram of a text-to-audio alignment apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Although training audio data is costly to produce, a large amount of audio with corresponding corpus text exists on the network and is relatively easy and cheap to collect, even though it cannot be used for training directly. Examples include the program audio (original audio) of a television station together with its introduction text (corpus text), or the dubbed audio of a movie together with its script text. However, such long audio and text cannot be used directly as training audio data: long audio cannot be trained on directly, and it is difficult to find a good way to automatically cut the audio file and to locate, in the long original corpus text, the text label corresponding to each speech segment. Moreover, the original corpus text obtained in this way may be incomplete or irregular, so the text would otherwise have to be labeled and aligned manually and repaired appropriately.
Example one
Referring to fig. 1, fig. 1 provides a text and audio alignment method. The method is executed on an electronic device, which may be a general-purpose computer, a server, or another device; in practical applications, the electronic device may also be a data processing center, a cloud platform, or the like, and the present application does not limit its specific form. In addition, the text illustrated in this embodiment is only a short piece of text used for illustration; in practical applications, the method provided in this embodiment may be applied to long texts and, of course, to short texts as well. As shown in fig. 1, the method comprises the following steps:
step S101, obtaining audio data and a corpus text corresponding to the audio data.
The original corpus text substantially corresponds to the original audio, for example a novel text on the network and its corresponding recording, the program audio of a television station and its corresponding introduction text, or the script text corresponding to a movie.
Step S102, performing speech recognition on the audio data to obtain a recognition text.
The audio is decoded by a speech recognition technology (such as a speech recognition model) to obtain the corresponding recognition text, and at the same time the recognition confidence of each word in the recognition text is obtained. Then, after the recognition text is forcibly aligned with the corresponding audio again by using the acoustic model, the time boundary of each word in the recognition text can be obtained.
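The output of step S102 can be thought of as a sequence of recognized words, each carrying a confidence score from the decoder and a time boundary from forced alignment. The following is a minimal sketch of such a data structure; the concrete values are illustrative, and in practice they would come from whichever recognition engine and acoustic model are actually used.

    from dataclasses import dataclass

    @dataclass
    class RecognizedWord:
        text: str          # the recognized word
        confidence: float  # recognition confidence reported by the decoder, in [0, 1]
        start: float       # start time in seconds, from forced alignment
        end: float         # end time in seconds, from forced alignment

    # Illustrative output for a short utterance; real values come from the
    # recognition engine and the forced-alignment pass.
    recognition_segment = [
        RecognizedWord("living", 0.97, 5.02, 5.31),
        RecognizedWord("in",     0.99, 5.31, 5.40),
        RecognizedWord("such",   0.95, 5.40, 5.72),
        RecognizedWord("a",      0.99, 5.72, 5.80),
        RecognizedWord("world",  0.98, 5.80, 6.25),
    ]

    # The time boundary of a recognized text segment is the start time of its
    # first word and the end time of its last word.
    segment_boundary = (recognition_segment[0].start, recognition_segment[-1].end)
    print(segment_boundary)  # (5.02, 6.25)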
Step S103, matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text.
For a specific implementation of step S103, reference may be made to the description of the second embodiment, which is not repeated here.
Step S104, repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text.
For a specific implementation of step S104, reference may be made to the description of the third embodiment, which is not repeated here.
Step S105, acquiring the time boundary of the small-segment recognition text, and extracting the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
The implementation method of step S105 may specifically include:
and marking time end points (namely marking the head and tail time end points) for the repaired short section text according to the head and tail time end points of the recognized text segment corresponding to each section of the repaired short section text, and cutting the original audio file according to the time end points so as to obtain the audio segment corresponding to the repaired short section text.
For example, the beginning and end time endpoints of the corresponding recognized text segment for repairing the short segment of text are: 0.05-0.10, namely the head time end point is 5 seconds, and the tail time end point is 10 seconds, then time end points [ 0.05,0.10 ] can be marked on the repaired short section text, audio segments [ 0.05,0.10 ] are extracted from the original audio file, and the repaired short section text and the audio segments [ 0.05,0.10 ] are used as training audio.
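A minimal sketch of this cutting step is shown below, assuming the original audio is an uncompressed WAV file; the file names are hypothetical, and a real pipeline would take the start and end times from the matched recognized text segment rather than hard-coding them.

    import wave

    def cut_audio_segment(src_path: str, dst_path: str, start_s: float, end_s: float) -> None:
        """Extract the audio between start_s and end_s (in seconds) from a WAV file,
        as in step S105: the boundaries come from the matched recognized text segment."""
        with wave.open(src_path, "rb") as src:
            params = src.getparams()
            rate = src.getframerate()
            start_frame = int(start_s * rate)
            n_frames = int((end_s - start_s) * rate)
            src.setpos(start_frame)          # jump to the head time endpoint
            frames = src.readframes(n_frames)
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(params)
            dst.writeframes(frames)

    # e.g. the repaired small-segment text spans 5 s to 10 s of the original audio
    cut_audio_segment("original.wav", "segment_0005_0010.wav", 5.0, 10.0)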
The technical solution provided by the present application processes the collected audio data and the corpus text, corrects the corpus text by comparing it with the recognition text, and segments the audio data so that each segmented audio corresponds to a small segment of text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio, and thus the accuracy of the small-segment text. The technical solution provided by the present application can therefore acquire training audio automatically through software, which reduces the cost.
Example two
The technical solution provided in the second embodiment of the present application is a refinement of step S103 in the first embodiment. The solution of this embodiment may be executed by an electronic device; for the form of the electronic device and the implementation scenario, reference may be made to the description of the first embodiment, which is not repeated here. Referring to fig. 2, fig. 2 provides a method for acquiring small-segment texts. As shown in fig. 2, the method includes the following steps:
step S201, marking a plurality of anchor points for the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text.
The implementation method of the step S201 may include:
and replacing punctuation marks of the corpus text and the recognition text with semantic marks to obtain a rough matching result of the corpus text and a rough matching result of the recognition text.
Specifically, segmentation may be performed according to punctuation, and the segmentable punctuation such as comma, period, semicolon, question mark and exclamation mark may be replaced by special characters, such as "@", although in practical applications, other symbols may be used, such as "@", so that during matching process, @ and @ can be naturally matched, and @ can become semantic mark.
For example, the corpus text "this is a plant that is importable, but it is a plant on big grasslands where other animals may be lazy to quench thirst. ", the punctuation mark is changed into the semantic mark to become a coarse matching result; "this is a plant that is imponderable @ but is a plant that other animals on the grassland may lazy to quench their thirst @".
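A minimal sketch of this punctuation-to-semantic-mark replacement follows; the punctuation set and the choice of "@" as the mark simply follow the example above and are assumptions that can be adjusted.

    import re

    # Punctuation at which the text may be split (Western and Chinese forms),
    # following the example above; the set can be extended as needed.
    SPLITTABLE_PUNCTUATION = r"[,.;?!，。；？！]"

    def mark_semantic_anchors(text: str, mark: str = "@") -> str:
        """Replace splittable punctuation with a semantic mark so that, during
        matching, one mark naturally matches another (step S201)."""
        return re.sub(SPLITTABLE_PUNCTUATION, mark, text)

    corpus = "this is an unremarkable plant, but other animals may be too lazy to use it to quench their thirst."
    print(mark_semantic_anchors(corpus))
    # this is an unremarkable plant@ but other animals may be too lazy to use it to quench their thirst@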
Step S202, performing coarse matching and fine matching on the marked corpus text and the marked recognition text, adjusting the positions of the anchor points, and cutting the marked corpus text and the marked recognition text with the adjusted anchor point positions as boundaries to obtain a small-segment corpus text and a small-segment recognition text.
The implementation method of step S202 may specifically include:
and performing coarse matching on the n corpus characters in the anchor point peripheral setting area of the marked corpus text and the n identification characters in the anchor point peripheral setting area of the marked identification text to coarsely adjust the positions of the anchor points, and performing fine matching on the w corpus characters and the p corpus characters between the two anchor points in the coarsely adjusted positions to finely adjust the positions of the anchor points to obtain the adjusted positions of the anchor points.
The coarsely matching the n corpus character strings in the anchor point peripheral setting area of the tagged corpus text with the n identification character strings in the anchor point peripheral setting area of the tagged identification text to coarsely adjust the positions of the anchor points may specifically include:
and modifying the n corpus characters according to the n recognition characters to enable the two characters to be the same, acquiring the number x of the modified characters, if x is smaller than or equal to the matching threshold, not adjusting the anchor point position, and if x is larger than the matching threshold, moving the anchor point position until x is smaller than or equal to the matching threshold.
For better illustration, a practical example of the coarse matching process is described below.
Referring to fig. 2-1, fig. 2-1 is a schematic diagram of the relationship between an anchor point and the n characters around it. For convenience of description, the corpus text is denoted as character string A and the recognition text as character string B.
As shown in fig. 2-1, the two bars in the figure represent the two character strings to be matched, A and B, whose lengths are la and lb respectively. The number of anchor marks to be set can be chosen freely; for example, three anchor marks may be set in each of the two long strings, and the spacing between them can also be chosen freely. Here, for convenience, the anchor marks are set at equal intervals: positions 0.25*lb, 0.5*lb and 0.75*lb of string B and positions 0.25*la, 0.5*la and 0.75*la of string A are taken as anchor marks, and character matching is performed. During matching, the semantic mark "@" closest to the anchor position in the shorter string B is found (the semantic mark is located first in order to obtain a sub-string to be matched, and the matched sub-strings are then compared to determine an edit distance; if the semantic mark were not located, the whole shorter string B would have to be compared with string A, which would increase the comparison difficulty and the amount of computation), and the n characters of string B around it are taken. If n is 5, the 5 characters "a b @ c d" are obtained (a, b, c and d are placeholder symbols, where a and b represent the two characters immediately before the "@" symbol and c and d the two characters immediately after it). A minimum-edit-distance algorithm is then used at the corresponding position of string A to find n characters whose edit distance to this sub-string is smaller than a set threshold, and these are used as the anchor mark. For example, near 0.5*lb of string B there is a piece of content "... living in such a world @ me very happy ...", and the n characters around the nearest "@" symbol are taken, such as the 5 characters "world @ me very" (the first sub-string); near 0.5*la of string A there is a corresponding sentence containing a sub-string that can be matched with "world @ me" (the second sub-string), and transforming one sub-string into the other only requires inserting one character (one modified-character operation) and deleting one character (another modified-character operation), so the number of modified characters x is 2 (the number of modified characters x is the number of insert, delete or replace operations needed to make the two sub-strings identical; here one insertion and one deletion are needed, so x is 2). If the matching threshold is 3, the matching is judged to be successful (if x were greater than the matching threshold, the positions of the anchor marks would have to be changed until x became less than or equal to the matching threshold), and the two sub-strings are used as anchor marks. After the third anchor mark is determined in this way, the matching regions of string A and string B are divided into four segments, and the subsequent matching can be performed in the four regions separately. The matching within the four regions is the fine matching process; for convenience of description, the four regions are denoted A1, A2, A3, A4 and B1, B2, B3, B4 respectively.
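The coarse-matching check described above can be sketched as follows: take the n characters around the semantic mark "@" nearest to the anchor position in the recognition string, compute the edit distance to candidate n-character windows near the corresponding position of the corpus string, and accept the anchor only when that distance does not exceed the matching threshold. This is a simplified illustration that assumes a character-level Levenshtein distance; the window size n, the threshold, and the size of the scanned neighbourhood are example values, not fixed by the method.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: the minimum number of insertions, deletions and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete from a
                               cur[j - 1] + 1,              # insert into a
                               prev[j - 1] + (ca != cb)))   # substitute (0 if equal)
            prev = cur
        return prev[-1]

    def coarse_match(corpus: str, recog: str, anchor_pos: int, n: int = 5, threshold: int = 3):
        """Check one anchor: take the n characters around the '@' nearest to anchor_pos
        in the (shorter) recognition string and look for an n-character window near the
        proportional position of the corpus string whose edit distance x is within the
        matching threshold. Returns the accepted window position, or None."""
        marks = [i for i, ch in enumerate(recog) if ch == "@"]
        at = min(marks, key=lambda i: abs(i - anchor_pos)) if marks else anchor_pos
        sub = recog[max(0, at - n // 2): at + n // 2 + 1]
        centre = int(anchor_pos / len(recog) * len(corpus))
        for start in range(max(0, centre - 20), min(len(corpus) - n, centre + 20) + 1):
            if edit_distance(sub, corpus[start:start + n]) <= threshold:
                return start
        return None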
The fine matching of the w corpus characters and the p recognition characters between two coarsely adjusted anchor points, so as to readjust the positions of the anchor points, may specifically include:
performing fine matching on each character of the w corpus characters and the p recognition characters to obtain an edit distance for each character, and obtaining the maximum edit distance y among the w corpus characters; if y is less than or equal to the matching threshold, the anchor point position is not adjusted, and if y is greater than the matching threshold, the anchor point position is moved until y is less than or equal to the matching threshold.
Continuing with the above example and referring to fig. 2-1, the matching is divided into four regions for analysis: after the anchor marks have been set, the matching process is divided into four blocks, and the "matrix matching" algorithm designed in the present application may then be adopted. The main idea of the "matrix matching" algorithm is to build a fully connected matrix of the sub-segments to be matched. As shown in fig. 2-2, the two character strings to be matched are A1 (containing the w corpus characters) and B1 (containing the p recognition characters), whose lengths are 8 and 6 respectively (i.e. w is 8 and p is 6). Each word in string B1 corresponds to each word in string A1, which amounts to maintaining a 6 x 8 matrix. The algorithm proceeds as follows:
a: take the shorter sub-string B1 and traverse each word of B1, starting the matching with an edit-distance threshold of 0 and beginning from the first character;
b: for each character in B1, if the edit distance L between the character and a character in string A1 does not exceed the threshold (initially 0), the match succeeds and a match flag is marked at the corresponding position in string A1; here the edit distance between two characters is 0 if they are identical and 1 if they differ.
In fig. 2-2, for example, the first character "@" in B1 is matched against the first character "@" in A1; the edit distance is 0, which does not exceed the threshold of 0, so the match succeeds and a match flag is marked on the first character "@" of string A1;
c: the next character to be matched starts matching from the position after the match flag, until the end of segment B1. If all characters have been matched, the matching ends; otherwise, the sub-segments of B1 that were not matched successfully are matched against the corresponding sub-segments of A1, with the global edit-distance threshold increased by 1, which increases the probability that the remaining sub-segments match successfully. For example, in fig. 2-2 only one character remains unmatched, and the sub-segments are "no" in B1 and "true very" in string A1;
d: step c is repeated until all characters are matched successfully. In fig. 2-2, "no" and "true very" cannot be matched until the edit-distance threshold is raised to 3, at which point all characters are matched, so the maximum edit distance y is 3.
The technical solution provided in this embodiment supports the implementation of the method of the first embodiment, and therefore has the advantage of saving cost.
Example three
The third embodiment of the present application provides a refinement of step S104 in the first embodiment, and specifically provides a method for repairing a text. The application scenario of this embodiment is the same as that of the first and second embodiments, and the technical solution of this embodiment may also be executed by an electronic device. As shown in fig. 3, the method includes the following steps:
Step S301, acquiring a small-segment text W1 of the corpus text (one of the small-segment corpus texts) and a small-segment text W2 of the recognition text (one of the small-segment recognition texts).
W1 and W2 are a matched pair of small-segment texts.
Step S302, performing an alignment operation on the small-segment text W1 and the small-segment text W2 to obtain an aligned corpus text and an aligned recognition text.
Step S303, repairing the corresponding characters of the small-segment text W1 according to the confidence of each character in the aligned recognition text to obtain a repaired small-segment text.
The implementation of step S303 may specifically include:
if the confidence is greater than the confidence threshold, determining the character of the repaired small-segment text to be the character of the aligned recognition text;
and if the confidence is smaller than a first threshold, determining the character of the repaired small-segment text to be the corresponding character of the aligned corpus text.
Referring to fig. 3-1, W1 may be the corpus text B1 and W2 may be the recognition text A1, and the alignment operation is performed as shown in fig. 3-1.
Referring to fig. 3-1, since the first character "@" of texts A1 and B1 is in a matching state, "@" is directly used as the first character of the repaired small-segment text C1. When the second character of C1 is constructed, a substitution error occurs: the "I" in the recognition text A1 differs from the "you" in the corpus text. The confidence of "I" in the recognition text A1 is queried and found to be 99.3%; if the threshold is set to 98%, the confidence of the current word exceeds the threshold, i.e. the decoding result of the engine is judged to be credible, so the character of the recognition text is trusted and the second character of C1 is set to "I";
when the third character of C1 is constructed, the texts are found not to match: the recognition text A1 differs from the corpus text B1 at the word "very". The two unmatched pieces of text are aligned according to the word-segmentation information: the word "very" in A1 corresponds to the word "not" in B1, which is a substitution error, and the adverb "true" in A1 corresponds to an empty character in B1, which is a deletion error. The deletion error is handled first: the confidence of "true" is checked and found to be 87%; with the confidence threshold for deletion errors set to 95%, "true" is judged not credible and an empty character is added to C1. For the substitution error, the confidence of "very" is 99%, which is higher than the threshold of 98%, so "very" is judged more credible and the corresponding position of C1 is set to "very";
when the fourth character of C1 is constructed, the characters are found to match and are written directly;
when the sixth character of C1 is constructed, an insertion error occurs; since the inserted content belongs to the corpus text, there is no confidence information and no time-boundary information, and the sensitivity of the speech recognition module is set high, so the inserted tone word "bar" appearing in corpus text B1 is discarded;
when the last character of C1 is constructed, the "@" characters all match, and the repair process of C1 is completed, yielding C1.
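The repair rule can be sketched as follows for a pair of aligned character sequences. The alignment itself (pairing recognition characters, corpus characters and empty characters so that substitution, deletion and insertion errors become visible) is assumed to be given, and the separate thresholds for substitution and deletion errors are taken from the walkthrough above as illustrative values rather than fixed parameters of the method.

    from typing import List, Optional, Tuple

    # Each aligned position pairs a recognition character (with its confidence)
    # with a corpus character; None marks an empty character on either side.
    AlignedChar = Tuple[Optional[str], Optional[float], Optional[str]]

    def repair_segment(aligned: List[AlignedChar],
                       sub_threshold: float = 0.98,
                       del_threshold: float = 0.95) -> str:
        """Build the repaired small-segment text C1 from an aligned recognition
        segment (A1) and corpus segment (B1), following step S303."""
        repaired = []
        for recog_ch, conf, corpus_ch in aligned:
            if recog_ch is not None and recog_ch == corpus_ch:
                repaired.append(recog_ch)          # matched: copy directly
            elif recog_ch is None:
                continue                           # insertion error: discard the corpus-only character
            elif corpus_ch is None:
                if conf is not None and conf > del_threshold:
                    repaired.append(recog_ch)      # deletion error: keep only a credible engine output
            else:
                # substitution error: trust whichever side the confidence supports
                repaired.append(recog_ch if conf is not None and conf > sub_threshold else corpus_ch)
        return "".join(repaired)

    # Toy usage mirroring the walkthrough: "I" (99.3%) replaces "you", the
    # low-confidence "true" (87%) is dropped, "very" (99%) replaces "not",
    # and the corpus-only tone word is discarded.
    aligned = [("@", 1.00, "@"), ("I", 0.993, "you"), ("true", 0.87, None),
               ("very", 0.99, "not"), (None, None, "bar"), ("@", 1.00, "@")]
    print(repair_segment(aligned))  # @Ivery@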
The third embodiment of the present application can repair the small-segment text and improves the degree of matching between the text and the audio file, so it has the advantage of improving the accuracy of the training audio and thus the accuracy of training.
Example four
A fourth embodiment of the present application provides an apparatus. Referring to fig. 4, fig. 4 provides a text-to-audio alignment apparatus, the apparatus comprising:
a speech recognition unit 401, configured to perform speech recognition on the collected audio data to obtain a recognition text;
a matching and segmenting unit 402, configured to match and segment the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text;
for a specific implementation of the matching and segmenting unit 402, reference may be made to the description of the second embodiment, which is not repeated here;
a repairing unit 403, configured to repair the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
for a specific implementation of the repairing unit 403, reference may be made to the description of the third embodiment, which is not repeated here;
and a processing unit 404, configured to obtain the time boundary of the small-segment recognition text, and extract the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
The technical solution provided by the present application processes the collected audio data and the corpus text, corrects the corpus text by comparing it with the recognition text, and segments the audio data so that each segmented audio corresponds to a small segment of text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio, and thus the accuracy of the small-segment text. The technical solution provided by the present application can therefore acquire training audio automatically through software, which reduces the cost.
In an alternative, the matching and segmenting unit 402 may include: a marking module, a matching module and a cutting module;
the marking module is configured to mark a plurality of anchor points on the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text;
the matching module is configured to perform coarse matching and fine matching on the marked corpus text and the marked recognition text and to adjust the positions of the anchor points;
and the cutting module is configured to cut the marked corpus text and the marked recognition text with the adjusted anchor point positions as boundaries to obtain a small-segment corpus text and a small-segment recognition text.
Optionally, the marking module is specifically configured to replace punctuation marks in the corpus text and the recognition text with semantic marks to obtain the marked corpus text and the marked recognition text.
Optionally, the matching module is specifically configured to perform coarse matching between the n corpus characters in the set area around an anchor point of the marked corpus text and the n recognition characters in the set area around the corresponding anchor point of the marked recognition text, so as to coarsely adjust the position of the anchor point, and then to perform fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points, so as to finely adjust the positions of the anchor points and obtain the adjusted anchor point positions.
Optionally, the matching module may further include: a coarse matching submodule and a fine matching submodule;
the coarse matching submodule is configured to modify the n corpus characters according to the n recognition characters so that the two character strings become identical, and to obtain the number x of modified characters; if x is less than or equal to the matching threshold, the anchor point position is not adjusted, and if x is greater than the matching threshold, the anchor point position is moved until x is less than or equal to the matching threshold;
and the fine matching submodule is configured to perform fine matching on each character of the w corpus characters and the p recognition characters to obtain an edit distance for each character, and to obtain the maximum edit distance y among the w corpus characters; if y is less than or equal to the matching threshold, the anchor point position is not adjusted, and if y is greater than the matching threshold, the anchor point position is moved until y is less than or equal to the matching threshold.
In an alternative, the repairing unit may include: an alignment module and a repairing module;
the alignment module is configured to perform an alignment operation on the small-segment corpus text and the small-segment recognition text to obtain an aligned corpus text and an aligned recognition text;
and the repairing module is configured to repair the corresponding characters of the aligned corpus text according to the confidence of each character in the aligned recognition text to obtain a repaired small-segment text.
Optionally, the repairing module is specifically configured to determine the character of the repaired small-segment text to be the character of the aligned recognition text if the confidence is greater than the confidence threshold, and to determine the character of the repaired small-segment text to be the corresponding character of the aligned corpus text if the confidence is smaller than a first threshold.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, indirect coupling or communication connection between devices or units, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiments of the present application have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific implementation and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (8)

1. A method for aligning text and audio, the method comprising the steps of:
performing speech recognition on collected audio data to obtain a recognition text, and acquiring a corpus text of the audio data;
matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text; repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
acquiring a time boundary of the small-segment recognition text, and extracting an audio segment corresponding to the time boundary from the audio data as a matching audio of the repaired small-segment text; wherein the matching and segmenting the corpus text and the recognition text to obtain the small-segment corpus text and the small-segment recognition text specifically comprises:
marking a plurality of anchor points on the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text;
performing coarse matching and fine matching on the marked corpus text and the marked recognition text, adjusting the positions of the anchor points, and cutting the marked corpus text and the marked recognition text with the adjusted anchor point positions as boundaries to obtain the small-segment corpus text and the small-segment recognition text;
wherein the performing coarse matching and fine matching on the marked corpus text and the marked recognition text and adjusting the positions of the anchor points specifically comprises:
performing coarse matching between the n corpus characters in a set area around an anchor point of the marked corpus text and the n recognition characters in a set area around the corresponding anchor point of the marked recognition text so as to coarsely adjust the position of the anchor point, and performing fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points so as to finely adjust the positions of the anchor points and obtain the adjusted anchor point positions.
2. The method according to claim 1, wherein the marking a plurality of anchor points on the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text specifically comprises:
replacing punctuation marks in the corpus text and the recognition text with semantic marks to obtain the marked corpus text and the marked recognition text.
3. The method according to claim 1, wherein the performing coarse matching between the n corpus characters in the set area around an anchor point of the marked corpus text and the n recognition characters in the set area around the corresponding anchor point of the marked recognition text so as to coarsely adjust the position of the anchor point specifically comprises:
modifying the n corpus characters according to the n recognition characters so that the two character strings become identical, and acquiring the number x of modified characters; if x is less than or equal to the matching threshold, not adjusting the anchor point position, and if x is greater than the matching threshold, moving the anchor point position until x is less than or equal to the matching threshold.
4. The method according to claim 1, wherein the performing fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points so as to readjust the positions of the anchor points specifically comprises:
performing fine matching on each character of the w corpus characters and the p recognition characters to obtain an edit distance for each character, and obtaining the maximum edit distance y among the w corpus characters; if y is less than or equal to the matching threshold, not adjusting the anchor point position, and if y is greater than the matching threshold, moving the anchor point position until y is less than or equal to the matching threshold.
5. The method according to claim 1, wherein the repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text specifically comprises:
performing an alignment operation on the small-segment corpus text and the small-segment recognition text to obtain an aligned corpus text and an aligned recognition text, and repairing the corresponding characters of the aligned corpus text according to the confidence of each character in the aligned recognition text to obtain the repaired small-segment text.
6. The method according to claim 5, wherein the repairing the corresponding characters of the aligned corpus text according to the confidence of each character in the aligned recognition text to obtain the repaired small-segment text specifically comprises:
if the confidence is greater than a confidence threshold, determining the character of the repaired small-segment text to be the character of the aligned recognition text;
and if the confidence is smaller than a first threshold, determining the character of the repaired small-segment text to be the corresponding character of the aligned corpus text.
7. A text-to-audio alignment apparatus, the apparatus comprising:
a recognition unit, configured to perform speech recognition on collected audio data to obtain a recognition text and to acquire a corpus text of the audio data;
a matching and segmenting unit, configured to match and segment the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text;
a repairing unit, configured to repair the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
and a processing unit, configured to acquire a time boundary of the small-segment recognition text and extract an audio segment corresponding to the time boundary from the audio data as a matching audio of the repaired small-segment text; wherein the matching and segmenting the corpus text and the recognition text to obtain the small-segment corpus text and the small-segment recognition text specifically comprises:
marking a plurality of anchor points on the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text;
performing coarse matching and fine matching on the marked corpus text and the marked recognition text, adjusting the positions of the anchor points, and cutting the marked corpus text and the marked recognition text with the adjusted anchor point positions as boundaries to obtain the small-segment corpus text and the small-segment recognition text;
wherein the performing coarse matching and fine matching on the marked corpus text and the marked recognition text and adjusting the positions of the anchor points specifically comprises:
performing coarse matching between the n corpus characters in a set area around an anchor point of the marked corpus text and the n recognition characters in a set area around the corresponding anchor point of the marked recognition text so as to coarsely adjust the position of the anchor point, and performing fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points so as to finely adjust the positions of the anchor points and obtain the adjusted anchor point positions.
8. A computer-readable storage medium, in which a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-6.
CN201911342808.7A 2019-12-23 2019-12-23 Text and audio alignment method and related product Active CN111091834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911342808.7A CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911342808.7A CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Publications (2)

Publication Number Publication Date
CN111091834A CN111091834A (en) 2020-05-01
CN111091834B 2022-09-06

Family

ID=70395348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911342808.7A Active CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Country Status (1)

Country Link
CN (1) CN111091834B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037769B (en) * 2020-07-28 2024-07-30 出门问问信息科技有限公司 Training data generation method and device and computer readable storage medium
CN111966839B (en) * 2020-08-17 2023-07-25 北京奇艺世纪科技有限公司 Data processing method, device, electronic equipment and computer storage medium
CN112257407B (en) * 2020-10-20 2024-05-14 网易(杭州)网络有限公司 Text alignment method and device in audio, electronic equipment and readable storage medium
CN115062599B (en) * 2022-06-02 2024-09-06 青岛科技大学 Multi-stage voice and text fault tolerance alignment method and device
CN115906781B (en) * 2022-12-15 2023-11-24 广州文石信息科技有限公司 Audio identification anchor adding method, device, equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
CN109145149A (en) * 2018-08-16 2019-01-04 科大讯飞股份有限公司 A kind of information alignment schemes, device, equipment and readable storage medium storing program for executing
CN109300468A (en) * 2018-09-12 2019-02-01 科大讯飞股份有限公司 A kind of voice annotation method and device
CN109830229A (en) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 Audio corpus intelligence cleaning method, device, storage medium and computer equipment
CN110265001A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Corpus screening technique, device and computer equipment for speech recognition training
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9020817B2 (en) * 2013-01-18 2015-04-28 Ramp Holdings, Inc. Using speech to text for detecting commercials and aligning edited episodes with transcripts
US9904672B2 (en) * 2015-06-30 2018-02-27 Facebook, Inc. Machine-translation based corrections

Also Published As

Publication number Publication date
CN111091834A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN111091834B (en) Text and audio alignment method and related product
CN111161739B (en) Speech recognition method and related product
CN103971684B (en) A kind of add punctuate method, system and language model method for building up, device
CN111291566B (en) Event main body recognition method, device and storage medium
CN110750996B (en) Method and device for generating multimedia information and readable storage medium
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
CN112037769B (en) Training data generation method and device and computer readable storage medium
CN110516203B (en) Dispute focus analysis method, device, electronic equipment and computer-readable medium
CN113660432B (en) Translation subtitle making method and device, electronic equipment and storage medium
US11770590B1 (en) Providing subtitle for video content in spoken language
WO2023151424A1 (en) Method and apparatus for adjusting playback rate of audio picture of video
CN108052686B (en) Abstract extraction method and related equipment
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
WO2022206198A1 (en) Audio and text synchronization method and apparatus, device and medium
CN102103612A (en) Information extraction method and device
CN107369450B (en) Recording method and recording apparatus
CN113591491A (en) System, method, device and equipment for correcting voice translation text
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114155841A (en) Voice recognition method, device, equipment and storage medium
CN114611496A (en) Dictionary generation method and device, storage medium and electronic device
CN113065353A (en) Entity identification method and device
CN113157946A (en) Entity linking method and device, electronic equipment and storage medium
CN112966505B (en) Method, device and storage medium for extracting persistent hot phrases from text corpus
CN112651854B (en) Voice scheduling method, device, electronic equipment and storage medium
CN116092063B (en) Short video keyword extraction method

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant