CN111091834A - Text and audio alignment method and related product - Google Patents

Text and audio alignment method and related product Download PDF

Info

Publication number
CN111091834A
CN111091834A (application CN201911342808.7A; granted as CN111091834B)
Authority
CN
China
Prior art keywords
text
corpus
segment
matching
small
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911342808.7A
Other languages
Chinese (zh)
Other versions
CN111091834B (en)
Inventor
王庆然
高建清
万根顺
黄佑银
崔芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911342808.7A
Publication of CN111091834A
Application granted
Publication of CN111091834B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Character Input (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the present application disclose a text and audio alignment method and related products. The method includes: performing speech recognition on collected audio data to obtain a recognition text, and obtaining the corpus text of the audio data; matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text; repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text; and obtaining the time boundary of the small-segment recognition text and extracting the audio segment corresponding to that time boundary from the audio data as the matching audio of the repaired small-segment text. The technical solution has the advantage of low cost.

Description

Text and audio alignment method and related product
Technical Field
The present application relates to the technical field of speech and text processing, and in particular to a text and audio alignment method and related products.
Background
Unlike training data such as text or images, speech training data is difficult to collect and expensive to annotate, which makes it harder to obtain. Speech resources also carry more private information, such as personal and business privacy, which increasingly complicates the acquisition of speech data. Speech obtained from public channels such as the Internet may suffer from poor sound quality or unsuitable recording scenes. To achieve good results, research institutions and companies therefore have to manually record audio that meets their specification and scene requirements and annotate it manually at great cost. Training audio data produced in this way is expensive, so the cost of existing speech training data is high.
Disclosure of Invention
The embodiments of the present application provide a text and audio alignment method and related products, so that speech training data can be obtained at low cost, which has the advantage of reducing the cost of speech training.
In a first aspect, a text and audio alignment method is provided, the method comprising the steps of:
performing speech recognition on collected audio data to obtain a recognition text, and obtaining the corpus text of the audio data;
matching and segmenting the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text; repairing the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
and obtaining the time boundary of the small-segment recognition text, and extracting the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
In a second aspect, a text and audio alignment apparatus is provided, the apparatus comprising:
a speech recognition unit, configured to perform speech recognition on collected audio data to obtain a recognition text;
a matching and segmenting unit, configured to match and segment the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text;
a repairing unit, configured to repair the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
and a processing unit, configured to obtain the time boundary of the small-segment recognition text and extract the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, the computer program causing a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium that stores a computer program, the computer program being operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that the technical solution provided by the present application processes the collected audio data and corpus text, compares the corpus text with the recognition text to correct the corpus text, and segments the audio data, so that each segmented audio piece corresponds to a small-segment text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio, and thereby the accuracy of the small-segment text.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text and audio alignment method provided in an embodiment of the present application.
Fig. 2 is a flowchart of a method for obtaining small-segment texts provided in the second embodiment of the present application.
Fig. 2-1 is a schematic diagram of the relationship between anchor points and character strings provided in the second embodiment of the present application.
Fig. 2-2 is a schematic comparison diagram of two strings to be matched, A1 and B1, provided in the second embodiment of the present application.
Fig. 3 is a flowchart of a method for repairing a text provided in the third embodiment of the present application.
Fig. 3-1 is a schematic diagram of repairing a corpus text against a recognition text provided in the third embodiment of the present application.
Fig. 4 is a schematic structural diagram of a text and audio alignment apparatus provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Acquiring training audio data is costly, but there is a large amount of audio on the network, together with corresponding corpus text, that falls between directly usable and unusable, and such data is relatively easy and inexpensive to obtain: for example, the program audio (original audio) of a television station and its accompanying introduction text (corpus text), or the dubbed audio of a movie and its script text. However, such long audio and text cannot be used directly as training audio data: long audio cannot be trained on directly, it is difficult to find a good way to automatically cut the audio file, and it is hard to locate the text label corresponding to each speech segment in the long original corpus text. Moreover, the obtained original corpus text may be missing content or be non-standard, so the text would otherwise need to be labeled and aligned manually and repaired appropriately.
Example one
Referring to fig. 1, fig. 1 provides a text and audio alignment method. The method is executed on an electronic device, which may be a general-purpose computer, a server or another device; in practical applications it may also be a data processing center, a cloud platform or a similar system, and the present application does not limit the specific form of the electronic device. In addition, the text used in this embodiment is only a short piece of text for illustration; in practical applications the method provided in this embodiment can be applied to long texts as well as short texts. As shown in fig. 1, the method comprises the following steps:
step S101, obtaining audio data and a corpus text corresponding to the audio data.
The original corpus text roughly corresponds to the original audio, for example a novel text and its recording on the network, the program audio of a television station and the corresponding introduction text, or the script text corresponding to a movie.
And S102, carrying out voice recognition on the audio data to obtain a recognition text.
The audio is decoded using speech recognition technology (for example a speech recognition model) to obtain the corresponding recognition text, and at the same time the recognition confidence of each word in the recognition text is obtained. The recognition text is then forcibly aligned with the corresponding audio using the acoustic model, after which the time boundary of each word in the recognition text can be obtained.
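A minimal sketch of the per-word output this step is assumed to produce and that the later segmentation, repair and cutting steps rely on; the type and field names are illustrative, not prescribed by the method:

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    """One decoded word with the information used by the later steps."""
    text: str          # the recognized word or character
    confidence: float  # recognition confidence, e.g. in [0, 1]
    start_s: float     # time boundary: start time in seconds (from forced alignment)
    end_s: float       # time boundary: end time in seconds (from forced alignment)
```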
And S103, matching and segmenting the corpus text and the identification text to obtain a small-segment corpus text and a small-segment identification text.
For a specific implementation of the step S103, reference may be made to the description of the second embodiment, which is not described herein again.
And S104, repairing the small-section corpus text according to the small-section identification text to obtain a repaired small-section text.
For a specific implementation of the step S104, reference may be made to the description of the third embodiment, which is not described herein again.
And S105, acquiring the time boundary of the small section of the identification text, and extracting an audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small section of the text.
The implementation method of the step S105 may specifically include:
and marking time end points (namely marking head and tail time end points) for the repaired short section text according to the head and tail time end points of the recognized text segment corresponding to each section of the repaired short section text, and cutting the original audio file according to the time end points, thereby obtaining the audio segment corresponding to the repaired short section text.
For example, the beginning and end time endpoints of the corresponding recognized text segment for repairing the short segment of text are: 0.05-0.10, namely the head time end point is 5 seconds, and the tail time end point is 10 seconds, then time end points [ 0.05,0.10 ] can be marked on the repaired short section text, audio segments [ 0.05,0.10 ] are extracted from the original audio file, and the repaired short section text and the audio segments [ 0.05,0.10 ] are used as training audio.
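A minimal sketch of this cutting step, assuming the original audio is available as a PCM WAV file; the helper name and file names are illustrative:

```python
import wave

def cut_audio_segment(in_path: str, out_path: str, start_s: float, end_s: float) -> None:
    """Extract the audio segment [start_s, end_s] (in seconds) from a PCM WAV file."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start_s * rate))                       # jump to the start time endpoint
        frames = src.readframes(int((end_s - start_s) * rate))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)                                 # frame count is corrected on close
        dst.writeframes(frames)

# Example: the repaired small-segment text labelled with time endpoints [00:05, 00:10]
# cut_audio_segment("original.wav", "segment_5s_10s.wav", 5.0, 10.0)
```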
The technical solution provided by the present application processes the collected audio data and corpus text, compares the corpus text with the recognition text to correct the corpus text, and segments the audio data so that each segmented audio piece corresponds to a small-segment text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio and thereby the accuracy of the small-segment text. The technical solution provided by the present application can therefore obtain training audio automatically through software, which reduces cost.
Example two
The technical solution provided in the second embodiment of the present application is a refinement of step S103 in the first embodiment. The solution of this embodiment may be executed by an electronic device, whose form is described in the first embodiment, and the implementation scenario of this embodiment may be the same as that of the first embodiment, which is not repeated here. Referring to fig. 2, fig. 2 provides a method for obtaining small-segment texts; as shown in fig. 2, the method comprises the following steps:
step S201, marking a plurality of anchor points on the corpus text and the identification text to obtain a marked corpus text and a marked identification text.
The implementation method of the step S201 may include:
and replacing punctuation marks of the corpus text and the recognition text with semantic marks to obtain a rough matching result of the corpus text and a rough matching result of the recognition text.
Specifically, segmentation may be performed according to punctuation, and the segmentable punctuation such as comma, period, semicolon, question mark and exclamation mark may be replaced by special characters, such as "@", although in practical applications, other symbols may be used, such as "@", so that during matching process, @ and @ can be naturally matched, and @ can become semantic mark.
For example, the corpus text "this is a plant that is impoverished, but it is a plant on a large grassland that other animals may be lazy to quench. "the punctuation mark is changed into semantic mark to become coarse matching result; "this is a plant that is imponderable @ but is a plant that other animals on the grassland may lazy to quench their thirst @".
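A minimal sketch of this marking step; the punctuation set and the "@" mark follow the example above, and the regular expression is an illustrative choice:

```python
import re

# Segmentable punctuation (Chinese and Western commas, periods, semicolons,
# question marks and exclamation marks); the concrete set and the "@" mark
# are the illustrative choices from the example above.
SEGMENT_PUNCT = re.compile(r"[，。；？！,.;?!]")

def mark_semantics(text: str) -> str:
    """Replace segmentable punctuation with the semantic mark '@'."""
    return SEGMENT_PUNCT.sub("@", text)

# mark_semantics("this is an unremarkable plant, but ... quench their thirst.")
# -> "this is an unremarkable plant@but ... quench their thirst@"
```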
Step S202, performing rough matching and fine matching on the marked corpus text and the marked recognition text, adjusting the positions of the anchor points, and cutting the marked corpus text and the marked recognition text by taking the adjusted positions of the anchor points as boundaries to obtain a small-segment corpus text and a small-segment recognition text.
The implementation method of the step S202 may specifically include:
and performing coarse matching on the n corpus characters in the anchor point peripheral setting area of the marked corpus text and the n identification characters in the anchor point peripheral setting area of the marked identification text to coarsely adjust the positions of the anchor points, and performing fine matching on the w corpus characters and the p corpus characters between the two anchor points in the coarsely adjusted positions to finely adjust the positions of the anchor points to obtain the adjusted positions of the anchor points.
The coarsely matching the n corpus character strings in the anchor point peripheral setting area of the tagged corpus text with the n identification character strings in the anchor point peripheral setting area of the tagged identification text to coarsely adjust the positions of the anchor points may specifically include:
and modifying the n corpus characters according to the n recognition characters to enable the two characters to be the same, acquiring the number x of the modified characters, if x is smaller than or equal to the matching threshold, not adjusting the anchor point position, and if x is larger than the matching threshold, moving the anchor point position until x is smaller than or equal to the matching threshold.
For better illustration, the following describes a practical example of the implementation method of coarse matching.
Referring to fig. 2-1, fig. 2-1 is a schematic diagram of the relationship between anchor points and the n characters. For convenience of description, the corpus text is denoted as character string A and the recognition text as character string B.
As shown in fig. 2-1, the horizontal and vertical bands in the figure represent the two character strings A and B to be matched, with lengths la and lb respectively. The number of anchor marks can be chosen freely; for example, three anchor marks can be set in each of the two long strings, and the spacing can also be chosen freely. For convenience, the anchor marks are placed at equal intervals here: positions 0.25 × lb, 0.5 × lb and 0.75 × lb of string B and positions 0.25 × la, 0.5 × la and 0.75 × la of string A are taken as candidate anchor marks, and character matching is performed. The semantic mark "@" closest to each candidate position of the shorter string B is located first (the semantic mark is located in order to obtain a matching sub-string, which is then compared to determine an edit distance; if no semantic mark were found, the whole shorter string B would have to be compared with string A, which would greatly increase the comparison difficulty and the amount of computation), and the n characters of string B around that semantic mark are taken. For n = 5 this gives five characters "a b @ c d" (a, b, c, d are placeholders, where a and b denote the two characters immediately before the "@" symbol and c and d denote the two characters immediately after it). The corresponding position of string A is then examined according to a minimum edit distance criterion, and n characters whose edit distance to the sub-string of B is smaller than a set threshold are taken as the anchor mark. For example, around 0.5 × lb of string B there is a passage "... living in such a world @ I am very happy. ...", and the n characters around the nearest "@" symbol, for n = 5, form the first sub-string; around 0.5 × la of string A there is a corresponding sentence that can be matched with this sub-string, and transforming one into the other only requires inserting one character (one modify-character operation) and deleting one character (another modify-character operation), so the number of modified characters is x = 2 (the number of modified characters x is the number of insert, delete or replace operations needed to make the two identical; here one insertion and one deletion are needed, so x = 2). If the matching threshold is 3, the matching is judged successful (if x were larger than the matching threshold, the positions of the anchor marks would have to be moved until x is smaller than or equal to the matching threshold), and the two sub-strings are used as an anchor mark. After the third anchor mark has been determined in this way, the matching regions of string A and string B are divided into four segments, and the subsequent matching can be carried out in these four regions separately; the matching within the four regions is the fine matching process. For convenience of description, the four regions are denoted A1, A2, A3, A4 and B1, B2, B3, B4.
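A minimal sketch of this coarse step in Python; the window size n = 5, the matching threshold 3 and the search range are the illustrative values from the example above, and the helper names are assumptions:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (insert / delete / substitute, cost 1 each)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from a
                                     dp[j - 1] + 1,      # insert into a
                                     prev + (ca != cb))  # substitute (0 if equal)
    return dp[-1]

def coarse_anchor(corpus: str, recog: str, ratio: float,
                  n: int = 5, threshold: int = 3, max_shift: int = 50):
    """Place one anchor near position ratio*len(recog) of the shorter string.

    Take the n characters around the '@' closest to the candidate point of the
    recognition string, then slide the corresponding window of the corpus
    string until the number of modified characters x <= threshold.
    """
    pos = int(ratio * len(recog))
    at = min((i for i, c in enumerate(recog) if c == "@"),
             key=lambda i: abs(i - pos), default=pos)
    window_b = recog[max(0, at - n // 2): at + n // 2 + 1]
    centre_a = int(ratio * len(corpus))
    for shift in range(max_shift):
        for sign in (1, -1):
            start = centre_a + sign * shift
            window_a = corpus[max(0, start - n // 2): start + n // 2 + 1]
            if edit_distance(window_a, window_b) <= threshold:
                return start, at        # anchor positions in the corpus / recognition text
    return None                         # no anchor found near this ratio
```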
Performing fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points, so as to adjust the anchor point positions again, may specifically include:
Each of the w corpus characters is finely matched against the p recognition characters to obtain its edit distance, and the maximum edit distance y among the w corpus characters is obtained. If y is smaller than or equal to the matching threshold, the anchor point position is not adjusted; if y is larger than the matching threshold, the anchor point position is moved until y is smaller than or equal to the matching threshold.
Taking the above example again and referring to fig. 2-1, the matching is analysed in the four regions: after the anchor marks are set, the matching process is divided into four blocks, and the "matrix matching" algorithm designed in the present application can then be adopted. The main idea of the matrix matching algorithm is to build a fully connected matrix of the sub-segments to be matched. As shown in fig. 2-2, the two sub-segments to be matched, string A1 (containing w corpus characters) and string B1 (containing p recognition characters), have lengths 8 and 6 (i.e. w = 8 and p = 6). Each character of string B1 corresponds to each character of string A1, which amounts to maintaining a 6 x 8 matrix. The algorithm proceeds as follows:
a: take the sub-string of smaller length, B1, traverse each character of B1, and start matching with an edit distance threshold of 0, beginning from the first character;
b: for each character in B1, if a character in string A1 has an edit distance L that does not exceed the current threshold (initially 0), the matching succeeds and the corresponding position in string A1 is marked with a matching-success mark. The edit distance here is computed per character: if the two characters are identical the distance is 0, otherwise it is 1.
In fig. 2-2, for example, the first character "@" of B1 is matched against the first character "@" of A1: the edit distance is 0, which does not exceed the threshold of 0, so the match succeeds and the matching mark is placed on the first character "@" of string A1;
c: the next character to be matched starts matching from the position after the last matching mark, until the end of segment B1 is reached. If all characters have been matched, the procedure exits; otherwise the sub-segments of B1 that were not matched successfully and the corresponding sub-segments of A1 are matched against each other, and the global edit distance threshold is increased by 1, which increases the probability that the remaining sub-segments match successfully. In fig. 2-2, for example, the unmatched sub-segments are "no" in B1 and "true very" in string A1, where only one character has not been matched successfully;
d: step c is repeated until all characters are matched successfully. In fig. 2-2, "no" and "true very" cannot be matched successfully until the edit distance threshold is raised to 3; at that point all characters are matched, so the maximum edit distance is y = 3.
The technical solution provided in this embodiment supports the implementation of the method of the first embodiment and therefore shares its advantage of saving cost.
Example three
The third embodiment of the present application provides a refinement of step S104 in the first embodiment, and specifically provides a method for repairing a text. The application scenario of this embodiment is the same as that of the first and second embodiments, and the technical solution of this embodiment may also be executed by an electronic device. As shown in fig. 3, the method comprises the following steps:
step S301, a short section text W1 (one of the short section corpus texts) of the corpus text and a short section text W2 (one of the short section identification texts) of the identification text are obtained.
The W1 and the W2 are matched short texts.
Step S302, the short section text W1 and the short section text W2 are aligned to obtain an aligned corpus text and an aligned recognition text.
And S303, repairing corresponding characters of the small segment of text W1 according to the confidence coefficient of each character in the aligned recognition text to obtain a repaired small segment of text.
The implementation method of step S303 may specifically include:
if the confidence is greater than a confidence threshold, the character of the repaired small-segment text is determined to be the character of the aligned recognition text;
and if the confidence is smaller than a first threshold, the character of the repaired small-segment text is determined to be the corresponding character of the aligned corpus text.
Referring to fig. 3-1, W1 may be the corpus text B1 and W2 the recognition text A1, and the alignment operation is performed as shown in fig. 3-1.
Referring to fig. 3-1, since the first character "@" of the A1 and B1 texts is in a matched state, this "@" is used directly as the first character of the repaired small-segment text C1. When the second character of C1 is constructed, a substitution error is encountered: the character "me" in the recognition text A1 differs from the character "you" in the corpus text. The confidence of "me" in the recognition text A1 is queried and found to be 99.3%; with the threshold set to 98%, the confidence of the current character exceeds the threshold, i.e. the engine decoding result is judged credible, so the character of the recognition text is chosen and the second character of C1 is set to "me";
When the third character of C1 is constructed, a mismatch is found, the recognition text A1 differing from the corpus text B1 around the character "very". The two unmatched pieces of text are aligned according to the word segmentation information: the character "very" in A1 corresponds to the character "not" in B1, which is a substitution error, and the adverb "true" in A1 corresponds to an empty character in B1, which is a deletion error. The deletion error is handled first: the confidence of the adverb "true" is 87%, and the confidence threshold for deletion errors is set to 95%, so "true" is judged not credible and the empty character is kept in C1 (no character is inserted). For the substitution error, the confidence of "very" is 99%, higher than the 98% threshold, so "very" is judged more credible and the corresponding position of C1 is set to "very";
When the fourth character of C1 is constructed, the characters are found to match and are written directly;
When the sixth character of C1 is constructed, an insertion error occurs. Since an insertion error belongs to the corpus text, it carries no confidence information and no time boundary information, while the sensitivity of the speech recognition module has been turned up; the erroneously inserted modal word "bar" appearing in the corpus text B1 is therefore discarded;
When the last character of C1 is constructed, the "@" is found to match, and the repair process of C1 is completed, yielding C1.
The third embodiment of the present application thus repairs the small-segment text and improves the degree of matching between the text and the audio file, which improves the accuracy of the training audio and in turn the accuracy of training.
Example four
A fourth embodiment of the present application provides an apparatus. Referring to fig. 4, fig. 4 provides a text and audio alignment apparatus, the apparatus comprising:
a speech recognition unit 401, configured to perform speech recognition on the collected audio data to obtain a recognition text;
a matching and segmenting unit 402, configured to match and segment the corpus text and the recognition text to obtain a small-segment corpus text and a small-segment recognition text;
For a specific implementation of the matching and segmenting unit 402, reference may be made to the description of the second embodiment, which is not repeated here.
a repairing unit 403, configured to repair the small-segment corpus text according to the small-segment recognition text to obtain a repaired small-segment text;
For a specific implementation of the repairing unit 403, reference may be made to the description of the third embodiment, which is not repeated here.
and a processing unit 404, configured to obtain the time boundary of the small-segment recognition text and extract the audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small-segment text.
The technical solution provided by the present application processes the collected audio data and corpus text, compares the corpus text with the recognition text to correct the corpus text, and segments the audio data so that each segmented audio piece corresponds to a small-segment text. The comparison between the corpus text and the recognition text improves the confidence of the small-segment text and the segmented audio and thereby the accuracy of the small-segment text. The technical solution provided by the present application can therefore obtain training audio automatically through software, which reduces cost.
In an alternative, the matching and segmenting unit 402 may include a marking module, a matching module and a cutting module;
the marking module is configured to mark a plurality of anchor points on the corpus text and the recognition text to obtain a marked corpus text and a marked recognition text;
the matching module is configured to perform coarse matching and fine matching on the marked corpus text and the marked recognition text and to adjust the positions of the anchor points;
and the cutting module is configured to cut the marked corpus text and the marked recognition text with the adjusted anchor point positions as boundaries, to obtain the small-segment corpus text and the small-segment recognition text.
Optionally, the marking module is specifically configured to replace the punctuation marks of the corpus text and the recognition text with semantic marks to obtain the marked corpus text and the marked recognition text.
Optionally, the matching module is specifically configured to: perform coarse matching between the n corpus characters in the set area around an anchor point of the marked corpus text and the n recognition characters in the set area around the corresponding anchor point of the marked recognition text, so as to coarsely adjust the positions of the anchor points; and then perform fine matching between the w corpus characters and the p recognition characters lying between two coarsely adjusted anchor points, so as to finely adjust the positions of the anchor points and obtain the adjusted anchor point positions.
Optionally, the matching module may further include: a coarse matching submodule and a fine matching submodule;
and the rough matching sub-module is used for modifying the n corpus characters according to the n recognition characters to enable the two characters to be the same, acquiring the number x of the modified characters, if the x is smaller than or equal to the matching threshold, not adjusting the anchor point position, and if the x is larger than the matching threshold, moving the anchor point position until the x is smaller than or equal to the matching threshold.
And the fine matching sub-module is used for performing fine matching on each character of the w corpus characters and the P corpus characters to obtain the editing distance of each character, obtaining the maximum editing distance y in the w corpus characters, if y is smaller than or equal to the matching threshold, not adjusting the anchor point position, and if y is larger than the matching threshold, moving the anchor point position until y is smaller than or equal to the matching threshold.
In an alternative, the repair unit may include: an alignment module and a repair module;
the alignment module is configured to align the small-segment corpus text with the small-segment recognition text to obtain an aligned corpus text and an aligned recognition text;
and the repairing module is configured to repair the corresponding characters of the aligned corpus text according to the confidence of each character in the aligned recognition text, to obtain the repaired small-segment text.
Optionally, the repairing module is specifically configured to: if the confidence is greater than a confidence threshold, determine that the character of the repaired small-segment text is the character of the aligned recognition text; and if the confidence is smaller than a first threshold, determine that the character of the repaired small-segment text is the corresponding character of the aligned corpus text.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, or a magnetic or optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for aligning text and audio, the method comprising the steps of:
carrying out voice recognition on the collected audio data to obtain a recognition text, and acquiring a corpus text of the audio data;
matching and segmenting the corpus text and the identification text to obtain a small-segment corpus text and a small-segment identification text; repairing the small-segment corpus text according to the small-segment identification text to obtain a repaired small-segment text;
and acquiring the time boundary of the small section of the identification text, and extracting an audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small section of the text.
2. The method according to claim 1, wherein the matching and segmenting the corpus text and the recognition text to obtain a corpus short-segment text and a recognition short-segment text specifically comprises:
marking a plurality of anchor points on the corpus text and the identification text to obtain a marked corpus text and a marked identification text;
and performing rough matching and fine matching on the marked corpus text and the marked recognition text, adjusting the positions of the anchor points, and cutting the marked corpus text and the marked recognition text by taking the adjusted positions of the anchor points as boundaries to obtain a small-segment corpus text and a small-segment recognition text.
3. The method according to claim 2, wherein the labeling the corpus text and the recognition text with a plurality of anchors to obtain a labeled corpus text and a labeled recognition text specifically comprises:
and replacing punctuation marks of the corpus text and the identification text with semantic marks to obtain the marked corpus text and the marked identification text.
4. The method according to claim 2, wherein the performing coarse matching and fine matching on the labeled corpus text and the labeled recognition text, and the adjusting the positions of the anchor points specifically comprises:
and performing coarse matching on the n corpus characters in the anchor point peripheral setting area of the marked corpus text and the n identification characters in the anchor point peripheral setting area of the marked identification text to coarsely adjust the positions of the anchor points, and performing fine matching on the w corpus characters and the p corpus characters between the two anchor points in the coarsely adjusted positions to finely adjust the positions of the anchor points to obtain the adjusted positions of the anchor points.
5. The method according to claim 4, wherein the performing rough matching on the n corpus character strings in the anchor-surrounding set area of the tagged corpus text and the n recognition character strings in the anchor-surrounding set area of the tagged recognition text to coarsely adjust the positions of the anchors specifically comprises:
and modifying the n corpus characters according to the n recognition characters to enable the two characters to be the same, acquiring the number x of the modified characters, if x is smaller than or equal to the matching threshold, not adjusting the anchor point position, and if x is larger than the matching threshold, moving the anchor point position until x is smaller than or equal to the matching threshold.
6. The method according to claim 2, wherein the fine matching of w linguistic characters and p linguistic characters between two anchor points in the coarse adjustment position to readjust the positions of the anchor points comprises:
and performing fine matching on each character of the w corpus characters and the P corpus characters to obtain an editing distance of each character, obtaining the maximum editing distance y in the w corpus characters, if y is smaller than or equal to a matching threshold, not adjusting the anchor point position, and if y is larger than the matching threshold, moving the anchor point position until y is smaller than or equal to the matching threshold.
7. The method according to claim 1, wherein the repairing the short segment corpus text according to the short segment identification text to obtain a repaired short segment text specifically comprises:
and performing alignment operation on the small-segment corpus text and the small-segment identification text to obtain an aligned corpus text and an aligned identification text, and repairing corresponding characters of the aligned corpus text according to the confidence coefficient of each character in the aligned identification text to obtain a repaired small-segment text.
8. The method according to claim 7, wherein the repairing the corresponding character of the aligned corpus text according to the confidence level of each character in the aligned recognition text to obtain a repaired short segment text specifically comprises:
if the confidence coefficient is larger than a confidence threshold value, determining the character string of the repaired short section of text as the character of the aligned recognition text;
and if the confidence coefficient is smaller than a first threshold value, determining that the characters of the repaired short text segment are corresponding characters of the aligned corpus text.
9. A text-to-audio alignment apparatus, the apparatus comprising:
the recognition unit is used for carrying out voice recognition on the collected audio data to obtain a recognition text;
the matching and segmenting unit is used for matching and segmenting the corpus text and the identification text to obtain a small-segment corpus text and a small-segment identification text;
the repairing unit is used for repairing the small-segment corpus text according to the small-segment identification text to obtain a repaired small-segment text;
and the processing unit is used for acquiring the time boundary of the small section of the identification text and extracting an audio segment corresponding to the time boundary from the audio data as the matching audio of the repaired small section of text.
10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-8.
CN201911342808.7A 2019-12-23 2019-12-23 Text and audio alignment method and related product Active CN111091834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911342808.7A CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911342808.7A CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Publications (2)

Publication Number Publication Date
CN111091834A true CN111091834A (en) 2020-05-01
CN111091834B CN111091834B (en) 2022-09-06

Family

ID=70395348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911342808.7A Active CN111091834B (en) 2019-12-23 2019-12-23 Text and audio alignment method and related product

Country Status (1)

Country Link
CN (1) CN111091834B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966839A (en) * 2020-08-17 2020-11-20 北京奇艺世纪科技有限公司 Data processing method and device, electronic equipment and computer storage medium
CN112037769A (en) * 2020-07-28 2020-12-04 出门问问信息科技有限公司 Training data generation method and device and computer readable storage medium
CN112257407A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Method and device for aligning text in audio, electronic equipment and readable storage medium
CN115906781A (en) * 2022-12-15 2023-04-04 广州文石信息科技有限公司 Method, device and equipment for audio identification and anchor point addition and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140207449A1 (en) * 2013-01-18 2014-07-24 R Paul Johnson Using speech to text for detecting commercials and aligning edited episodes with transcripts
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
US20170004121A1 (en) * 2015-06-30 2017-01-05 Facebook, Inc. Machine-translation based corrections
CN109145149A (en) * 2018-08-16 2019-01-04 科大讯飞股份有限公司 A kind of information alignment schemes, device, equipment and readable storage medium storing program for executing
CN109300468A (en) * 2018-09-12 2019-02-01 科大讯飞股份有限公司 A kind of voice annotation method and device
CN109830229A (en) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 Audio corpus intelligence cleaning method, device, storage medium and computer equipment
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110265001A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Corpus screening technique, device and computer equipment for speech recognition training

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140207449A1 (en) * 2013-01-18 2014-07-24 R Paul Johnson Using speech to text for detecting commercials and aligning edited episodes with transcripts
US20170004121A1 (en) * 2015-06-30 2017-01-05 Facebook, Inc. Machine-translation based corrections
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
CN109145149A (en) * 2018-08-16 2019-01-04 科大讯飞股份有限公司 A kind of information alignment schemes, device, equipment and readable storage medium storing program for executing
CN109300468A (en) * 2018-09-12 2019-02-01 科大讯飞股份有限公司 A kind of voice annotation method and device
CN109830229A (en) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 Audio corpus intelligence cleaning method, device, storage medium and computer equipment
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110265001A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Corpus screening technique, device and computer equipment for speech recognition training

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037769A (en) * 2020-07-28 2020-12-04 出门问问信息科技有限公司 Training data generation method and device and computer readable storage medium
CN111966839A (en) * 2020-08-17 2020-11-20 北京奇艺世纪科技有限公司 Data processing method and device, electronic equipment and computer storage medium
CN111966839B (en) * 2020-08-17 2023-07-25 北京奇艺世纪科技有限公司 Data processing method, device, electronic equipment and computer storage medium
CN112257407A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Method and device for aligning text in audio, electronic equipment and readable storage medium
CN112257407B (en) * 2020-10-20 2024-05-14 网易(杭州)网络有限公司 Text alignment method and device in audio, electronic equipment and readable storage medium
CN115906781A (en) * 2022-12-15 2023-04-04 广州文石信息科技有限公司 Method, device and equipment for audio identification and anchor point addition and readable storage medium
CN115906781B (en) * 2022-12-15 2023-11-24 广州文石信息科技有限公司 Audio identification anchor adding method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN111091834B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN111091834B (en) Text and audio alignment method and related product
CN111161739B (en) Speech recognition method and related product
CN103971684B (en) A kind of add punctuate method, system and language model method for building up, device
CN111291566B (en) Event main body recognition method, device and storage medium
US20130124979A1 (en) Method and Apparatus for Capturing, Analyzing, and Converting Scripts
CN110516203B (en) Dispute focus analysis method, device, electronic equipment and computer-readable medium
CN112541095B (en) Video title generation method and device, electronic equipment and storage medium
WO2023151424A1 (en) Method and apparatus for adjusting playback rate of audio picture of video
CN110750996A (en) Multimedia information generation method and device and readable storage medium
CN112114771A (en) Presentation file playing control method and device
CN112037769A (en) Training data generation method and device and computer readable storage medium
CN113660432B (en) Translation subtitle making method and device, electronic equipment and storage medium
CN111414735A (en) Text data generation method and device
CN107369450B (en) Recording method and recording apparatus
CN105512335B (en) abstract searching method and device
CN111709324A (en) News video strip splitting method based on space-time consistency
CN111382569B (en) Method and device for identifying entity in dialogue corpus and computer equipment
CN113591491A (en) System, method, device and equipment for correcting voice translation text
CN114579796B (en) Machine reading understanding method and device
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114155841A (en) Voice recognition method, device, equipment and storage medium
JP2018081390A (en) Video recorder
CN112966505B (en) Method, device and storage medium for extracting persistent hot phrases from text corpus
CN114222193B (en) Video subtitle time alignment model training method and system
CN110717091B (en) Entry data expansion method and device based on face recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant