CN113191133B - Audio text alignment method and system based on Doc2Vec


Info

Publication number
CN113191133B
CN113191133B (application CN202110438831.7A)
Authority
CN
China
Prior art keywords
text
audio
short
paragraph
representing
Prior art date
Legal status
Active
Application number
CN202110438831.7A
Other languages
Chinese (zh)
Other versions
CN113191133A (en)
Inventor
陈科良
崔岩松
任维政
张晓欢
樊昌熙
孙孟寒
张帅
崔晨岩
Current Assignee
Beijing Huanke Technology Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Huanke Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing Huanke Technology Co ltd and Beijing University of Posts and Telecommunications
Priority to CN202110438831.7A
Publication of CN113191133A
Application granted
Publication of CN113191133B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/60 Information retrieval of audio data
    • G06F16/65 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Physiology (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio-text alignment method and system based on Doc2Vec. The method comprises the following steps: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts; extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts; and performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment. Compared with traditional audio-text alignment algorithms, the method comes closer to the ideal segmentation result on long-audio segmentation, the alignment quality is essentially the same as that of plain Doc2Vec matching, and the time complexity is reduced by about 35%.

Description

Audio text alignment method and system based on Doc2Vec
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for audio-text alignment that improves the efficiency and quality of producing audio-enabled electronic books.
Background
Audio-enabled electronic books can be produced from the association between audio readings and the electronic book text, and they have important practical value in education, especially language education, yet such books are not widely produced at present. Most of them are made by manual annotation, which greatly limits their development. Audio-text alignment technology makes such books technically feasible, but several problems remain. First, regarding audio processing: book audio is generally read and recorded by professionals following the e-book text, so it matches the e-book text closely and each recording is long. If the long audio is recognized directly, decoding takes longer and the recognition accuracy drops. Second, regarding the alignment algorithm: although many alignment algorithms are mature, problems such as high time and space complexity remain.
Therefore, how to provide an audio text alignment method with ideal effect is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a Doc2Vec-based audio-text alignment method and system with high accuracy and high efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a Doc2 Vec-based audio text alignment method comprises the following steps:
step 1: performing threshold estimation on AIC-FCM optimized based on a simulated annealing genetic algorithm, cutting the book-associated long audio into short audio with sentences as dimensionality, performing voice recognition on the short audio, and outputting short text with sentences as dimensionality;
step 2: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph texts with the paragraphs as dimensions;
and step 3: and performing text similarity matching on the short text and the paragraph text by using a dynamic matching method based on a threshold prediction method to complete text alignment.
Preferably, the step 1 specifically includes:
step 11: performing global search and crossover/mutation operations with a genetic algorithm, combined with the simulated annealing operation, to obtain the cluster centers;
step 12: based on the cluster centers, performing fuzzy C-means clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
step 13: determining the optimal cluster number C by the Akaike information criterion, and determining the thresholds for double-threshold endpoint detection according to the optimal cluster number C to complete segmentation of the book-associated long audio;
step 14: preprocessing the short audio, performing speech recognition, and outputting sentence-level short texts.
Preferably, the step 11 specifically includes:
step 111: inputting a book-associated long audio, initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step 112: randomly generating a genetic-algorithm population C_i(T) representing the cluster centers of the audio sample points;
step 113: calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step 114: evolving the population C_i(T) with crossover and mutation operations to obtain a new population C_i'(T);
step 115: recalculating the fitness F(C_i'(T)) of the new population;
step 116: calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if it is finally not accepted, returning to step 114;
step 117: setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), and lowering the temperature as T_{i+1} = α T_i, where α denotes the annealing factor;
step 118: incrementing the generation index i = i + 1 and judging whether the obtained cluster centers have reached the global optimum; if so, outputting the optimized cluster centers of the audio sample points; otherwise, returning to step 114 to continue the evolution.
Preferably, the step 12 specifically includes:
step 121: initializing the membership matrix with random numbers between 0 and 1, subject to the constraint

∑_{j=1}^{C} u_ij = 1, i = 1, 2, ..., N,

where u_ij denotes the membership degree and C denotes the number of clusters;
step 122: calculating the objective function

F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖²,

where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters and N denotes the number of data points to be clustered; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε for the error threshold ε, the required state has been reached and the iteration stops; otherwise, going to step 123;
step 123: updating the membership matrix by recomputing the memberships subject to the constraint, the membership formula being

u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )²,

where m_l enumerates the C cluster centers, and then returning to step 122 to iterate.
Preferably, the step 13 specifically includes:
step 131: assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, the AIC value for cluster number C is calculated as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster, v is the dimension of the feature space, and ε_d is a penalty factor;
step 132: determining the high and low thresholds of the characteristic parameters from the cluster centers corresponding to the optimal cluster number.
Preferably, the step 2 specifically includes:
step 21: DM model training phase: a fixed-size window slides over the input sentence s_i; at each position the sentence vector of the input sentence and the context word vectors inside the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m, yielding the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function;
step 22: DM model inference phase: with the trained model's word vector matrix and parameters U, b held fixed, a gradient descent method is used to obtain each new sentence vector and the sentence vector matrix is updated.
Preferably, the step 3 specifically includes:
step 31: denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
step 32: comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with vector representation V_X = (x_1, x_2, ..., x_N), and Y denotes a paragraph text, with vector representation V_Y = (y_1, y_2, ..., y_N);
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph respectively, thereby achieving text alignment.
Preferably, the method further comprises checking whether the paragraph ending time point obtained by the PS-ALL matching mode is contiguous with the starting time point of the next paragraph, and if not, extending the ending time point to one second before the next paragraph's starting time.
Preferably, the step 13 further comprises:
characterizing the segmentation quality through a segmentation error rate and using it to guide correction of the algorithm:
the segmentation error rate E_C can be expressed as

E_C = (W_L · L_frame + W_S · S_frame) / ALL_frame,

where L_frame denotes the number of audio frames affected by cut-too-long errors, S_frame denotes the number of audio frames affected by cut-too-short errors, ALL_frame denotes the total number of segmented audio frames, and W_L and W_S denote the weights of cut-too-long and cut-too-short errors respectively;
if E_C < ε_E, the audio segmentation meets the requirement; otherwise it needs to be corrected by adjusting the thresholds or by manual correction, where ε_E is a preset threshold.
A Doc2Vec-based audio-text alignment system, comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment.
Compared with the prior art, the Doc2Vec-based audio-text alignment method and system disclosed by the invention take audio-text alignment technology as the core, match the audio book text with its accompanying audio, establish the temporal correspondence between text content and audio, and organically combine listening and reading. Compared with traditional audio-text alignment algorithms, the method comes closer to the ideal segmentation result on long-audio segmentation, the alignment quality is essentially the same as that of plain Doc2Vec matching, and the time complexity is reduced by about 35%.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of an audio text alignment method based on Doc2 Vec.
FIG. 2 is a flow chart of threshold estimation for AIC-FCM optimized by simulated annealing genetic algorithm.
FIG. 3 is a diagram of a DM model architecture.
Fig. 4 is a schematic diagram of the operation of the dynamic matching scheme based on the threshold prediction method.
FIG. 5 is a diagram illustrating the operating principle of PS-First-Last.
FIG. 6 is a schematic diagram of a text alignment collation scheme.
Fig. 7 is a schematic block diagram of an audio text alignment system based on Doc2 Vec.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an audio-text alignment method based on Doc2Vec. The ultimate goal of audio-text alignment is to establish the association between audio and text in the time dimension, i.e. to find the text content corresponding to each audio time interval. Audio-text alignment generally comes at three granularities: paragraph alignment, sentence alignment and word alignment. Since the electronic book itself is organized in paragraph-based elements, producing an audio-enabled book requires alignment at the paragraph level. As shown in fig. 1, the method specifically includes:
step 1: Threshold estimation is performed with an AIC-FCM model optimized by a simulated annealing genetic algorithm, the book-associated long audio is segmented into sentence-level short audio clips, and the short audio is preprocessed and then passed through speech recognition to output sentence-level short texts;
Specifically, as shown in fig. 2, step 1 includes:
step 11: performing global search and crossover/mutation operations with a genetic algorithm, combined with the simulated annealing operation, to obtain the cluster centers;
the genetic algorithm is a self-adaptive global optimization probability search algorithm designed according to selection, crossing and mutation mechanisms in the biological heredity and evolution processes. The method has strong global search capability and can rapidly solve the overall solution in the space. However, genetic algorithms also have weak points of weak local searching capability and slow convergence. The simulated annealing algorithm can effectively get rid of local minimum values, and can reach a global minimum value point with random probability close to 1, so that the weakness of the genetic algorithm can be exactly compensated. Therefore, the simulated annealing heredity combining the two algorithms can enhance the searching capability and the searching efficiency of the clustering algorithm and simultaneously can improve the robustness of the audio threshold detection.
Specifically, the simulated annealing genetic algorithm executes the following steps:
step1. initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step2. randomly generating the genetic-algorithm population C_i(T), which represents the cluster centers of the audio sample points;
step3. calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step4. evolving the population C_i(T) with crossover and mutation operations into C_i'(T), through which possibly better cluster centers are obtained;
step5. recalculating the individual fitness F(C_i'(T)) of the new population;
step6. calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if no acceptable new population is finally obtained, returning to step4;
step7. setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), then cooling the temperature as T_{i+1} = α T_i, where α denotes the annealing factor, and finally incrementing the generation index, i = i + 1;
step8. judging whether the termination condition is met; if the obtained cluster centers satisfy it, outputting the cluster centers required by the FCM; otherwise, going back to step4 and continuing the evolution.
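For illustration only, the following Python sketch mirrors steps step1-step8 on one-dimensional audio feature values (e.g. frame energies); the fitness function, population size and cooling schedule are assumptions made for the sketch, not parameters of the invention.

```python
import numpy as np

def fitness(centers, data):
    # Negative total distance of each point to its nearest center (higher is better);
    # an illustrative choice, not the patent's exact fitness function.
    d = np.abs(data[:, None] - centers[None, :])
    return -d.min(axis=1).sum()

def sa_ga_cluster_centers(data, n_centers=2, pop_size=20, temp=1.0,
                          alpha=0.9, generations=100, seed=None):
    rng = np.random.default_rng(seed)
    # step1-2: initialize a population of candidate center sets
    pop = rng.uniform(data.min(), data.max(), size=(pop_size, n_centers))
    fit = np.array([fitness(ind, data) for ind in pop])             # step3
    for _ in range(generations):
        # step4: crossover (average random parent pairs) and mutation (Gaussian noise)
        parents = pop[rng.integers(0, pop_size, size=(pop_size, 2))]
        children = parents.mean(axis=1)
        children += rng.normal(0, 0.1 * data.std(), children.shape)
        new_fit = np.array([fitness(ind, data) for ind in children])  # step5
        # step6: Metropolis acceptance of each child
        delta = new_fit - fit
        accept = (delta > 0) | (rng.random(pop_size) < np.exp(np.minimum(delta, 0) / temp))
        pop[accept], fit[accept] = children[accept], new_fit[accept]  # step7
        temp *= alpha                                                 # cool down
    return pop[fit.argmax()]                                          # best center set

# usage: centers = sa_ga_cluster_centers(frame_energies, n_centers=2)
```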
Step 12: based on the cluster centers, the fuzzy C-means clustering algorithm performs fuzzy clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
specifically, a fuzzy C-means clustering algorithm (FCM) fuses the essence of a fuzzy theory and is mainly used for clustering analysis of data. The membership degree of each sample data is calculated by iteratively optimizing the objective function, and then the classification of the data is realized. If X is ═ { X ═ Xi1, 2., N represents a data set, M ═ M ·j1, 2.. C } stands for numberThe data set X is divided into C clustered center sets, and the objective function F can be expressed as:
Figure RE-GDA0003128884510000091
where k is the number of clusters of the cluster, uijRepresenting data xiAnd a certain class mjThe degree of similarity, i.e. the degree of membership, is calculated by the formula:
Figure RE-GDA0003128884510000092
membership also has a constraint that the sum is equal to 1, i.e.:
Figure RE-GDA0003128884510000093
||xi-mj| represents data xiAnd a clustering center mjThe distance of (c).
The objective of the FCM algorithm is to obtain, through repeated iteration, the memberships u_ij that minimize the objective function F. The iteration proceeds as follows:
step 121: initializing the membership matrix U with random numbers between 0 and 1, subject to the constraint ∑_{j=1}^{C} u_ij = 1;
step 122: calculating the objective function F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖², where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters and N denotes the number of data points; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε, a satisfactory state has been reached and the iteration stops; otherwise, step 123 is performed;
step 123: computing a new membership matrix from u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )², where m_l enumerates the C cluster centers, and then returning to step 122 to continue the iteration.
In short, the central idea of the FCM algorithm is to assign to each sample a membership to every cluster and to classify the data by these memberships.
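A minimal Python sketch of the iteration above for one-dimensional feature values with fuzzifier exponent 2; the initialization and stopping threshold are illustrative assumptions.

```python
import numpy as np

def fcm(data, centers, eps=1e-4, max_iter=100):
    """Fuzzy C-means for 1-D data given initial cluster centers (e.g. from the
    simulated annealing genetic algorithm). Returns memberships and centers."""
    x = np.asarray(data, dtype=float)           # shape (N,)
    m = np.asarray(centers, dtype=float)        # shape (C,)
    u = np.random.dirichlet(np.ones(len(m)), size=len(x))   # step 121: rows sum to 1
    for _ in range(max_iter):
        d = np.abs(x[:, None] - m[None, :]) + 1e-12          # distances ||x_i - m_j||
        u_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** 2).sum(axis=2)   # step 123
        m = (u_new ** 2 * x[:, None]).sum(axis=0) / (u_new ** 2).sum(axis=0)
        if np.max(np.abs(u_new - u)) < eps:      # step 122 stopping rule
            u = u_new
            break
        u = u_new
    return u, m
```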
Step 13: the optimal cluster number C is determined by the Akaike information criterion, and the thresholds for double-threshold endpoint detection are determined from the optimal cluster number C to complete segmentation of the book-associated long audio.
Specifically, the Akaike Information Criterion (AIC) is a minimum-information criterion that measures the goodness of fit of a statistical model. It is mainly used for model selection, balancing the complexity of the model against its number of parameters. In general form it is a weighted function of the fitting accuracy and the number of unknown parameters, defined as

AIC = −2 ln S(X, P) + ε_d · n_P,

where X = {x_i | i = 1, 2, ..., N} is the set of data features, P = {p_i | i = 1, 2, ..., C} are the model parameters, ln S(X, P) is the log-likelihood of the data feature set X under the model parameters P, n_P is the number of parameters in P, and ε_d is a penalty factor.
The model with the smallest AIC is the best model under the Akaike information criterion. Assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, the AIC value for cluster number C can be calculated as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster and v is the dimension of the feature space.
In the audio endpoint-detection scenario, the initial cluster number is set to C = 2, and the high and low thresholds of the characteristic parameters are determined from the cluster centers corresponding to the optimal cluster number. The energy threshold and the zero-crossing-rate threshold obtained in this way are then used to detect the short-time energy and the average zero-crossing rate of the speech signal; it should be noted that values of the speech signal are only taken within the thresholds and are discarded when they exceed the thresholds.
The double-threshold endpoint detection method is mainly used to detect the start and end points of a speech segment; its two thresholds are an energy threshold and a zero-crossing-rate threshold. The short-time energy of pauses in audio is generally much lower than that of speech, so most pauses can be cut off accurately by the energy threshold alone. The short-time energy E_n of the audio x(n) can be expressed as

E_n = ∑_m s_n(m)², with s_n(m) = x(m) · w(n − m),

where w(n − m) denotes a window function and s_n(m) denotes one frame of the audio x(n). However, the energy of some unvoiced consonants is very close to that of pauses, and using the energy threshold alone would cut those consonants off. The short-time average zero-crossing rate characterizes how often the signal level crosses zero per unit time; the short-time average zero-crossing rate Z_n of the audio x(n) can be expressed as

Z_n = (1/2) ∑_m | sgn[s_n(m)] − sgn[s_n(m − 1)] |,

where sgn is the sign function, i.e. sgn(x) = 1 for x ≥ 0 and sgn(x) = −1 for x < 0.
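A rough Python sketch of the double-threshold idea, assuming frame-level energy and zero-crossing rate computed as above; the frame length, hop size and the way the two thresholds cooperate are illustrative assumptions.

```python
import numpy as np

def frame_features(x, frame_len=400, hop=160):
    """Short-time energy and zero-crossing count per frame (illustrative framing)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    zcr = np.array([0.5 * np.sum(np.abs(np.diff(np.sign(f + 1e-12)))) for f in frames])
    return energy, zcr

def dual_threshold_segments(energy, zcr, e_high, e_low, z_low):
    """Mark frames as speech when energy exceeds the high threshold, then extend the
    boundaries outwards while energy or ZCR stays above the low thresholds."""
    speech = energy > e_high
    for i in np.where(speech)[0]:
        j = i
        while j > 0 and (energy[j - 1] > e_low or zcr[j - 1] > z_low):
            speech[j - 1] = True
            j -= 1
        j = i
        while j + 1 < len(speech) and (energy[j + 1] > e_low or zcr[j + 1] > z_low):
            speech[j + 1] = True
            j += 1
    return speech  # boolean mask; contiguous True runs are sentence-level clips
```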
further, there may be a cut-to-length or cut-to-short error in the long audio slicing process using the dual-threshold end-point slicing technique. The cut-to-length error refers to an error in cutting the pause tone into one frame of short phrase tones, which may be caused by excessive noise energy in the pause tone. A truncation error refers to an error that divides a continuous phrase speech into two frames of speech, and this error is generally caused by a pause of a certain part of the continuous speech. In order to calculate the accuracy of audio segmentation and correct the accuracy, the invention introduces segmentation error rate to characterize the segmentation error rate and guide the algorithm to correct the error rate. If with LframeNumber of audio frames representing cut-to-length, in SframeIndicating the number of audio frames cut short, in ALLframeRepresenting the total number of audio frames sliced, then the slicing error rate ECCan be expressed as:
Figure RE-GDA0003128884510000123
weights W are defined in the formula for the cut-to-length errors and the cut-to-short errors, respectivelyLAnd WSAnd in general WS>WL. This is because the severity of the error caused by the cut-off of the short audio in the practical application scenario is much greater than the pause sound contained in the short audio, and the influence of this factor on the slicing error rate can be expressed by introducing a weight into the formula.
The segmentation error rate has a threshold value epsilonEIf E isC<εEThe short audio segmentation may be considered to have met the requirements, otherwise the short audio needs to be corrected by adjusting the threshold or by manual correction. The threshold value epsilon may be different due to the difference in characteristics of the length, energy, etc. of the stop-consonants of different types of audioEIt is generally determined experimentally by selecting the same type of audio.
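The error-rate computation itself is a one-liner; the sketch below follows the formula above, with illustrative weights.

```python
def segmentation_error_rate(long_frames, short_frames, all_frames,
                            w_long=1.0, w_short=2.0):
    """Weighted segmentation error rate E_C; w_short > w_long reflects that
    truncating speech is considered more harmful than keeping a pause."""
    return (w_long * long_frames + w_short * short_frames) / all_frames

# usage: ok = segmentation_error_rate(120, 40, 90000) < 0.01  # eps_E chosen per audio type
```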
After this error correction, the short audio can be fed directly into the speech recognition system for recognition.
Step 14: The short audio is preprocessed and then passed through speech recognition to output sentence-level short texts;
specifically, for text data in epub, markdown and other formats, an open-source analysis tool is adopted to extract segment content in the text. The extracted text content may have interfering elements such as punctuation, special characters, etc. in format. For such interfering elements, regular expression methods may be used to process these elements. And obtaining the text information after interference removal through some set regular expressions according to the extracted segment content.
As for the speech recognition system itself, the related technology is very mature and recognizes well: there are open-source speech recognition tools such as CMU Sphinx, Kaldi, HTK and ASRT, as well as commercial platforms such as iFlytek speech recognition and the Baidu AI platform. The speech recognition in the invention uses an open-source tool to recognize the preprocessed short audio; each output short text carries the time interval of its short audio within the original long audio.
Step 2: paragraphs of the electronic book are extracted based on a Doc2Vec model to obtain paragraph-level texts;
in particular, the Doc2vec model is developed from a word2vec model, and the Doc2vec model expands the capability of calculating vector representation of long texts (sentences, paragraphs and the like) on the basis of predicting word vectors. This model can obtain a fixed length sentence vector and word vector, where the sentence vector stores context information missing from the topic or word vector of the current paragraph. There are also two training modes for the Doc2vec model: distributed Memory (DM) and Distributed Bag of Words (DBOW), corresponding to CBOW and Skip-gram in the word2vec model, respectively. From the results verified in the experiments of Tomas Mikolov, it can be seen that the paragraph vectors obtained by DM in most classification tasks perform better than DBOW, so the invention uses the DM model to perform the sentence vector calculation, as shown in fig. 3.
The idea of the DM model is to predict a word with the highest probability of occurrence in the current context by inputting a sentence vector and several word vectors in the sentence.
The training procedure of the model is as follows: a fixed-size window slides over the input sentence s_i; every time it moves to a position, the sentence vector of the input sentence and the context word vectors in the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m. Consistent with the CBOW training method of the word2vec model, the final purpose of DM training is likewise to obtain the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function. In this process, each prediction of the DM model uses the semantic information of the sentence s_i.
In the inference stage, for a new sentence, the trained DM model, the word vector matrix X and the parameters U, b are held fixed, and gradient descent is used to obtain the new sentence vector while updating the sentence vector matrix S.
The DM model requires that, given the context, the probability of the predicted word be maximized by updating the parameters, i.e. the average log-likelihood is maximized. The average log-likelihood is defined as

L = (1/C) ∑_{m=k}^{C−k} log p(x_m | s_i, x_{m−k}, ..., x_{m+k}),

where C denotes the total number of words, k the window width used in training, and s_i the sentence vector of the sentence containing the selected context words. The prediction itself is completed with a multi-class classifier such as the Softmax function; the conditional probability p(x_m | s_i, x_{m−k}, ..., x_{m+k}) is defined as

p(x_m | s_i, x_{m−k}, ..., x_{m+k}) = e^{y_m} / ∑_j e^{y_j},

where y_j denotes the unnormalized output value for word x_j. If h is the vector obtained by averaging or concatenating the row vectors extracted from the sentence vector matrix S_{V×N} and the word vector matrix X_{V×N}, then y is computed as

y = b + U h(s_i, x_{m−k}, ..., x_{m+k}; S, X).
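As an illustration of how DM-mode paragraph vectors can be trained and inferred in practice, the sketch below uses the gensim Doc2Vec implementation; the vector size, window and epoch counts are illustrative assumptions, not the settings of the invention.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_paragraph_vectors(paragraphs):
    """paragraphs: list of token lists, one per e-book paragraph (dm=1 selects the DM mode)."""
    docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(paragraphs)]
    return Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=2, epochs=40)

def sentence_vector(model, tokens):
    """Inference stage: word vectors and softmax parameters stay fixed,
    only the new sentence vector is fitted by gradient descent."""
    return model.infer_vector(tokens, epochs=50)
```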
and step 3: text similarity matching is performed between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment.
Specifically, after the book audio is segmented and recognized, sentence-level short texts are obtained, denoted SText, while the texts stored in the electronic book are all organized by paragraph, denoted PText. Since the practical application only requires alignment at the paragraph level between the audio and the e-book, computing the similarity of all texts would require splitting each PText into sentences and computing similarity with every SText one by one, which wastes a great deal of computation. The invention therefore designs a method that dynamically decides between full matching (PS-ALL) and head-to-tail matching (PS-First-Last) according to the ratio of the PText length to the mean SText length. The working principle is shown in fig. 4:
the scheme first calculates the character length D of all STextSAnd their average values DSThen, the paragraph texts in the electronic book are sequentially fetched and the length D thereof is calculatedP。DPAnd
Figure RE-GDA0003128884510000151
relative to each otherThe relationship determines whether the classifier will select PS-ALL or PS-First-Last. If D isP>αDSThe paragraph is considered longer and PS-First-Last is used, otherwise PS-ALL is used. Wherein α is a threshold decision coefficient, which can be dynamically adjusted to improve the decision accuracy when processing different types of electronic books.
The principle of PS-ALL is simple: it directly computes the text similarity from the vector representations of SText and PText, thereby completing the alignment. Concretely, a cosine similarity measure is used, which turns the similarity between texts into the cosine of the angle between their vectors: the smaller the angle, the higher the text similarity. The texts to be matched obtain their sentence vectors from the trained Doc2Vec model, and their similarity follows from the cosine formula. If X denotes a short text obtained from audio recognition, with vector V_X = (x_1, x_2, ..., x_N), and Y denotes a paragraph text from the e-book, with vector V_Y = (y_1, y_2, ..., y_N), the similarity of X and Y can be expressed as

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ).

The closer φ(X, Y) is to 1, the smaller the angle between the vectors and the higher the similarity of the two texts.
The principle of PS-First-Last is somewhat more involved: it first extracts from the paragraph a head segment and a tail segment, each of character length D̄_S, and then finds in turn the two short texts SText_first and SText_last with the highest similarity to the head and the tail of the paragraph respectively. These two short texts delimit a time interval in the book audio, so the e-book paragraph can be aligned with the corresponding audio passage. The working principle is shown in fig. 5:
Furthermore, a correction step is added at the end of the scheme, because misuse of PS-ALL may produce a matched audio interval that is too short. This step checks whether every paragraph ending time point obtained with PS-ALL is contiguous with the starting time point of the following paragraph; if not, the ending time point is extended to one second before the next paragraph's starting time so that the audio timeline is covered completely. The text-alignment proofreading scheme is shown in fig. 6.
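The sketch below ties the pieces together, choosing PS-ALL or PS-First-Last per paragraph and applying the end-time correction; it reuses cosine_similarity from the earlier sketch, measures lengths in tokens rather than characters for simplicity, assumes each SText carries (tokens, start, end) time stamps from recognition, and the value of alpha is an assumption.

```python
def align_paragraphs(stexts, ptexts, model, alpha=3.0):
    """stexts: list of (tokens, start, end); ptexts: list of paragraph token lists.
    Returns one (start, end) interval per paragraph (illustrative sketch)."""
    svecs = [model.infer_vector(t) for t, _, _ in stexts]
    d_avg = sum(len(t) for t, _, _ in stexts) / len(stexts)      # mean SText length
    intervals = []
    for tokens in ptexts:
        if len(tokens) > alpha * d_avg:                          # PS-First-Last: long paragraph
            head, tail = tokens[:int(d_avg)], tokens[-int(d_avg):]
            head_vec, tail_vec = model.infer_vector(head), model.infer_vector(tail)
            first = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], head_vec))
            last = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], tail_vec))
            intervals.append((stexts[first][1], stexts[last][2]))
        else:                                                    # PS-ALL: whole-paragraph match
            pvec = model.infer_vector(tokens)
            best = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], pvec))
            intervals.append((stexts[best][1], stexts[best][2]))
    # correction: extend non-contiguous endings to one second before the next start
    # (applied to every paragraph here for simplicity)
    for i in range(len(intervals) - 1):
        start, end = intervals[i]
        next_start = intervals[i + 1][0]
        if end < next_start:
            intervals[i] = (start, next_start - 1.0)
    return intervals
```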
The embodiment also discloses a Doc2Vec-based audio-text alignment system, as shown in fig. 7, comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment, and finally outputting the text with audio time stamps.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Accordingly, the invention is not to be limited to the embodiments shown herein.

Claims (9)

1. A Doc2Vec-based audio-text alignment method, characterized by comprising the following steps:
step 1: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
step 2: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
and step 3: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment;
the step 3 specifically comprises:
step 31: denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
step 32: comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with N-dimensional vector V_X = (x_1, x_2, ..., x_N) whose elements are x_i, and Y denotes a paragraph text, with N-dimensional vector V_Y = (y_1, y_2, ..., y_N) whose elements are y_i;
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph, thereby achieving text alignment.
2. The Doc2Vec-based audio-text alignment method according to claim 1, wherein the step 1 specifically comprises:
step 11: performing global search and crossover/mutation operations with the simulated annealing genetic algorithm to obtain the cluster centers;
step 12: based on the cluster centers, performing fuzzy C-means clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
step 13: determining the optimal cluster number C by the Akaike information criterion, and determining the thresholds for double-threshold endpoint detection according to the optimal cluster number C to complete segmentation of the book-associated long audio;
step 14: preprocessing the short audio, performing speech recognition, and outputting sentence-level short texts.
3. The Doc2Vec-based audio-text alignment method according to claim 2, wherein the step 11 specifically comprises:
step 111: inputting a book-associated long audio, initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step 112: randomly generating a genetic-algorithm population C_i(T) representing the cluster centers of the audio sample points;
step 113: calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step 114: evolving the population C_i(T) with crossover and mutation operations to obtain a new population C_i'(T);
step 115: recalculating the fitness F(C_i'(T)) of the new population;
step 116: calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if it is finally not accepted, returning to step 114;
step 117: setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), and lowering the temperature as T_{i+1} = α T_i, where α denotes the annealing factor;
step 118: incrementing the generation index i = i + 1 and judging whether the obtained cluster centers have reached the global optimum; if so, outputting the optimized cluster centers of the audio sample points; otherwise, returning to step 114 to continue the evolution.
4. The Doc2Vec-based audio-text alignment method according to claim 3, wherein the step 12 specifically comprises:
step 121: initializing the membership matrix with random numbers between 0 and 1, subject to the constraint

∑_{j=1}^{C} u_ij = 1, i = 1, 2, ..., N,

where u_ij denotes the membership degree and C denotes the number of clusters;
step 122: calculating the objective function

F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖²,

where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters, and N denotes the number of data points to be clustered; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε for the error threshold ε, the required state has been reached and the iteration stops; otherwise, going to step 123;
step 123: updating the membership matrix by recomputing the memberships subject to the constraint, the membership formula being

u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )²,

where m_l enumerates the C cluster centers, and then returning to step 122 to iterate.
5. The Doc2Vec-based audio-text alignment method according to claim 4, wherein the step 13 specifically comprises:
step 131: assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, calculating the AIC value for cluster number C as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster, v is the dimension of the feature space, and ε_d is a penalty factor;
step 132: determining the high and low thresholds of the characteristic parameters from the cluster centers corresponding to the optimal cluster number.
6. The Doc2Vec-based audio-text alignment method according to claim 1, wherein the step 2 specifically comprises:
step 21: DM model training phase: a fixed-size window slides over the input sentence s_i; at each position the sentence vector of the input sentence and the context word vectors in the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m, yielding the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function;
step 22: DM model inference phase: with the trained model's word vector matrix and parameters U, b held fixed, a gradient descent method is used to obtain each new sentence vector and the sentence vector matrix is updated.
7. The Doc2Vec-based audio-text alignment method of claim 1, further comprising checking whether the paragraph ending time point obtained by PS-ALL matching is contiguous with the starting time point of the next paragraph, and if not, extending the ending time point to one second before the next paragraph's starting time point.
8. The Doc2Vec-based audio-text alignment method according to claim 2, wherein the step 13 further comprises:
characterizing the segmentation quality through a segmentation error rate and using it to guide correction of the algorithm:
the segmentation error rate E_C can be expressed as

E_C = (W_L · L_frame + W_S · S_frame) / ALL_frame,

where L_frame denotes the number of audio frames affected by cut-too-long errors, S_frame denotes the number of audio frames affected by cut-too-short errors, ALL_frame denotes the total number of segmented audio frames, and W_L and W_S denote the weights of cut-too-long and cut-too-short errors respectively;
if E_C < ε_E, the audio segmentation meets the requirement; otherwise it needs to be corrected by adjusting the thresholds or by manual correction, where ε_E is a preset threshold.
9. A Doc2Vec-based audio-text alignment system, characterized by comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment, the specific process being:
denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with N-dimensional vector V_X = (x_1, x_2, ..., x_N) whose elements are x_i, and Y denotes a paragraph text, with N-dimensional vector V_Y = (y_1, y_2, ..., y_N) whose elements are y_i;
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph, thereby achieving text alignment.
CN202110438831.7A 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec Active CN113191133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110438831.7A CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110438831.7A CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Publications (2)

Publication Number Publication Date
CN113191133A CN113191133A (en) 2021-07-30
CN113191133B (en) 2021-12-21

Family

ID=76978588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110438831.7A Active CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Country Status (1)

Country Link
CN (1) CN113191133B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114630238B (en) * 2022-03-15 2024-05-17 广州宏牌音响有限公司 Stage sound box volume control method and device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
CN111398832A (en) * 2020-03-19 2020-07-10 哈尔滨工程大学 Bus battery SOC prediction method based on ANFIS model
CN111459446A (en) * 2020-03-27 2020-07-28 掌阅科技股份有限公司 Resource processing method of electronic book, computing equipment and computer storage medium
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710333B (en) * 2009-11-26 2012-07-04 西北工业大学 Network text segmenting method based on genetic algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
CN111398832A (en) * 2020-03-19 2020-07-10 哈尔滨工程大学 Bus battery SOC prediction method based on ANFIS model
CN111459446A (en) * 2020-03-27 2020-07-28 掌阅科技股份有限公司 Resource processing method of electronic book, computing equipment and computer storage medium
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Paragraph vectorization - the doc2vec model" (model notes series), Shuibi Xiaoxin, https://zhuanlan.zhihu.com/p/138909653, 2020-05-08, full text *
"An improved fuzzy C-means clustering algorithm", Xu Yiping et al., Journal of Xuzhou Institute of Technology, vol. 23, no. 4, April 2008, pp. 34-36 *

Also Published As

Publication number Publication date
CN113191133A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
US8990084B2 (en) Method of active learning for automatic speech recognition
JP4571822B2 (en) Language model discrimination training for text and speech classification
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
EP0763816A2 (en) Discriminative utterance verification for connected digits recognition
CN111640418B (en) Prosodic phrase identification method and device and electronic equipment
US20070067171A1 (en) Updating hidden conditional random field model parameters after processing individual training samples
US20090055182A1 (en) Discriminative Training of Hidden Markov Models for Continuous Speech Recognition
CN113191133B (en) Audio text alignment method and system based on Doc2Vec
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
CN115617955B (en) Hierarchical prediction model training method, punctuation symbol recovery method and device
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
CN111986650B (en) Method and system for assisting voice evaluation by means of language identification
JPH0250198A (en) Voice recognizing system
CN116189671B (en) Data mining method and system for language teaching
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
JP5288378B2 (en) Acoustic model speaker adaptation apparatus and computer program therefor
JP3920749B2 (en) Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model
JP2938866B1 (en) Statistical language model generation device and speech recognition device
CN114927144A (en) Voice emotion recognition method based on attention mechanism and multi-task learning
CN114898776A (en) Voice emotion recognition method of multi-scale feature combined multi-task CNN decision tree
Chen et al. Research on Chinese audio and text alignment algorithm based on AIC-FCM and Doc2Vec
JPH10254477A (en) Phonemic boundary detector and speech recognition device
Xu et al. A Novel Information Integration Algorithm for Speech Recognition System: Basing on Adaptive Clustering and Supervised State of Acoustic Feature
Chen et al. A Novel Information Integration Algorithm for Speech Recognition System: Basing on Adaptive Clustering and Supervised State of Acoustic Feature
CN116825091A (en) Fake identification analysis system with text content combing advantage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant