CN113191133B - Audio text alignment method and system based on Doc2Vec


Info

Publication number
CN113191133B
CN113191133B (application CN202110438831.7A)
Authority
CN
China
Prior art keywords
text
audio
short
paragraph
representing
Prior art date
Legal status
Active
Application number
CN202110438831.7A
Other languages
Chinese (zh)
Other versions
CN113191133A (en)
Inventor
陈科良
崔岩松
任维政
张晓欢
樊昌熙
孙孟寒
张帅
崔晨岩
Current Assignee
Beijing Huanke Technology Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Huanke Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing Huanke Technology Co ltd and Beijing University of Posts and Telecommunications
Priority to CN202110438831.7A
Publication of CN113191133A
Application granted
Publication of CN113191133B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/60 Information retrieval of audio data
    • G06F16/65 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Physiology (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio-text alignment method and system based on Doc2Vec. The method comprises the following steps: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts; extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts; and performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment. Compared with traditional audio-text alignment algorithms, the method comes closer to the ideal segmentation result on long-audio segmentation, the alignment quality is essentially the same as that of plain Doc2Vec matching, and the time complexity is reduced by about 35%.

Description

Audio text alignment method and system based on Doc2Vec
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for audio-text alignment that improves the efficiency and quality of producing audio-enabled electronic books.
Background
Audio-enabled electronic books can be produced from the association between audio readings and the electronic book text, and they have important practical value in education, especially language education, yet such books are not widely produced at present. Most of them are made by manual annotation, which greatly limits their development. Audio-text alignment technology makes such books technically feasible, but several problems remain. First, regarding audio processing: book audio is generally read and recorded by professionals following the e-book text, so it matches the e-book text closely and each recording is long. If the long audio is recognized directly, decoding takes longer and the recognition accuracy drops. Second, regarding the alignment algorithm: although many alignment algorithms are mature, problems such as high time and space complexity remain.
Therefore, how to provide an audio text alignment method with ideal effect is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a Doc2Vec-based audio-text alignment method and system with high accuracy and high efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a Doc2 Vec-based audio text alignment method comprises the following steps:
step 1: performing threshold estimation on AIC-FCM optimized based on a simulated annealing genetic algorithm, cutting the book-associated long audio into short audio with sentences as dimensionality, performing voice recognition on the short audio, and outputting short text with sentences as dimensionality;
step 2: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph texts with the paragraphs as dimensions;
and step 3: and performing text similarity matching on the short text and the paragraph text by using a dynamic matching method based on a threshold prediction method to complete text alignment.
Preferably, the step 1 specifically includes:
step 11: performing global search and crossover/mutation operations with a genetic algorithm, combined with the simulated annealing operation, to obtain the cluster centers;
step 12: based on the cluster centers, performing fuzzy C-means clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
step 13: determining the optimal cluster number C by the Akaike information criterion, and determining the thresholds for double-threshold endpoint detection according to the optimal cluster number C to complete segmentation of the book-associated long audio;
step 14: preprocessing the short audio, performing speech recognition, and outputting sentence-level short texts.
Preferably, the step 11 specifically includes:
step 111: inputting a book-associated long audio, initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step 112: randomly generating a genetic-algorithm population C_i(T) representing the cluster centers of the audio sample points;
step 113: calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step 114: evolving the population C_i(T) with crossover and mutation operations to obtain a new population C_i'(T);
step 115: recalculating the fitness F(C_i'(T)) of the new population;
step 116: calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if it is finally not accepted, returning to step 114;
step 117: setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), and lowering the temperature as T_{i+1} = α T_i, where α denotes the annealing factor;
step 118: incrementing the generation index i = i + 1 and judging whether the obtained cluster centers have reached the global optimum; if so, outputting the optimized cluster centers of the audio sample points; otherwise, returning to step 114 to continue the evolution.
Preferably, the step 12 specifically includes:
step 121: initializing the membership matrix with random numbers between 0 and 1, subject to the constraint

∑_{j=1}^{C} u_ij = 1, i = 1, 2, ..., N,

where u_ij denotes the membership degree and C denotes the number of clusters;
step 122: calculating the objective function

F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖²,

where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters and N denotes the number of data points to be clustered; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε for the error threshold ε, the required state has been reached and the iteration stops; otherwise, going to step 123;
step 123: updating the membership matrix by recomputing the memberships subject to the constraint, the membership formula being

u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )²,

where m_l enumerates the C cluster centers, and then returning to step 122 to iterate.
Preferably, the step 13 specifically includes:
step 131: assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, the AIC value for cluster number C is calculated as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster, v is the dimension of the feature space, and ε_d is a penalty factor;
step 132: determining the high and low thresholds of the characteristic parameters from the cluster centers corresponding to the optimal cluster number.
Preferably, the step 2 specifically includes:
step 21: DM model training phase: a fixed-size window slides over the input sentence s_i; at each position the sentence vector of the input sentence and the context word vectors inside the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m, yielding the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function;
step 22: DM model inference phase: with the trained model's word vector matrix and parameters U, b held fixed, a gradient descent method is used to obtain each new sentence vector and the sentence vector matrix is updated.
Preferably, the step 3 specifically includes:
step 31: denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
step 32: comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with vector representation V_X = (x_1, x_2, ..., x_N), and Y denotes a paragraph text, with vector representation V_Y = (y_1, y_2, ..., y_N);
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph respectively, thereby achieving text alignment.
Preferably, the method further comprises checking whether the paragraph ending time point obtained by the PS-ALL matching mode is contiguous with the starting time point of the next paragraph, and if not, extending the ending time point to one second before the next paragraph's starting time.
Preferably, the step 13 further comprises:
characterizing the segmentation quality through a segmentation error rate and using it to guide correction of the algorithm:
the segmentation error rate E_C can be expressed as

E_C = (W_L · L_frame + W_S · S_frame) / ALL_frame,

where L_frame denotes the number of audio frames affected by cut-too-long errors, S_frame denotes the number of audio frames affected by cut-too-short errors, ALL_frame denotes the total number of segmented audio frames, and W_L and W_S denote the weights of cut-too-long and cut-too-short errors respectively;
if E_C < ε_E, the audio segmentation meets the requirement; otherwise it needs to be corrected by adjusting the thresholds or by manual correction, where ε_E is a preset threshold.
A Doc2Vec-based audio-text alignment system, comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment.
Compared with the prior art, the Doc2Vec-based audio-text alignment method and system disclosed by the invention take audio-text alignment technology as the core, match the audio book text with its accompanying audio, establish the temporal correspondence between text content and audio, and organically combine listening and reading. Compared with traditional audio-text alignment algorithms, the method comes closer to the ideal segmentation result on long-audio segmentation, the alignment quality is essentially the same as that of plain Doc2Vec matching, and the time complexity is reduced by about 35%.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of an audio text alignment method based on Doc2 Vec.
FIG. 2 is a flow chart of threshold estimation for AIC-FCM optimized by simulated annealing genetic algorithm.
FIG. 3 is a diagram of a DM model architecture.
Fig. 4 is a schematic diagram of the operation of the dynamic matching scheme based on the threshold prediction method.
FIG. 5 is a diagram illustrating the operating principle of PS-First-Last.
FIG. 6 is a schematic diagram of a text alignment collation scheme.
Fig. 7 is a schematic block diagram of an audio text alignment system based on Doc2 Vec.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an audio-text alignment method based on Doc2Vec. The ultimate goal of audio-text alignment is to establish the association between audio and text in the time dimension, i.e. to find the text content corresponding to each audio time interval. Audio-text alignment generally comes at three granularities: paragraph alignment, sentence alignment and word alignment. Since the electronic book itself is organized in paragraph-based elements, producing an audio-enabled book requires alignment at the paragraph level. As shown in fig. 1, the method specifically includes:
step 1: Threshold estimation is performed with an AIC-FCM model optimized by a simulated annealing genetic algorithm, the book-associated long audio is segmented into sentence-level short audio clips, and the short audio is preprocessed and then passed through speech recognition to output sentence-level short texts;
Specifically, as shown in fig. 2, step 1 includes:
step 11: performing global search and crossover/mutation operations with a genetic algorithm, combined with the simulated annealing operation, to obtain the cluster centers;
the genetic algorithm is a self-adaptive global optimization probability search algorithm designed according to selection, crossing and mutation mechanisms in the biological heredity and evolution processes. The method has strong global search capability and can rapidly solve the overall solution in the space. However, genetic algorithms also have weak points of weak local searching capability and slow convergence. The simulated annealing algorithm can effectively get rid of local minimum values, and can reach a global minimum value point with random probability close to 1, so that the weakness of the genetic algorithm can be exactly compensated. Therefore, the simulated annealing heredity combining the two algorithms can enhance the searching capability and the searching efficiency of the clustering algorithm and simultaneously can improve the robustness of the audio threshold detection.
Specifically, the simulated annealing genetic algorithm executes the following steps:
step1. initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step2. randomly generating the genetic-algorithm population C_i(T), which represents the cluster centers of the audio sample points;
step3. calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step4. evolving the population C_i(T) with crossover and mutation operations into C_i'(T), through which possibly better cluster centers are obtained;
step5. recalculating the individual fitness F(C_i'(T)) of the new population;
step6. calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if no acceptable new population is finally obtained, returning to step4;
step7. setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), then cooling the temperature as T_{i+1} = α T_i, where α denotes the annealing factor, and finally incrementing the generation index, i = i + 1;
step8. judging whether the termination condition is met; if the obtained cluster centers satisfy it, outputting the cluster centers required by the FCM; otherwise, going back to step4 and continuing the evolution.
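For illustration only, the following Python sketch mirrors steps step1-step8 on one-dimensional audio feature values (e.g. frame energies); the fitness function, population size and cooling schedule are assumptions made for the sketch, not parameters of the invention.

```python
import numpy as np

def fitness(centers, data):
    # Negative total distance of each point to its nearest center (higher is better);
    # an illustrative choice, not the patent's exact fitness function.
    d = np.abs(data[:, None] - centers[None, :])
    return -d.min(axis=1).sum()

def sa_ga_cluster_centers(data, n_centers=2, pop_size=20, temp=1.0,
                          alpha=0.9, generations=100, seed=None):
    rng = np.random.default_rng(seed)
    # step1-2: initialize a population of candidate center sets
    pop = rng.uniform(data.min(), data.max(), size=(pop_size, n_centers))
    fit = np.array([fitness(ind, data) for ind in pop])             # step3
    for _ in range(generations):
        # step4: crossover (average random parent pairs) and mutation (Gaussian noise)
        parents = pop[rng.integers(0, pop_size, size=(pop_size, 2))]
        children = parents.mean(axis=1)
        children += rng.normal(0, 0.1 * data.std(), children.shape)
        new_fit = np.array([fitness(ind, data) for ind in children])  # step5
        # step6: Metropolis acceptance of each child
        delta = new_fit - fit
        accept = (delta > 0) | (rng.random(pop_size) < np.exp(np.minimum(delta, 0) / temp))
        pop[accept], fit[accept] = children[accept], new_fit[accept]  # step7
        temp *= alpha                                                 # cool down
    return pop[fit.argmax()]                                          # best center set

# usage: centers = sa_ga_cluster_centers(frame_energies, n_centers=2)
```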
Step 12: based on the cluster centers, the fuzzy C-means clustering algorithm performs fuzzy clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
specifically, a fuzzy C-means clustering algorithm (FCM) fuses the essence of a fuzzy theory and is mainly used for clustering analysis of data. The membership degree of each sample data is calculated by iteratively optimizing the objective function, and then the classification of the data is realized. If X is ═ { X ═ Xi1, 2., N represents a data set, M ═ M ·j1, 2.. C } stands for numberThe data set X is divided into C clustered center sets, and the objective function F can be expressed as:
Figure RE-GDA0003128884510000091
where k is the number of clusters of the cluster, uijRepresenting data xiAnd a certain class mjThe degree of similarity, i.e. the degree of membership, is calculated by the formula:
Figure RE-GDA0003128884510000092
membership also has a constraint that the sum is equal to 1, i.e.:
Figure RE-GDA0003128884510000093
||xi-mj| represents data xiAnd a clustering center mjThe distance of (c).
The objective of the FCM algorithm is to obtain, through repeated iteration, the memberships u_ij that minimize the objective function F. The iteration proceeds as follows:
step 121: initializing the membership matrix U with random numbers between 0 and 1, subject to the constraint ∑_{j=1}^{C} u_ij = 1;
step 122: calculating the objective function F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖², where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters and N denotes the number of data points; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε, a satisfactory state has been reached and the iteration stops; otherwise, step 123 is performed;
step 123: computing a new membership matrix from u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )², where m_l enumerates the C cluster centers, and then returning to step 122 to continue the iteration.
In short, the central idea of the FCM algorithm is to assign to each sample a membership to every cluster and to classify the data by these memberships.
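A minimal Python sketch of the iteration above for one-dimensional feature values with fuzzifier exponent 2; the initialization and stopping threshold are illustrative assumptions.

```python
import numpy as np

def fcm(data, centers, eps=1e-4, max_iter=100):
    """Fuzzy C-means for 1-D data given initial cluster centers (e.g. from the
    simulated annealing genetic algorithm). Returns memberships and centers."""
    x = np.asarray(data, dtype=float)           # shape (N,)
    m = np.asarray(centers, dtype=float)        # shape (C,)
    u = np.random.dirichlet(np.ones(len(m)), size=len(x))   # step 121: rows sum to 1
    for _ in range(max_iter):
        d = np.abs(x[:, None] - m[None, :]) + 1e-12          # distances ||x_i - m_j||
        u_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** 2).sum(axis=2)   # step 123
        m = (u_new ** 2 * x[:, None]).sum(axis=0) / (u_new ** 2).sum(axis=0)
        if np.max(np.abs(u_new - u)) < eps:      # step 122 stopping rule
            u = u_new
            break
        u = u_new
    return u, m
```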
Step 13: the optimal cluster number C is determined by the Akaike information criterion, and the thresholds for double-threshold endpoint detection are determined from the optimal cluster number C to complete segmentation of the book-associated long audio.
Specifically, the Akaike Information Criterion (AIC) is a minimum-information criterion that measures the goodness of fit of a statistical model. It is mainly used for model selection, balancing the complexity of the model against its number of parameters. In general form it is a weighted function of the fitting accuracy and the number of unknown parameters, defined as

AIC = −2 ln S(X, P) + ε_d · n_P,

where X = {x_i | i = 1, 2, ..., N} is the set of data features, P = {p_i | i = 1, 2, ..., C} are the model parameters, ln S(X, P) is the log-likelihood of the data feature set X under the model parameters P, n_P is the number of parameters in P, and ε_d is a penalty factor.
The model with the smallest AIC is the best model under the Akaike information criterion. Assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, the AIC value for cluster number C can be calculated as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster and v is the dimension of the feature space.
In the audio endpoint-detection scenario, the initial cluster number is set to C = 2, and the high and low thresholds of the characteristic parameters are determined from the cluster centers corresponding to the optimal cluster number. The energy threshold and the zero-crossing-rate threshold obtained in this way are then used to detect the short-time energy and the average zero-crossing rate of the speech signal; it should be noted that values of the speech signal are only taken within the thresholds and are discarded when they exceed the thresholds.
The double-threshold endpoint detection method is mainly used to detect the start and end points of a speech segment; its two thresholds are an energy threshold and a zero-crossing-rate threshold. The short-time energy of pauses in audio is generally much lower than that of speech, so most pauses can be cut off accurately by the energy threshold alone. The short-time energy E_n of the audio x(n) can be expressed as

E_n = ∑_m s_n(m)², with s_n(m) = x(m) · w(n − m),

where w(n − m) denotes a window function and s_n(m) denotes one frame of the audio x(n). However, the energy of some unvoiced consonants is very close to that of pauses, and using the energy threshold alone would cut those consonants off. The short-time average zero-crossing rate characterizes how often the signal level crosses zero per unit time; the short-time average zero-crossing rate Z_n of the audio x(n) can be expressed as

Z_n = (1/2) ∑_m | sgn[s_n(m)] − sgn[s_n(m − 1)] |,

where sgn is the sign function, i.e. sgn(x) = 1 for x ≥ 0 and sgn(x) = −1 for x < 0.
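A rough Python sketch of the double-threshold idea, assuming frame-level energy and zero-crossing rate computed as above; the frame length, hop size and the way the two thresholds cooperate are illustrative assumptions.

```python
import numpy as np

def frame_features(x, frame_len=400, hop=160):
    """Short-time energy and zero-crossing count per frame (illustrative framing)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    zcr = np.array([0.5 * np.sum(np.abs(np.diff(np.sign(f + 1e-12)))) for f in frames])
    return energy, zcr

def dual_threshold_segments(energy, zcr, e_high, e_low, z_low):
    """Mark frames as speech when energy exceeds the high threshold, then extend the
    boundaries outwards while energy or ZCR stays above the low thresholds."""
    speech = energy > e_high
    for i in np.where(speech)[0]:
        j = i
        while j > 0 and (energy[j - 1] > e_low or zcr[j - 1] > z_low):
            speech[j - 1] = True
            j -= 1
        j = i
        while j + 1 < len(speech) and (energy[j + 1] > e_low or zcr[j + 1] > z_low):
            speech[j + 1] = True
            j += 1
    return speech  # boolean mask; contiguous True runs are sentence-level clips
```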
further, there may be a cut-to-length or cut-to-short error in the long audio slicing process using the dual-threshold end-point slicing technique. The cut-to-length error refers to an error in cutting the pause tone into one frame of short phrase tones, which may be caused by excessive noise energy in the pause tone. A truncation error refers to an error that divides a continuous phrase speech into two frames of speech, and this error is generally caused by a pause of a certain part of the continuous speech. In order to calculate the accuracy of audio segmentation and correct the accuracy, the invention introduces segmentation error rate to characterize the segmentation error rate and guide the algorithm to correct the error rate. If with LframeNumber of audio frames representing cut-to-length, in SframeIndicating the number of audio frames cut short, in ALLframeRepresenting the total number of audio frames sliced, then the slicing error rate ECCan be expressed as:
Figure RE-GDA0003128884510000123
weights W are defined in the formula for the cut-to-length errors and the cut-to-short errors, respectivelyLAnd WSAnd in general WS>WL. This is because the severity of the error caused by the cut-off of the short audio in the practical application scenario is much greater than the pause sound contained in the short audio, and the influence of this factor on the slicing error rate can be expressed by introducing a weight into the formula.
The segmentation error rate has a threshold value epsilonEIf E isC<εEThe short audio segmentation may be considered to have met the requirements, otherwise the short audio needs to be corrected by adjusting the threshold or by manual correction. The threshold value epsilon may be different due to the difference in characteristics of the length, energy, etc. of the stop-consonants of different types of audioEIt is generally determined experimentally by selecting the same type of audio.
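The error-rate computation itself is a one-liner; the sketch below follows the formula above, with illustrative weights.

```python
def segmentation_error_rate(long_frames, short_frames, all_frames,
                            w_long=1.0, w_short=2.0):
    """Weighted segmentation error rate E_C; w_short > w_long reflects that
    truncating speech is considered more harmful than keeping a pause."""
    return (w_long * long_frames + w_short * short_frames) / all_frames

# usage: ok = segmentation_error_rate(120, 40, 90000) < 0.01  # eps_E chosen per audio type
```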
After this error correction, the short audio can be fed directly into the speech recognition system for recognition.
Step 14: The short audio is preprocessed and then passed through speech recognition to output sentence-level short texts;
specifically, for text data in epub, markdown and other formats, an open-source analysis tool is adopted to extract segment content in the text. The extracted text content may have interfering elements such as punctuation, special characters, etc. in format. For such interfering elements, regular expression methods may be used to process these elements. And obtaining the text information after interference removal through some set regular expressions according to the extracted segment content.
As for the speech recognition system itself, the related technology is very mature and recognizes well: there are open-source speech recognition tools such as CMU Sphinx, Kaldi, HTK and ASRT, as well as commercial platforms such as iFlytek speech recognition and the Baidu AI platform. The speech recognition in the invention uses an open-source tool to recognize the preprocessed short audio; each output short text carries the time interval of its short audio within the original long audio.
Step 2: paragraphs of the electronic book are extracted based on a Doc2Vec model to obtain paragraph-level texts;
in particular, the Doc2vec model is developed from a word2vec model, and the Doc2vec model expands the capability of calculating vector representation of long texts (sentences, paragraphs and the like) on the basis of predicting word vectors. This model can obtain a fixed length sentence vector and word vector, where the sentence vector stores context information missing from the topic or word vector of the current paragraph. There are also two training modes for the Doc2vec model: distributed Memory (DM) and Distributed Bag of Words (DBOW), corresponding to CBOW and Skip-gram in the word2vec model, respectively. From the results verified in the experiments of Tomas Mikolov, it can be seen that the paragraph vectors obtained by DM in most classification tasks perform better than DBOW, so the invention uses the DM model to perform the sentence vector calculation, as shown in fig. 3.
The idea of the DM model is to predict a word with the highest probability of occurrence in the current context by inputting a sentence vector and several word vectors in the sentence.
The training procedure of the model is as follows: a fixed-size window slides over the input sentence s_i; every time it moves to a position, the sentence vector of the input sentence and the context word vectors in the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m. Consistent with the CBOW training method of the word2vec model, the final purpose of DM training is likewise to obtain the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function. In this process, each prediction of the DM model uses the semantic information of the sentence s_i.
In the inference stage, for a new sentence, the trained DM model, the word vector matrix X and the parameters U, b are held fixed, and gradient descent is used to obtain the new sentence vector while updating the sentence vector matrix S.
The DM model requires that, given the context, the probability of the predicted word be maximized by updating the parameters, i.e. the average log-likelihood is maximized. The average log-likelihood is defined as

L = (1/C) ∑_{m=k}^{C−k} log p(x_m | s_i, x_{m−k}, ..., x_{m+k}),

where C denotes the total number of words, k the window width used in training, and s_i the sentence vector of the sentence containing the selected context words. The prediction itself is completed with a multi-class classifier such as the Softmax function; the conditional probability p(x_m | s_i, x_{m−k}, ..., x_{m+k}) is defined as

p(x_m | s_i, x_{m−k}, ..., x_{m+k}) = e^{y_m} / ∑_j e^{y_j},

where y_j denotes the unnormalized output value for word x_j. If h is the vector obtained by averaging or concatenating the row vectors extracted from the sentence vector matrix S_{V×N} and the word vector matrix X_{V×N}, then y is computed as

y = b + U h(s_i, x_{m−k}, ..., x_{m+k}; S, X).
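As an illustration of how DM-mode paragraph vectors can be trained and inferred in practice, the sketch below uses the gensim Doc2Vec implementation; the vector size, window and epoch counts are illustrative assumptions, not the settings of the invention.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_paragraph_vectors(paragraphs):
    """paragraphs: list of token lists, one per e-book paragraph (dm=1 selects the DM mode)."""
    docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(paragraphs)]
    return Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=2, epochs=40)

def sentence_vector(model, tokens):
    """Inference stage: word vectors and softmax parameters stay fixed,
    only the new sentence vector is fitted by gradient descent."""
    return model.infer_vector(tokens, epochs=50)
```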
and step 3: text similarity matching is performed between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment.
Specifically, after the book audio is segmented and recognized, sentence-level short texts are obtained, denoted SText, while the texts stored in the electronic book are all organized by paragraph, denoted PText. Since the practical application only requires alignment at the paragraph level between the audio and the e-book, computing the similarity of all texts would require splitting each PText into sentences and computing similarity with every SText one by one, which wastes a great deal of computation. The invention therefore designs a method that dynamically decides between full matching (PS-ALL) and head-to-tail matching (PS-First-Last) according to the ratio of the PText length to the mean SText length. The working principle is shown in fig. 4:
the scheme first calculates the character length D of all STextSAnd their average values DSThen, the paragraph texts in the electronic book are sequentially fetched and the length D thereof is calculatedP。DPAnd
Figure RE-GDA0003128884510000151
relative to each otherThe relationship determines whether the classifier will select PS-ALL or PS-First-Last. If D isP>αDSThe paragraph is considered longer and PS-First-Last is used, otherwise PS-ALL is used. Wherein α is a threshold decision coefficient, which can be dynamically adjusted to improve the decision accuracy when processing different types of electronic books.
The principle of PS-ALL is simple: it directly computes the text similarity from the vector representations of SText and PText, thereby completing the alignment. Concretely, a cosine similarity measure is used, which turns the similarity between texts into the cosine of the angle between their vectors: the smaller the angle, the higher the text similarity. The texts to be matched obtain their sentence vectors from the trained Doc2Vec model, and their similarity follows from the cosine formula. If X denotes a short text obtained from audio recognition, with vector V_X = (x_1, x_2, ..., x_N), and Y denotes a paragraph text from the e-book, with vector V_Y = (y_1, y_2, ..., y_N), the similarity of X and Y can be expressed as

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ).

The closer φ(X, Y) is to 1, the smaller the angle between the vectors and the higher the similarity of the two texts.
The principle of PS-First-Last is somewhat more involved: it first extracts from the paragraph a head segment and a tail segment, each of character length D̄_S, and then finds in turn the two short texts SText_first and SText_last with the highest similarity to the head and the tail of the paragraph respectively. These two short texts delimit a time interval in the book audio, so the e-book paragraph can be aligned with the corresponding audio passage. The working principle is shown in fig. 5:
Furthermore, a correction step is added at the end of the scheme, because misuse of PS-ALL may produce a matched audio interval that is too short. This step checks whether every paragraph ending time point obtained with PS-ALL is contiguous with the starting time point of the following paragraph; if not, the ending time point is extended to one second before the next paragraph's starting time so that the audio timeline is covered completely. The text-alignment proofreading scheme is shown in fig. 6.
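The sketch below ties the pieces together, choosing PS-ALL or PS-First-Last per paragraph and applying the end-time correction; it reuses cosine_similarity from the earlier sketch, measures lengths in tokens rather than characters for simplicity, assumes each SText carries (tokens, start, end) time stamps from recognition, and the value of alpha is an assumption.

```python
def align_paragraphs(stexts, ptexts, model, alpha=3.0):
    """stexts: list of (tokens, start, end); ptexts: list of paragraph token lists.
    Returns one (start, end) interval per paragraph (illustrative sketch)."""
    svecs = [model.infer_vector(t) for t, _, _ in stexts]
    d_avg = sum(len(t) for t, _, _ in stexts) / len(stexts)      # mean SText length
    intervals = []
    for tokens in ptexts:
        if len(tokens) > alpha * d_avg:                          # PS-First-Last: long paragraph
            head, tail = tokens[:int(d_avg)], tokens[-int(d_avg):]
            head_vec, tail_vec = model.infer_vector(head), model.infer_vector(tail)
            first = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], head_vec))
            last = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], tail_vec))
            intervals.append((stexts[first][1], stexts[last][2]))
        else:                                                    # PS-ALL: whole-paragraph match
            pvec = model.infer_vector(tokens)
            best = max(range(len(stexts)), key=lambda i: cosine_similarity(svecs[i], pvec))
            intervals.append((stexts[best][1], stexts[best][2]))
    # correction: extend non-contiguous endings to one second before the next start
    # (applied to every paragraph here for simplicity)
    for i in range(len(intervals) - 1):
        start, end = intervals[i]
        next_start = intervals[i + 1][0]
        if end < next_start:
            intervals[i] = (start, next_start - 1.0)
    return intervals
```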
The embodiment also discloses a Doc2Vec-based audio-text alignment system, as shown in fig. 7, comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment, and finally outputting the text with audio time stamps.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Accordingly, the invention is not to be limited to the embodiments shown herein.

Claims (9)

1. A Doc2Vec-based audio-text alignment method, characterized by comprising the following steps:
step 1: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
step 2: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
and step 3: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment;
the step 3 specifically comprises:
step 31: denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
step 32: comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with N-dimensional vector V_X = (x_1, x_2, ..., x_N) whose elements are x_i, and Y denotes a paragraph text, with N-dimensional vector V_Y = (y_1, y_2, ..., y_N) whose elements are y_i;
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph, thereby achieving text alignment.
2. The Doc2Vec-based audio-text alignment method according to claim 1, wherein the step 1 specifically comprises:
step 11: performing global search and crossover/mutation operations with the simulated annealing genetic algorithm to obtain the cluster centers;
step 12: based on the cluster centers, performing fuzzy C-means clustering of the characteristic parameters with cluster numbers of 1 and 2 respectively;
step 13: determining the optimal cluster number C by the Akaike information criterion, and determining the thresholds for double-threshold endpoint detection according to the optimal cluster number C to complete segmentation of the book-associated long audio;
step 14: preprocessing the short audio, performing speech recognition, and outputting sentence-level short texts.
3. The Doc2Vec-based audio-text alignment method according to claim 2, wherein the step 11 specifically comprises:
step 111: inputting a book-associated long audio, initializing the algorithm parameters, setting the generation index i = 0 and the initial temperature T_i of the annealing algorithm;
step 112: randomly generating a genetic-algorithm population C_i(T) representing the cluster centers of the audio sample points;
step 113: calculating the fitness F(C_i(T)) of every individual in the population C_i(T);
step 114: evolving the population C_i(T) with crossover and mutation operations to obtain a new population C_i'(T);
step 115: recalculating the fitness F(C_i'(T)) of the new population;
step 116: calculating the annealing increment ΔF = F(C_i'(T)) − F(C_i(T)); if ΔF > 0, the new population fitness has improved and C_i'(T) becomes the next generation; if ΔF ≤ 0, accepting C_i'(T) as the next generation with probability P = exp(ΔF / T_i); if it is finally not accepted, returning to step 114;
step 117: setting the new population as the next generation, i.e. C_{i+1}(T) = C_i'(T), and lowering the temperature as T_{i+1} = α T_i, where α denotes the annealing factor;
step 118: incrementing the generation index i = i + 1 and judging whether the obtained cluster centers have reached the global optimum; if so, outputting the optimized cluster centers of the audio sample points; otherwise, returning to step 114 to continue the evolution.
4. The Doc2Vec-based audio-text alignment method according to claim 3, wherein the step 12 specifically comprises:
step 121: initializing the membership matrix with random numbers between 0 and 1, subject to the constraint

∑_{j=1}^{C} u_ij = 1, i = 1, 2, ..., N,

where u_ij denotes the membership degree and C denotes the number of clusters;
step 122: calculating the objective function

F = ∑_{j=1}^{k} ∑_{i=1}^{N} u_ij² ‖x_i − m_j‖²,

where x_i denotes the data to be clustered, m_j denotes a cluster center, k denotes the number of clusters, and N denotes the number of data points to be clustered; if after the n-th iteration the membership error satisfies max_ij |u_ij^(n+1) − u_ij^(n)| < ε for the error threshold ε, the required state has been reached and the iteration stops; otherwise, going to step 123;
step 123: updating the membership matrix by recomputing the memberships subject to the constraint, the membership formula being

u_ij = 1 / ∑_{l=1}^{k} ( ‖x_i − m_j‖ / ‖x_i − m_l‖ )²,

where m_l enumerates the C cluster centers, and then returning to step 122 to iterate.
5. The Doc2Vec-based audio-text alignment method according to claim 4, wherein the step 13 specifically comprises:
step 131: assuming that both active speech and the background noise of pause segments follow a Gaussian distribution N(μ_i, Σ_i), where μ_i is the mean vector and Σ_i is the covariance matrix, calculating the AIC value for cluster number C as

AIC(C) = ∑_{i=1}^{C} N_i ln|Σ_i| + ε_d · C · (v + v(v+1)/2),

where N_i is the number of data points in the i-th cluster, v is the dimension of the feature space, and ε_d is a penalty factor;
step 132: determining the high and low thresholds of the characteristic parameters from the cluster centers corresponding to the optimal cluster number.
6. The Doc2Vec-based audio-text alignment method according to claim 1, wherein the step 2 specifically comprises:
step 21: DM model training phase: a fixed-size window slides over the input sentence s_i; at each position the sentence vector of the input sentence and the context word vectors in the window, x_{m−k}, ..., x_{m+k}, are used to predict the target word x_m, yielding the sentence vector matrix S_{V×N}, the word vector matrix X_{V×N}, and the parameters U, b required by the Softmax function;
step 22: DM model inference phase: with the trained model's word vector matrix and parameters U, b held fixed, a gradient descent method is used to obtain each new sentence vector and the sentence vector matrix is updated.
7. The Doc2Vec-based audio-text alignment method of claim 1, further comprising checking whether the paragraph ending time point obtained by PS-ALL matching is contiguous with the starting time point of the next paragraph, and if not, extending the ending time point to one second before the next paragraph's starting time point.
8. The Doc2Vec-based audio-text alignment method according to claim 2, wherein the step 13 further comprises:
characterizing the segmentation quality through a segmentation error rate and using it to guide correction of the algorithm:
the segmentation error rate E_C can be expressed as

E_C = (W_L · L_frame + W_S · S_frame) / ALL_frame,

where L_frame denotes the number of audio frames affected by cut-too-long errors, S_frame denotes the number of audio frames affected by cut-too-short errors, ALL_frame denotes the total number of segmented audio frames, and W_L and W_S denote the weights of cut-too-long and cut-too-short errors respectively;
if E_C < ε_E, the audio segmentation meets the requirement; otherwise it needs to be corrected by adjusting the thresholds or by manual correction, where ε_E is a preset threshold.
9. A Doc2Vec-based audio-text alignment system, characterized by comprising:
an audio segmentation and recognition module: performing threshold estimation with an AIC-FCM model optimized by a simulated annealing genetic algorithm, segmenting the book-associated long audio into sentence-level short audio clips, performing speech recognition on the short audio, and outputting sentence-level short texts;
a text paragraph extraction module: extracting paragraphs of the electronic book based on a Doc2Vec model to obtain paragraph-level texts;
an alignment module: performing text similarity matching between the short texts and the paragraph texts with a dynamic matching method based on threshold prediction to complete text alignment, the specific process being:
denoting the short texts as SText and the paragraph texts as PText, calculating the character length D_S of every SText and the average character length D̄_S, then taking the paragraph texts in turn and calculating the length D_P of each;
comparing D_P with α·D̄_S; if D_P > α·D̄_S, the paragraph is considered long and the head-to-tail matching mode PS-First-Last is used, otherwise the full matching mode PS-ALL is used, where α is a threshold decision coefficient;
the full matching mode PS-ALL specifically comprises: calculating the text similarity from the vector representations of SText and PText with the formula

φ(X, Y) = (∑_{i=1}^{N} x_i y_i) / ( sqrt(∑_{i=1}^{N} x_i²) · sqrt(∑_{i=1}^{N} y_i²) ),

where X denotes a short text obtained from recognition of the book audio, with N-dimensional vector V_X = (x_1, x_2, ..., x_N) whose elements are x_i, and Y denotes a paragraph text, with N-dimensional vector V_Y = (y_1, y_2, ..., y_N) whose elements are y_i;
the head-to-tail matching mode PS-First-Last specifically comprises: extracting from the paragraph text a head segment and a tail segment of character length D̄_S, then finding in turn the two short texts SText_first and SText_last with the highest similarity to the head and tail of the paragraph, thereby achieving text alignment.
CN202110438831.7A 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec Active CN113191133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110438831.7A CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110438831.7A CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Publications (2)

Publication Number Publication Date
CN113191133A CN113191133A (en) 2021-07-30
CN113191133B (en) 2021-12-21

Family

ID=76978588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110438831.7A Active CN113191133B (en) 2021-04-21 2021-04-21 Audio text alignment method and system based on Doc2Vec

Country Status (1)

Country Link
CN (1) CN113191133B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114630238B (en) * 2022-03-15 2024-05-17 广州宏牌音响有限公司 Stage sound box volume control method and device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
CN111398832A (en) * 2020-03-19 2020-07-10 哈尔滨工程大学 Bus battery SOC prediction method based on ANFIS model
CN111459446A (en) * 2020-03-27 2020-07-28 掌阅科技股份有限公司 Resource processing method of electronic book, computing equipment and computer storage medium
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710333B (en) * 2009-11-26 2012-07-04 西北工业大学 Network text segmenting method based on genetic algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
CN111398832A (en) * 2020-03-19 2020-07-10 哈尔滨工程大学 Bus battery SOC prediction method based on ANFIS model
CN111459446A (en) * 2020-03-27 2020-07-28 掌阅科技股份有限公司 Resource processing method of electronic book, computing equipment and computer storage medium
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Paragraph vectorization - the doc2vec model" (model notes series), Shuibi Xiaoxin, https://zhuanlan.zhihu.com/p/138909653, 2020-05-08, full text *
"An improved fuzzy C-means clustering algorithm", Xu Yiping et al., Journal of Xuzhou Institute of Technology, vol. 23, no. 4, April 2008, pp. 34-36 *

Also Published As

Publication number Publication date
CN113191133A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
US8990084B2 (en) Method of active learning for automatic speech recognition
JP4571822B2 (en) Language model discrimination training for text and speech classification
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
EP0763816A2 (en) Discriminative utterance verification for connected digits recognition
CN111640418B (en) Prosodic phrase identification method and device and electronic equipment
US20070067171A1 (en) Updating hidden conditional random field model parameters after processing individual training samples
US20090055182A1 (en) Discriminative Training of Hidden Markov Models for Continuous Speech Recognition
CN113191133B (en) Audio text alignment method and system based on Doc2Vec
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
CN115617955B (en) Hierarchical prediction model training method, punctuation symbol recovery method and device
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
CN111986650B (en) Method and system for assisting voice evaluation by means of language identification
JPH0250198A (en) Voice recognizing system
CN116189671B (en) Data mining method and system for language teaching
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
JP5288378B2 (en) Acoustic model speaker adaptation apparatus and computer program therefor
JP3920749B2 (en) Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model
JP2938866B1 (en) Statistical language model generation device and speech recognition device
CN114927144A (en) Voice emotion recognition method based on attention mechanism and multi-task learning
CN114898776A (en) Voice emotion recognition method of multi-scale feature combined multi-task CNN decision tree
Chen et al. Research on Chinese audio and text alignment algorithm based on AIC-FCM and Doc2Vec
JPH10254477A (en) Phonemic boundary detector and speech recognition device
Xu et al. A Novel Information Integration Algorithm for Speech Recognition System: Basing on Adaptive Clustering and Supervised State of Acoustic Feature
Chen et al. A Novel Information Integration Algorithm for Speech Recognition System: Basing on Adaptive Clustering and Supervised State of Acoustic Feature
CN116825091A (en) Fake identification analysis system with text content combing advantage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant