CN105551485B - Voice file retrieval method and system - Google Patents

Voice file retrieval method and system

Info

Publication number
CN105551485B
CN105551485B (application CN201510882391.9A)
Authority
CN
China
Prior art keywords
word
voice
file
text file
voice file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510882391.9A
Other languages
Chinese (zh)
Other versions
CN105551485A (en)
Inventor
王建社
柳林
冯翔
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Information Technology Co Ltd
Original Assignee
Iflytek Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Information Technology Co Ltd filed Critical Iflytek Information Technology Co Ltd
Priority to CN201510882391.9A priority Critical patent/CN105551485B/en
Publication of CN105551485A publication Critical patent/CN105551485A/en
Application granted granted Critical
Publication of CN105551485B publication Critical patent/CN105551485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/3332 - Query translation
    • G06F 16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Abstract

The invention discloses a voice file retrieval method and system. The method includes: training a user interest model corresponding to the search keywords; acquiring each voice file to be retrieved; performing voice transcription on the voice file to obtain a transcription result; obtaining, from the transcription result, a text file corresponding to the voice file and multi-knowledge-source features for each word in the text file; re-estimating the confidence of each word using the multi-knowledge-source features and filtering out meaningless words and sentences in the text file; calculating the relevance of each text file to the user interest model according to the confidence re-estimation results; and displaying the retrieved voice files according to relevance. The invention improves both the efficiency and the accuracy of voice file retrieval.

Description

Voice file retrieval method and system
Technical Field
The invention relates to the field of voice signal processing, in particular to a voice file retrieval method and a voice file retrieval system.
Background
With the continuous development of speech processing technology, practitioners in more and more applications try to extract the required information from speech data, for example retrieving the voice files needed for a specific application scenario from massive speech data. Traditional approaches to retrieving useful files from a large number of voice files fall into two categories:
The first is manual monitoring of voice files to find highly relevant useful files; this consumes a great deal of manpower and material resources and is inefficient.
The second is to transcribe the voice files into text files and then search the text. Because transcription accuracy cannot be well guaranteed under complex noise environments, far-field conditions, and similar factors, the transcription results generally need to be checked manually to guarantee retrieval accuracy, which again entails high labor cost and low efficiency.
Disclosure of Invention
The invention provides a voice file retrieval method and system to solve the problems of low efficiency and poor accuracy caused by voice transcription errors in conventional voice file retrieval.
Therefore, the invention provides the following technical scheme:
a method of voice file retrieval comprising:
training a user interest model corresponding to the retrieval keywords;
acquiring each voice file to be retrieved;
performing voice transcription on the voice file to obtain a transcription result;
obtaining a text file corresponding to the voice file and multi-knowledge-source characteristics of each word in the text file according to the transcription result;
performing confidence reevaluation on each word by using the multi-knowledge-source characteristics, and filtering out meaningless words and sentences in the text file;
calculating the correlation degree of each text file and the user interest model according to the confidence coefficient reestimation result;
and displaying the retrieved voice file information according to the relevance.
Preferably, the search keywords are one or more keywords input by a user at search time, or one or more keywords collected in advance from corpora of specific scenarios.
Preferably, the training of the user interest model corresponding to the search keyword includes:
collecting the corpus containing the search keywords;
calculating word vectors of all words in the corpus;
and training a regression model by using the word vector, and taking the regression model as a user interest model.
Preferably, the transcription result is in a word-level confusion network format, and the time position, acoustic model score, language model score, and original confidence of each word in the voice file are stored in the confusion network;
the multi-knowledge-source features include at least two of the following features: a word posterior probability; poor posterior probability of competing words; scoring a language model; the frame average acoustic model score.
Preferably, the method further comprises:
segmenting each word in the confusion network to obtain phoneme information corresponding to the word;
the multi-knowledge-source features further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy.
Preferably, the performing confidence re-estimation on each word in the text file includes:
generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge-source features;
and calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
Preferably, the calculating the relevance of each text file to the user interest model according to the confidence reevaluation result includes:
for each text file, calculating a word vector of each word in the text file;
taking the confidence re-estimation result of each word as the weight of the word, and performing a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;
and calculating the correlation degree of the text file and the user interest model according to the vector of the text file.
Preferably, the displaying the retrieved voice file information according to the relevancy includes:
sequentially displaying the voice file information whose relevance is greater than a set threshold, in descending order of relevance; or
sequentially displaying a set number of voice file information items in descending order of relevance.
Preferably, the method further comprises:
setting correlation threshold values aiming at different importance levels;
determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
and when the voice file information is displayed, displaying the importance level information of the voice file.
A voice document retrieval system comprising:
the model training module is used for training a user interest model corresponding to the retrieval key words;
the voice file acquisition module is used for acquiring each voice file to be retrieved;
the voice transcription module is used for carrying out voice transcription on the voice file to obtain a transcription result;
the text file generating module is used for obtaining a text file corresponding to the voice file according to the transcription result;
the characteristic acquisition module is used for acquiring multi-knowledge-source characteristics of each word in the text file;
the confidence coefficient reestimation module is used for reestimating the confidence coefficient of each word by utilizing the multi-knowledge-source characteristics;
the filtering module is used for filtering meaningless words and sentences in the text file;
the relevancy calculation module is used for calculating the relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module is used for displaying the retrieved voice file information according to the relevancy.
Preferably, the model training module comprises:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
Preferably, the transcription result is in a word-level confusion network format, and the time position, acoustic model score, language model score, and original confidence of each word in the voice file are stored in the confusion network; the multi-knowledge-source features include at least two of the following: word posterior probability; competing-word posterior probability difference; language model score; frame-averaged acoustic model score;
the confidence reevaluation module includes:
the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features;
and the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by utilizing the pre-trained regression model and the multi-dimensional feature vector of each word.
Preferably, the correlation calculation module includes:
the word vector calculation unit is used for calculating the word vector of each word in each text file;
the document vector calculation unit is used for taking the confidence re-estimation result of each word as the word's weight and performing a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;
and the relevancy calculation unit is used for calculating the relevancy of the text file and the user interest model according to the vector of the text file.
Preferably, the presentation module is specifically configured to present, in descending order of relevance, the voice files whose relevance exceeds the set threshold, or to present a set number of voice files in descending order of relevance.
Preferably, the system further comprises:
the setting module is used for setting correlation threshold values aiming at different importance levels;
the level determining module is used for determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
the display module is further used for displaying the importance level information of the voice file when displaying the voice file information.

In view of the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the method and system of the embodiments of the invention extract multi-knowledge-source features for each word in the transcribed text file, re-estimate the confidence of each word using these features, filter out meaningless words and sentences in the text file, calculate the relevance between each text file and the user interest model according to the confidence re-estimation results, and display the retrieved voice files according to relevance, thereby effectively reducing the influence of transcription errors on file ranking. The voice file retrieval method and system not only greatly improve the efficiency of voice file retrieval but also ensure the accuracy of the retrieval results.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below cover only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for voice file retrieval in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an architecture of a voice document retrieval system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure of a correlation calculation module according to an embodiment of the present invention;
fig. 4 is another structural diagram of the voice file retrieval system according to the embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a voice file retrieval method according to an embodiment of the present invention, which includes the following steps:
Step 101: train a user interest model corresponding to the search keywords.
It should be noted that the search keywords may be one or more keywords input by the user at search time, or one or more keywords collected in advance from corpora of specific scenarios; the embodiment of the present invention does not limit this.
The user interest model may be a regression model, such as an SVM (Support Vector Machine) or RNN (Recurrent Neural Network) model. When training the regression model, word vector representations of the search keywords can be computed with an existing Word Embedding technique, and the regression model is trained dynamically in combination with word vectors from the text to be searched that are unrelated to the search keywords, yielding the final user interest model. Specifically, corpora containing the search keywords may be taken as positive samples from a large pre-prepared corpus and from the transcripts of the speech to be searched, while corpora unrelated to the search keywords are randomly sampled as negative samples; the sample corpora are converted into word vectors using Word Embedding, and the regression model is trained on the positive and negative word vectors.
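A minimal sketch of this training step, assuming toy two-dimensional word vectors and a simple perceptron-trained linear model standing in for the SVM/RNN and Word Embedding components named above; all names and values here are illustrative, not the patent's actual implementation:

```python
import random

def train_interest_model(pos_vectors, neg_vectors, epochs=50, lr=0.1):
    """Train a linear classifier (a stand-in for the regression model)
    on word vectors of positive corpora (containing the search keyword)
    and negative corpora (unrelated), as the text describes."""
    dim = len(pos_vectors[0])
    w, b = [0.0] * dim, 0.0
    data = [(v, 1.0) for v in pos_vectors] + [(v, -1.0) for v in neg_vectors]
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # perceptron update on misclassified samples
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# hypothetical word vectors: positives cluster right, negatives left
pos = [[1.0, 0.2], [0.9, 0.1]]
neg = [[-0.8, -0.1], [-1.0, 0.0]]
w, b = train_interest_model(pos, neg)
```

In practice the positive/negative vectors would come from a trained embedding over the keyword-bearing and unrelated corpora; the perceptron merely illustrates the linear decision boundary the patent's SVM would learn.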
Step 102: acquire each voice file to be retrieved.
Step 103: perform voice transcription on each voice file to obtain a transcription result.
Specifically, a large-scale voice transcription system may be used to transcribe the voice file.
In the embodiment of the invention, the transcription result takes the form of a word-level confusion network, which contains not only the best candidate word at each position but also several competing candidate words. The confusion network stores, for each word in the voice file, its time position, acoustic model score, language model score, and original confidence, which facilitates the subsequent extraction of multi-knowledge-source features. The original confidence can be computed from the posterior probability of each word.
It should be noted that, in practical applications, the maximum number of competing candidate words retained at each position may be set, for example, to 15. The competing candidates may be selected in descending order of original confidence up to that number, or by keeping all candidates whose confidence exceeds a set threshold. Furthermore, the original confidences of all competing candidates at the same position sum to 1.
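The candidate-retention rule described above can be sketched as follows; the list-of-(word, confidence) data layout is an assumption, while the top-N cap and the renormalization to a sum of 1 follow the text:

```python
def prune_slot(candidates, max_keep=15, conf_threshold=None):
    """Prune the competing candidate words at one confusion-network
    position, either keeping the top-N by original confidence or
    keeping those above a confidence threshold, then renormalize so
    the kept confidences sum to 1."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if conf_threshold is not None:
        kept = [c for c in ranked if c[1] > conf_threshold]
    else:
        kept = ranked[:max_keep]
    total = sum(conf for _, conf in kept)
    return [(word, conf / total) for word, conf in kept]

# hypothetical slot with four competing candidates
slot = [("cat", 0.6), ("cap", 0.25), ("cut", 0.1), ("cot", 0.05)]
pruned = prune_slot(slot, max_keep=2)
```

After pruning to two candidates, their confidences (0.6 and 0.25) are rescaled by their sum so the slot again forms a probability distribution.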
Step 104: obtain, from the transcription result, a text file corresponding to the voice file and the multi-knowledge-source features of each word in the text file.
Specifically, the text file corresponding to the voice file can be obtained by decoding the confusion network.
The multi-knowledge-source features may include at least two of the following: word posterior probability; competing-word posterior probability difference; language model score; frame-averaged acoustic model score. To make the subsequent confidence re-estimation more accurate, the extracted features may further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy.
These features are described below:
(1) Word posterior probability: the posterior probability of the current word;
(2) Competing-word posterior probability difference: the difference between the posterior probabilities of the two best candidate words between two adjacent nodes in the confusion network;
(3) Language model score: the N-Gram language model score of the current word;
(4) Frame-averaged acoustic model score: the acoustic model score of the current word divided by the total number of feature frames of the word. For example, if acoustic features (e.g., MFCCs) are extracted with a 10-millisecond frame shift, 1 second of speech yields about 100 frames; a word such as "iFlytek" lasting 0.7 seconds in the voice file thus corresponds to 70 frames;
(5) Phoneme posterior probability: the average posterior probability of the phonemes corresponding to the current word;
(6) State frame variance: the variance of the per-state feature-frame counts corresponding to the current word;
(7) Word position coefficient: the position i of the current word in the sentence divided by the total number of words N in that sentence;
(8) Word length: the number of characters in the current word;
(9) Whether the current word is a stop word;
(10) Duration: how long the current word lasts;
(11) Number of competing words: the total number of words between two adjacent nodes in the confusion network;
(12) Short-time average energy: the short-time average energy of the segment of the voice file corresponding to the current word.
It should be noted that state-level information (the state is the minimum modeling unit of speech; a word generally contains multiple phonemes, and each phoneme contains multiple states) can be obtained by forced alignment (FA) of each word in the confusion network, which yields the posterior probability of each state; the posterior probability of a phoneme is then the average of the posterior probabilities of all its states.
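The state-to-phoneme averaging in this note can be sketched as follows; the nested-list alignment format is a hypothetical stand-in for the output of FA forced segmentation:

```python
def phoneme_posterior(state_posteriors):
    """Posterior of a phoneme = average of the posteriors of all
    states within the phoneme, per the forced-alignment note."""
    return sum(state_posteriors) / len(state_posteriors)

# hypothetical word with two phonemes, each a list of state posteriors
word_phonemes = [[0.9, 0.8, 0.85], [0.7, 0.75]]
per_phoneme = [phoneme_posterior(s) for s in word_phonemes]
# word-level phoneme posterior feature = average over the phonemes
word_ppp = sum(per_phoneme) / len(per_phoneme)
```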
Step 105: re-estimate the confidence of each word using the multi-knowledge-source features, and filter out meaningless words and sentences in the text file.
Specifically, a multi-dimensional feature vector may be generated for each word from the multi-knowledge-source features, and the confidence of each word computed with a pre-trained regression model (an SVM model is used as the example below) and the word's feature vector.
The following describes the process of generating a multi-dimensional feature vector (here, 18 dimensions) for each word from two knowledge-source features: the word posterior probability and the competing-word posterior probability difference.
For convenience, the features are listed in subscript order within the vector:
1) Dimensions 1 to 9: word posterior probabilities WPP(i-1), WPP²(i-1), WPP³(i-1), WPP(i), WPP²(i), WPP³(i), WPP(i+1), WPP²(i+1), WPP³(i+1), where i is the position of the current word in the sentence. The posterior probability WPP(i) of word i is defined as:

WPP(i) = α_t(i)·β_t(i) / Σ_{j∈Ω} α_t(j)·β_t(j)    (1)

p(i) = p_ac(i)·p_lm(i)    (2)

where α_t(i) denotes the forward probability of word i at time t and β_t(i) its backward probability, both computed with the standard forward-backward algorithm using the word score p(i); Ω denotes the set of all candidate words appearing at time t; p_ac(i) is the acoustic model score of word i and p_lm(i) its language model score.
2) Dimensions 10 to 18: competing-word posterior probability differences

ΔWPP(i) = WPP_onebest(i) - WPP_twobest(i)

giving the nine features ΔWPP(i-1), ΔWPP²(i-1), ΔWPP³(i-1), ΔWPP(i), ΔWPP²(i), ΔWPP³(i), ΔWPP(i+1), ΔWPP²(i+1), ΔWPP³(i+1), where i is the position of the current word in the sentence and the subscripts onebest and twobest denote the first-best and second-best candidates.
The score S_word of the multi-knowledge-source feature vector is then computed with an SVM model trained in advance:

S_word = w_1·x + b_1    (3)

where w_1 is the normal vector of the SVM classification plane, x is the input multi-knowledge-source feature vector, and b_1 is a bias constant; w_1 and b_1 are trained in advance on positive and negative word examples.
Since a standard SVM classifier does not output its result as a probability, and the embodiment of the present invention needs a probability-style confidence for each word, the SVM output must be transformed into a probabilistic score. An existing method can be used; one option is a sigmoid transformation of the SVM output:

WPP_word = 1 / (1 + exp(A·S_word + B))    (4)

where WPP_word is the re-estimated word confidence, and the transformation parameters A and B are obtained by training under a maximum-likelihood criterion.
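Formulas (3) and (4) can be sketched together as follows; the weight vector, bias, and transformation parameters are illustrative placeholders, since in the patent they are trained from word examples and by maximum likelihood:

```python
import math

def svm_score(w1, x, b1):
    """Raw SVM score S_word = w1 . x + b1, as in formula (3)."""
    return sum(wi * xi for wi, xi in zip(w1, x)) + b1

def sigmoid_confidence(s_word, A=-1.0, B=0.0):
    """Map the raw SVM score to a probability-style confidence via the
    sigmoid transform of formula (4); with A < 0, a larger score yields
    a larger confidence. A and B here are placeholder values."""
    return 1.0 / (1.0 + math.exp(A * s_word + B))

s = svm_score([0.5, -0.2], [1.0, 2.0], 0.1)  # 0.5 - 0.4 + 0.1 = 0.2
conf = sigmoid_confidence(s)
```

This is the standard Platt-style calibration of a margin score into (0, 1); the sign convention for A simply fixes whether large margins map to high or low confidence.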
Next, taking all 12 features above as an example, the process of generating a multi-dimensional feature vector (here, 32 dimensions) for each word is described.
For convenience, the features are listed in subscript order within the vector:
1) Dimensions 1 to 9: word posterior probabilities WPP(i-1), WPP²(i-1), WPP³(i-1), WPP(i), WPP²(i), WPP³(i), WPP(i+1), WPP²(i+1), WPP³(i+1), where i is the position of the current word in the sentence; the definition of WPP(i) is given in formulas (1) and (2) above.
2) Dimensions 10 to 18: competing-word posterior probability differences ΔWPP(i-1), ΔWPP²(i-1), ΔWPP³(i-1), ΔWPP(i), ΔWPP²(i), ΔWPP³(i), ΔWPP(i+1), ΔWPP²(i+1), ΔWPP³(i+1), with ΔWPP(i) = WPP_onebest(i) - WPP_twobest(i), where the subscripts onebest and twobest denote the first-best and second-best candidates.
3) Dimensions 19 to 21: N-Gram language model scores P_lm(i-1), P_lm(i), P_lm(i+1);
4) Dimensions 22 to 24: frame-averaged acoustic model scores P_ac(i-1)/N_{i-1}, P_ac(i)/N_i, P_ac(i+1)/N_{i+1}, where N_i denotes the number of speech frames corresponding to word i;
5) Dimension 25: phoneme posterior probability PPP_i of the word:

PPP_i = (1/N_phone) Σ_{j=1}^{N_phone} PPP(ph_j)    (5)

PPP(ph_j) = (1/(t_e - t_s + 1)) Σ_{t=t_s}^{t_e} p(ph_j | O_t)    (6)

The acoustic distribution of the phonemes is modeled with a deep neural network (e.g., an RNN) whose input is the acoustic features and whose output is the phoneme posterior probabilities; M denotes the output dimension of the network. For Chinese, M covers 40 monophones plus sil (silence) and sp (inter-word pause), 42 outputs in total. N_phone denotes the total number of phonemes corresponding to word i; p(ph_j | O_t) is the posterior probability that the phoneme is j when the current speech frame is O_t; t_s and t_e denote the start and end frames of the current phoneme to be re-estimated (obtained during the speech transcription process), while the start and end frames of the s-th state within the phoneme are obtained by state-level segmentation of the word.
6) Dimension 26: state frame variance σ_sframe:

σ_sframe = (1/N_s) Σ_{s=1}^{N_s} (F_s - μ_sframe)²    (7)

μ_sframe = (1/N_s) Σ_{s=1}^{N_s} F_s    (8)

where N_s denotes the number of states corresponding to the current word, F_s denotes the number of frames assigned to the s-th state of the current word (obtained by state-level segmentation of the word), and μ_sframe is the mean frame count over the N_s states.
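The state-frame-variance feature can be sketched directly from the per-state frame counts; the counts below are illustrative:

```python
def state_frame_variance(frame_counts):
    """Population variance of the per-state frame counts F_s of a word,
    matching the state-frame-variance feature."""
    n = len(frame_counts)
    mu = sum(frame_counts) / n           # mean frame count over states
    return sum((f - mu) ** 2 for f in frame_counts) / n

# hypothetical word with three states holding 10, 12, and 8 frames
var = state_frame_variance([10, 12, 8])
```

Intuitively, a very uneven spread of frames across a word's states (large variance) often signals a poor alignment, which is why it serves as a confidence feature.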
7) Dimension 27: word position coefficient i_loc/N_w, where i_loc denotes the position index of the current word in the sentence and N_w the total number of words in the sentence;
8) Dimension 28: word length, i.e., the number of characters in the current word;
9) Dimension 29: whether the current word is a stop word (1 if yes, 0 otherwise);
10) Dimension 30: the duration of the current word, in seconds;
11) Dimension 31: the total number of competing words for the current word, i.e., the number of arcs between the two adjacent nodes in the confusion network;
12) Dimension 32: the short-time average energy of the segment of the speech file corresponding to the current word.
The process of re-estimating the confidence of each word with the 32-dimensional feature vector generated from the multi-knowledge-source features follows the description of formulas (3) and (4) above and is not repeated here.
To filter out the meaningless words and sentences in the text file, the transcribed text may be parsed with dependency parsing, the parsing result converted into word vectors (e.g., one-hot vectors), the words in the transcript classified using these vectors as features together with a classifier (e.g., an SVM), and meaningless words (e.g., filler words) and sentences filtered out according to the classification result.
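A heavily simplified stand-in for this filtering step: the patent uses dependency parsing plus an SVM classifier, while the sketch below approximates the same idea with a hypothetical stop-list of filler words:

```python
# hypothetical filler-word list; the patent instead classifies words
# with dependency-parse features and an SVM
FILLERS = {"uh", "um", "er", "ah"}

def filter_meaningless(words):
    """Drop filler/meaningless tokens from a transcribed sentence."""
    return [w for w in words if w.lower() not in FILLERS]

cleaned = filter_meaningless(["Um", "retrieve", "the", "voice", "file"])
```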
It should be noted that confidence re-estimation and the filtering of meaningless words and sentences can be performed in either order: the confidence of each word may be re-estimated first and the text file filtered afterwards, or the filtering may be performed first and the re-estimation afterwards.
Step 106: calculate the relevance of each text file to the user interest model according to the confidence re-estimation results.
First, for each filtered text file, an existing Word Embedding technique is used to compute the word vector of each word, denoted V.
Then, taking the re-estimated confidence of each word as its weight, a weighted average of the word vectors of all words appearing in the text file gives the vector of the text file:
V_doc = (1/W) Σ_{i=1}^{N_word} WPP_i·V_i    (9)

W = Σ_{i=1}^{N_word} WPP_i    (10)

where N_word is the total number of words in the filtered text file, WPP_i denotes the confidence of the i-th word, V_i denotes the word vector of the i-th word, and V_doc denotes the vector of the filtered document.
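The confidence-weighted document vector can be sketched as follows, using toy two-dimensional word vectors; the computation matches the weighted average described above:

```python
def document_vector(word_vectors, confidences):
    """Confidence-weighted average of word vectors:
    V_doc = sum(WPP_i * V_i) / sum(WPP_i)."""
    total = sum(confidences)
    dim = len(word_vectors[0])
    doc = [0.0] * dim
    for vec, wpp in zip(word_vectors, confidences):
        for d in range(dim):
            doc[d] += wpp * vec[d]
    return [v / total for v in doc]

# two words: a high-confidence word dominates the document vector
vecs = [[1.0, 0.0], [0.0, 1.0]]
doc = document_vector(vecs, [0.9, 0.1])
```

Weighting by re-estimated confidence is what reduces the influence of likely transcription errors on the document representation, and hence on ranking.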
Finally, the relevance between the current text file and the user interest model (an SVM model, for example) is calculated:

S_doc = w_2·V_doc + b_2    (11)

where w_2 is the normal vector of the SVM classification plane and b_2 is a bias constant, both trained from a large amount of training data.
Furthermore, the relevance scores output by the SVM can be normalized so that the retrieved files can be ranked more intuitively.
And 107, displaying the retrieved voice file information according to the relevance.
Specifically, the voice file information whose relevance exceeds a set threshold may be displayed in descending order of relevance; alternatively, a set number of voice files may be displayed in descending order of relevance.
In addition, the file relevance scores may be divided by thresholds into different levels to obtain the importance level of each original voice file, for example high, medium, and low; the displayed voice file information is then presented to the user together with its level information.
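The display strategy and the level mapping just described can be sketched as follows; the keep-threshold and the level boundaries are illustrative values, not parameters from the embodiment.

```python
# Sort retrieved files by relevance (descending), keep those above a threshold,
# and attach a hypothetical high/medium/low importance level to each.
def rank_and_level(scores, keep_threshold=0.5,
                   level_bounds=((0.8, "high"), (0.6, "medium"))):
    kept = sorted(((name, s) for name, s in scores.items() if s > keep_threshold),
                  key=lambda pair: pair[1], reverse=True)
    ranked = []
    for name, score in kept:
        # First level whose lower bound the score reaches; otherwise "low".
        level = next((label for bound, label in level_bounds if score >= bound),
                     "low")
        ranked.append((name, score, level))
    return ranked

scores = {"call_a.wav": 0.92, "call_b.wav": 0.65, "call_c.wav": 0.40}
print(rank_and_level(scores))
```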
It should be noted that the displayed voice file information may include the subject name, abstract, link, and similar information of the voice file, which is not limited in this embodiment of the present invention.
To address the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the embodiment extracts multi-knowledge-source features of each word in the transcribed text file, re-estimates the confidence of each word using these features, filters meaningless words and sentences out of the text file, and calculates the relevance between each text file and the user interest model according to the confidence re-estimation result; the retrieved voice files are then displayed according to the relevance, which effectively reduces the influence of transcription errors on file ranking. The voice file ranking method of the embodiment of the present invention therefore not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval result.
Correspondingly, an embodiment of the present invention further provides a voice file retrieval system, as shown in fig. 2, which is a schematic structural diagram of the system.
In this embodiment, the system includes:
a model training module 201, configured to train a user interest model corresponding to the search keyword;
a voice file obtaining module 202, configured to obtain each voice file to be retrieved;
the voice transcription module 203 is used for performing voice transcription on the voice file to obtain a transcription result;
a text file generating module 204, configured to obtain a text file corresponding to the voice file according to the transcription result;
a feature obtaining module 205, configured to obtain multiple knowledge source features of each word in the text file;
a confidence reevaluation module 206, configured to reevaluate confidence of each word in the text file by using the multiple knowledge source features;
the filtering module 207 is used for filtering meaningless words and sentences in the text file;
a relevancy calculation module 208, configured to calculate a relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module 209 is configured to display the retrieved voice file information according to the relevance.
It should be noted that, in practical application, the search keyword may be one or more search keywords input by a user during search, or one or more search keywords collected from some specific scenario corpora in advance, which is not limited in this embodiment of the present invention.
The user interest model may be a regression model. When training it, the model training module 201 may use an existing word embedding technique to compute word-vector representations of the search keywords, and dynamically train the regression model in combination with the word vectors of words in the text to be searched that are unrelated to the search keywords; the trained model serves as the final user interest model. Accordingly, one specific structure of the model training module 201 may include the following units:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
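As a rough illustration of the three units above, a toy "interest model" can be trained on word vectors: vectors of keyword-related words are labeled 1 and unrelated words 0. The 2-dimensional vectors, the labels, and the use of a simple logistic-regression loop in place of a production regressor are all assumptions made for this sketch.

```python
import math

# Toy training loop standing in for the training unit: fit a logistic
# regression on labeled word vectors and use it as the interest model.
def train_interest_model(vectors, labels, lr=0.5, epochs=200):
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def score(model, vec):
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative 2-d "word vectors": first two related to the keyword, last two not.
X = [[1.0, 0.9], [0.9, 1.0], [-1.0, -0.8], [-0.9, -1.0]]
y = [1, 1, 0, 0]
model = train_interest_model(X, y)
print(score(model, [1.0, 1.0]), score(model, [-1.0, -1.0]))
```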
In the embodiment of the present invention, the transcription result is in a word-level confusion network format, which contains not only the best candidate word but also a plurality of competing candidate words. The confusion network stores the time position, acoustic model score, language model score, and original confidence of each word in the voice file. In addition, the multi-knowledge-source features include at least two of the following: the word posterior probability; the posterior probability difference of competing words; the language model score; the frame-averaged acoustic model score. Of course, to make the subsequent confidence re-estimation more accurate, the multi-knowledge-source features may further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; the word position coefficient; the word length; whether the word is a stop word; the duration; the number of competing words; the short-time average energy; and so on. These features have been described in detail above and are not repeated here.
Accordingly, the confidence re-estimation module 206 may generate a set of multidimensional feature vectors for each word using the multi-knowledge source features, and then calculate the confidence of each word using a pre-trained regression model (hereinafter, an SVM model is used as an example) and the multidimensional feature vectors of each word. One specific structure of the confidence reevaluation module 206 can include: the system comprises a multi-dimensional feature vector generating unit and a confidence coefficient calculating unit, wherein the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features; the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
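Putting the two units of module 206 together, a hedged sketch: each word's multi-knowledge-source features are packed into one feature vector, which a pre-trained regressor maps to a new confidence. A plain linear function with clamping stands in for the trained SVM regressor; the feature names and weights are illustrative assumptions.

```python
# Pack multi-knowledge-source features into one vector per word, then map it
# to a re-estimated confidence with a stand-in linear "regression model".
FEATURES = ("word_posterior", "competitor_posterior_diff",
            "language_model_score", "frame_avg_acoustic_score")

def feature_vector(word):
    # Multi-dimensional feature vector generating unit.
    return [word[name] for name in FEATURES]

def reestimate_confidence(vec, weights, bias):
    # Confidence calculation unit: linear score clamped into [0, 1].
    raw = sum(w * x for w, x in zip(weights, vec)) + bias
    return min(1.0, max(0.0, raw))

word = {"word_posterior": 0.7, "competitor_posterior_diff": 0.4,
        "language_model_score": 0.2, "frame_avg_acoustic_score": 0.3}
weights, bias = [0.6, 0.2, 0.1, 0.1], 0.0  # illustrative, not trained values
print(reestimate_confidence(feature_vector(word), weights, bias))
```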
Fig. 3 is a schematic structural diagram of a correlation calculation module in an embodiment of the present invention, where the correlation calculation module includes:
a word vector calculation unit 31, configured to calculate, for each text file, a word vector of each word in the text file;
the document vector calculation unit 32 is configured to use the confidence re-estimation result of each word as the weight of the word, and perform weighted average on word vectors of all words appearing in the text document to obtain a vector of the text document:
and the correlation calculation unit 33 is configured to calculate the correlation between the text file and the user interest model according to the vector of the text file.
The specific calculation process of each calculation unit can refer to the description in the foregoing embodiment of the method of the present invention, and is not described herein again.
The above display module 209 can display the retrieved voice file information according to the relevance. In practical application, the corresponding voice file information may be displayed in descending order of relevance: for example, all voice file information whose relevance exceeds a set threshold may be displayed, or a set number of voice files may be displayed. The voice file information may include the subject name, abstract, link, and similar information of the voice file, which is not limited in the embodiment of the present invention.
Fig. 4 is a schematic diagram of another structure of the voice file retrieval system according to the embodiment of the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the system further includes: a setting module 401 and a level determination module 402. The setting module 401 is configured to set correlation threshold values for different importance levels; the level determining module 402 is configured to determine an importance level of each speech file according to the relevance between each text file and the user interest model and the relevance threshold.
Accordingly, in this embodiment, the presentation module 209 is configured to present not only the retrieved voice file information, but also the importance level information of the voice file when presenting the voice file information.
To address the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the voice file retrieval system provided by the embodiment of the present invention extracts multi-knowledge-source features of each word in the transcribed text file, re-estimates the confidence of each word using these features, filters meaningless words and sentences out of the text file, and calculates the relevance between each text file and the user interest model according to the confidence re-estimation result; the retrieved voice files are then displayed according to the relevance, which effectively reduces the influence of transcription errors on file ranking. The voice file retrieval system of the embodiment of the present invention therefore not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval result.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention is described herein using specific embodiments, but the above embodiments are only used to help understanding the method and system of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. A method for retrieving a voice file, comprising:
training a user interest model corresponding to the retrieval key words;
acquiring each voice file to be retrieved;
performing voice transcription on the voice file to obtain a transcription result;
obtaining a text file corresponding to the voice file and multi-knowledge-source characteristics of each word in the text file according to the transcription result;
performing confidence coefficient reevaluation on each word by using the multi-knowledge-source characteristics, and filtering out nonsense words and sentences in the text file, wherein the performing of the confidence coefficient reevaluation on each word means that the confidence coefficient of each word is recalculated;
calculating the correlation degree of each text file and the user interest model according to the confidence coefficient reestimation result;
and displaying the retrieved voice file information according to the relevance.
2. The method according to claim 1, wherein the search keyword is one or more search keywords input by a user during a search, or one or more keywords previously collected from some specific context corpus.
3. The method of claim 1, wherein training the user interest model corresponding to the search keyword comprises:
collecting the corpus containing the search keywords;
calculating word vectors of all words in the corpus;
and training a regression model by using the word vector, and taking the regression model as a user interest model.
4. The method of claim 1, wherein the transcription result is in a word-level confusion network format, wherein the confusion network stores the time position, acoustic model score, language model score and original confidence level of each word in the voice file;
the multi-knowledge-source features include at least two of the following features: a word posterior probability; a posterior probability difference of competing words; a language model score; a frame-averaged acoustic model score.
5. The method of claim 4, further comprising:
segmenting each word in the confusion network to obtain phoneme information corresponding to the word;
the multi-knowledge-source features further include any one or more of: the posterior probability of the phoneme and the variance of the state frame corresponding to each word; a word position coefficient; word length; whether it is a stop word; a duration; the number of competing words; short time average energy.
6. The method of claim 4 or 5, wherein the confidence reestimating the words in the text file comprises:
generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge-source features;
and calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
7. The method of claim 6, wherein said calculating the relevance of each text file to the user interest model based on the confidence reestimation result comprises:
for each text file, calculating a word vector of each word in the text file;
taking the confidence coefficient reestimation result of each word as the weight of the word, and carrying out weighted average on word vectors of all words appearing in the text file to obtain the vector of the text file:
and calculating the correlation degree of the text file and the user interest model according to the vector of the text file.
8. The method according to any one of claims 1 to 5, wherein the presenting the retrieved voice file information according to the relevance comprises:
sequentially displaying the voice file information with the correlation degree larger than a set threshold value according to the correlation degree from large to small; or
And displaying the voice file information with set number in turn from large to small according to the degree of correlation.
9. The method of claim 8, further comprising:
setting correlation threshold values aiming at different importance levels;
determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
and when the voice file information is displayed, displaying the importance level information of the voice file.
10. A voice document retrieval system, comprising:
the model training module is used for training a user interest model corresponding to the retrieval key words;
the voice file acquisition module is used for acquiring each voice file to be retrieved;
the voice transcription module is used for carrying out voice transcription on the voice file to obtain a transcription result;
the text file generating module is used for obtaining a text file corresponding to the voice file according to the transcription result;
the characteristic acquisition module is used for acquiring multi-knowledge-source characteristics of each word in the text file;
the confidence coefficient reestimation module is used for reestimating the confidence coefficient of each word by using the multi-knowledge-source characteristics, wherein the reestimation of the confidence coefficient of each word means that the confidence coefficient of each word is recalculated;
the filtering module is used for filtering meaningless words and sentences in the text file;
the relevancy calculation module is used for calculating the relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module is used for displaying the retrieved voice file information according to the relevancy.
11. The system of claim 10, wherein the model training module comprises:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
12. The system of claim 10, wherein the transcription result is in a word-level confusion network format, wherein the confusion network stores the time position, acoustic model score, language model score, and original confidence of each word in the voice file; the multi-knowledge-source features include at least two of the following features: a word posterior probability; a posterior probability difference of competing words; a language model score; a frame-averaged acoustic model score;
the confidence reevaluation module includes:
the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features;
and the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by utilizing the pre-trained regression model and the multi-dimensional feature vector of each word.
13. The system of claim 10, wherein the correlation calculation module comprises:
the word vector calculation unit is used for calculating the word vector of each word in each text file;
the document vector calculation unit is used for taking the confidence coefficient reestimation result of each word as the weight of the word, and carrying out weighted average on word vectors of all words appearing in the text document to obtain the vector of the text document:
and the relevancy calculation unit is used for calculating the relevancy of the text file and the user interest model according to the vector of the text file.
14. The system according to any one of claims 10 to 13, wherein the presentation module is specifically configured to present, in order from the largest to the smallest in the correlation, the voice files with the correlation larger than the set threshold, or present, in order from the largest to the smallest in the correlation, the voice files with the set number.
15. The system of claim 14, further comprising:
the setting module is used for setting correlation threshold values aiming at different importance levels;
the level determining module is used for determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
the display module is further used for displaying the importance level information of the voice file when displaying the voice file information.
CN201510882391.9A 2015-11-30 2015-11-30 Voice file retrieval method and system Active CN105551485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510882391.9A CN105551485B (en) 2015-11-30 2015-11-30 Voice file retrieval method and system

Publications (2)

Publication Number Publication Date
CN105551485A CN105551485A (en) 2016-05-04
CN105551485B true CN105551485B (en) 2020-04-21

Family

ID=55830634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510882391.9A Active CN105551485B (en) 2015-11-30 2015-11-30 Voice file retrieval method and system

Country Status (1)

Country Link
CN (1) CN105551485B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2556023B (en) * 2016-08-15 2022-02-09 Intrasonics Sarl Audio matching
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN108615526B (en) * 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
CN109376224B (en) * 2018-10-24 2020-07-21 深圳市壹鸽科技有限公司 Corpus filtering method and apparatus
CN109708256B (en) * 2018-12-06 2020-07-03 珠海格力电器股份有限公司 Voice determination method and device, storage medium and air conditioner
CN111429912B (en) * 2020-03-17 2023-02-10 厦门快商通科技股份有限公司 Keyword detection method, system, mobile terminal and storage medium
CN111179939B (en) * 2020-04-13 2020-07-28 北京海天瑞声科技股份有限公司 Voice transcription method, voice transcription device and computer storage medium
CN113314108B (en) * 2021-06-16 2024-02-13 深圳前海微众银行股份有限公司 Method, apparatus, device, storage medium and program product for processing voice data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0651372A2 (en) * 1993-10-27 1995-05-03 AT&T Corp. Automatic speech recognition (ASR) processing using confidence measures
GB2364814A (en) * 2000-07-12 2002-02-06 Canon Kk Speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021856B (en) * 2006-10-11 2010-10-27 北京新岸线网络技术有限公司 Distributing speech searching system
CN101510222B (en) * 2009-02-20 2012-05-30 北京大学 Multilayer index voice document searching method
CN102023994B (en) * 2009-09-22 2013-05-22 株式会社理光 Device for retrieving voice file and method thereof
CN102314876B (en) * 2010-06-29 2013-04-10 株式会社理光 Speech retrieval method and system
CN103793515A (en) * 2014-02-11 2014-05-14 安徽科大讯飞信息科技股份有限公司 Service voice intelligent search and analysis system and method

Also Published As

Publication number Publication date
CN105551485A (en) 2016-05-04

Similar Documents

Publication Publication Date Title
CN105551485B (en) Voice file retrieval method and system
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN106328147B (en) Speech recognition method and device
TWI536364B (en) Automatic speech recognition method and system
US10515292B2 (en) Joint acoustic and visual processing
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
WO2018097091A1 (en) Model creation device, text search device, model creation method, text search method, data structure, and program
CN109331470B (en) Method, device, equipment and medium for processing answering game based on voice recognition
CN108538286A (en) A kind of method and computer of speech recognition
CN110164447B (en) Spoken language scoring method and device
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
CN102280106A (en) VWS method and apparatus used for mobile communication terminal
CN111341305A (en) Audio data labeling method, device and system
CN106446018B (en) Query information processing method and device based on artificial intelligence
JPWO2009081861A1 (en) Word category estimation device, word category estimation method, speech recognition device, speech recognition method, program, and recording medium
CN111329494B (en) Depression reference data acquisition method and device
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
US20210151036A1 (en) Detection of correctness of pronunciation
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN104347071B (en) Method and system for generating reference answers of spoken language test
CN105869622B (en) Chinese hot word detection method and device
JP2010257425A (en) Topic boundary detection device and computer program
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
Bharti et al. Automated speech to sign language conversion using Google API and NLP

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant