CN105551485B - Voice file retrieval method and system - Google Patents

Voice file retrieval method and system

Info

Publication number
CN105551485B
CN105551485B (application CN201510882391.9A)
Authority
CN
China
Prior art keywords
word
voice
file
text file
voice file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510882391.9A
Other languages
Chinese (zh)
Other versions
CN105551485A (en)
Inventor
王建社
柳林
冯翔
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Information Technology Co Ltd
Original Assignee
Iflytek Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Information Technology Co Ltd filed Critical Iflytek Information Technology Co Ltd
Priority to CN201510882391.9A priority Critical patent/CN105551485B/en
Publication of CN105551485A publication Critical patent/CN105551485A/en
Application granted granted Critical
Publication of CN105551485B publication Critical patent/CN105551485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/3332 - Query translation
    • G06F 16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Abstract

The invention discloses a voice file retrieval method and system. The method includes: training a user interest model corresponding to the search keywords; acquiring each voice file to be retrieved; performing voice transcription on the voice file to obtain a transcription result; obtaining, from the transcription result, a text file corresponding to the voice file and multi-knowledge-source features for each word in the text file; re-estimating the confidence of each word using the multi-knowledge-source features and filtering out meaningless words and sentences in the text file; calculating the relevance of each text file to the user interest model according to the confidence re-estimation results; and displaying the retrieved voice files according to relevance. The invention improves both the efficiency and the accuracy of voice file retrieval.

Description

Voice file retrieval method and system
Technical Field
The invention relates to the field of voice signal processing, in particular to a voice file retrieval method and a voice file retrieval system.
Background
With the continuous development of speech processing technology, practitioners in more and more applications try to extract the required information from speech data, for example retrieving the voice files needed for a specific application scenario from massive speech data. Traditional approaches to retrieving useful files from a large number of voice files fall into two categories:
The first is manual monitoring of voice files to find highly relevant useful files; this consumes a great deal of manpower and material resources and is inefficient.
The second is to transcribe the voice files into text files and then search the text. Because transcription accuracy cannot be well guaranteed under complex noise environments, far-field conditions, and similar factors, the transcription results generally need to be checked manually to guarantee retrieval accuracy, which again entails high labor cost and low efficiency.
Disclosure of Invention
The invention provides a voice file retrieval method and system to solve the problems of low efficiency and poor accuracy caused by voice transcription errors in conventional voice file retrieval.
Therefore, the invention provides the following technical scheme:
a method of voice file retrieval comprising:
training a user interest model corresponding to the retrieval keywords;
acquiring each voice file to be retrieved;
performing voice transcription on the voice file to obtain a transcription result;
obtaining a text file corresponding to the voice file and multi-knowledge-source characteristics of each word in the text file according to the transcription result;
performing confidence reevaluation on each word by using the multi-knowledge-source characteristics, and filtering out meaningless words and sentences in the text file;
calculating the correlation degree of each text file and the user interest model according to the confidence coefficient reestimation result;
and displaying the retrieved voice file information according to the relevance.
Preferably, the search keywords are one or more keywords input by a user at search time, or one or more keywords collected in advance from corpora of specific scenarios.
Preferably, the training of the user interest model corresponding to the search keyword includes:
collecting the corpus containing the search keywords;
calculating word vectors of all words in the corpus;
and training a regression model by using the word vector, and taking the regression model as a user interest model.
Preferably, the transcription result is in a word-level confusion network format, and the time position, acoustic model score, language model score, and original confidence of each word in the voice file are stored in the confusion network;
the multi-knowledge-source features include at least two of the following features: a word posterior probability; poor posterior probability of competing words; scoring a language model; the frame average acoustic model score.
Preferably, the method further comprises:
segmenting each word in the confusion network to obtain phoneme information corresponding to the word;
the multi-knowledge-source features further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy.
Preferably, the performing confidence re-estimation on each word in the text file includes:
generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge-source features;
and calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
Preferably, the calculating the relevance of each text file to the user interest model according to the confidence reevaluation result includes:
for each text file, calculating a word vector of each word in the text file;
taking the confidence re-estimation result of each word as the weight of the word, and performing a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;
and calculating the correlation degree of the text file and the user interest model according to the vector of the text file.
Preferably, the displaying the retrieved voice file information according to the relevancy includes:
sequentially displaying the voice file information whose relevance is greater than a set threshold, in descending order of relevance; or
sequentially displaying a set number of voice file information items in descending order of relevance.
Preferably, the method further comprises:
setting correlation threshold values aiming at different importance levels;
determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
and when the voice file information is displayed, displaying the importance level information of the voice file.
A voice document retrieval system comprising:
the model training module is used for training a user interest model corresponding to the retrieval key words;
the voice file acquisition module is used for acquiring each voice file to be retrieved;
the voice transcription module is used for carrying out voice transcription on the voice file to obtain a transcription result;
the text file generating module is used for obtaining a text file corresponding to the voice file according to the transcription result;
the characteristic acquisition module is used for acquiring multi-knowledge-source characteristics of each word in the text file;
the confidence coefficient reestimation module is used for reestimating the confidence coefficient of each word by utilizing the multi-knowledge-source characteristics;
the filtering module is used for filtering meaningless words and sentences in the text file;
the relevancy calculation module is used for calculating the relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module is used for displaying the retrieved voice file information according to the relevancy.
Preferably, the model training module comprises:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
Preferably, the transcription result is in a word-level confusion network format, and the time position, acoustic model score, language model score, and original confidence of each word in the voice file are stored in the confusion network; the multi-knowledge-source features include at least two of the following: word posterior probability; competing-word posterior probability difference; language model score; frame-averaged acoustic model score;
the confidence reevaluation module includes:
the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features;
and the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by utilizing the pre-trained regression model and the multi-dimensional feature vector of each word.
Preferably, the correlation calculation module includes:
the word vector calculation unit is used for calculating the word vector of each word in each text file;
the document vector calculation unit is used for taking the confidence re-estimation result of each word as the word's weight and performing a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;
and the relevancy calculation unit is used for calculating the relevancy of the text file and the user interest model according to the vector of the text file.
Preferably, the presentation module is specifically configured to present, in descending order of relevance, the voice files whose relevance exceeds the set threshold, or to present a set number of voice files in descending order of relevance.
Preferably, the system further comprises:
the setting module is used for setting correlation threshold values aiming at different importance levels;
the level determining module is used for determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
the display module is further used for displaying the importance level information of the voice file when displaying the voice file information.

In view of the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the method and system of the embodiments of the invention extract multi-knowledge-source features for each word in the transcribed text file, re-estimate the confidence of each word using these features, filter out meaningless words and sentences in the text file, calculate the relevance between each text file and the user interest model according to the confidence re-estimation results, and display the retrieved voice files according to relevance, thereby effectively reducing the influence of transcription errors on file ranking. The voice file retrieval method and system not only greatly improve the efficiency of voice file retrieval but also ensure the accuracy of the retrieval results.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below cover only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for voice file retrieval in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an architecture of a voice document retrieval system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure of a correlation calculation module according to an embodiment of the present invention;
fig. 4 is another structural diagram of the voice file retrieval system according to the embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a voice file retrieval method according to an embodiment of the present invention, which includes the following steps:
Step 101: train a user interest model corresponding to the search keywords.
It should be noted that the search keywords may be one or more keywords input by the user at search time, or one or more keywords collected in advance from corpora of specific scenarios; the embodiment of the present invention does not limit this.
The user interest model may be a regression model, such as an SVM (Support Vector Machine) or RNN (Recurrent Neural Network) model. When training the regression model, word vector representations of the search keywords can be computed with an existing Word Embedding technique, and the regression model is trained dynamically in combination with word vectors from the text to be searched that are unrelated to the search keywords, yielding the final user interest model. Specifically, corpora containing the search keywords may be taken as positive samples from a large pre-prepared corpus and from the transcripts of the speech to be searched, while corpora unrelated to the search keywords are randomly sampled as negative samples; the sample corpora are converted into word vectors using Word Embedding, and the regression model is trained on the positive and negative word vectors.
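A minimal sketch of this training step, assuming toy two-dimensional word vectors and a simple perceptron-trained linear model standing in for the SVM/RNN and Word Embedding components named above; all names and values here are illustrative, not the patent's actual implementation:

```python
import random

def train_interest_model(pos_vectors, neg_vectors, epochs=50, lr=0.1):
    """Train a linear classifier (a stand-in for the regression model)
    on word vectors of positive corpora (containing the search keyword)
    and negative corpora (unrelated), as the text describes."""
    dim = len(pos_vectors[0])
    w, b = [0.0] * dim, 0.0
    data = [(v, 1.0) for v in pos_vectors] + [(v, -1.0) for v in neg_vectors]
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # perceptron update on misclassified samples
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# hypothetical word vectors: positives cluster right, negatives left
pos = [[1.0, 0.2], [0.9, 0.1]]
neg = [[-0.8, -0.1], [-1.0, 0.0]]
w, b = train_interest_model(pos, neg)
```

In practice the positive/negative vectors would come from a trained embedding over the keyword-bearing and unrelated corpora; the perceptron merely illustrates the linear decision boundary the patent's SVM would learn.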
Step 102: acquire each voice file to be retrieved.
Step 103: perform voice transcription on each voice file to obtain a transcription result.
Specifically, a large-scale voice transcription system may be used to transcribe the voice file.
In the embodiment of the invention, the transcription result takes the form of a word-level confusion network, which contains not only the best candidate word at each position but also several competing candidate words. The confusion network stores, for each word in the voice file, its time position, acoustic model score, language model score, and original confidence, which facilitates the subsequent extraction of multi-knowledge-source features. The original confidence can be computed from the posterior probability of each word.
It should be noted that, in practical applications, the maximum number of competing candidate words retained at each position may be set, for example, to 15. The competing candidates may be selected in descending order of original confidence up to that number, or by keeping all candidates whose confidence exceeds a set threshold. Furthermore, the original confidences of all competing candidates at the same position sum to 1.
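The candidate-retention rule described above can be sketched as follows; the list-of-(word, confidence) data layout is an assumption, while the top-N cap and the renormalization to a sum of 1 follow the text:

```python
def prune_slot(candidates, max_keep=15, conf_threshold=None):
    """Prune the competing candidate words at one confusion-network
    position, either keeping the top-N by original confidence or
    keeping those above a confidence threshold, then renormalize so
    the kept confidences sum to 1."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if conf_threshold is not None:
        kept = [c for c in ranked if c[1] > conf_threshold]
    else:
        kept = ranked[:max_keep]
    total = sum(conf for _, conf in kept)
    return [(word, conf / total) for word, conf in kept]

# hypothetical slot with four competing candidates
slot = [("cat", 0.6), ("cap", 0.25), ("cut", 0.1), ("cot", 0.05)]
pruned = prune_slot(slot, max_keep=2)
```

After pruning to two candidates, their confidences (0.6 and 0.25) are rescaled by their sum so the slot again forms a probability distribution.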
Step 104: obtain, from the transcription result, a text file corresponding to the voice file and the multi-knowledge-source features of each word in the text file.
Specifically, the text file corresponding to the voice file can be obtained by decoding the confusion network.
The multi-knowledge-source features may include at least two of the following: word posterior probability; competing-word posterior probability difference; language model score; frame-averaged acoustic model score. To make the subsequent confidence re-estimation more accurate, the extracted features may further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy.
These features are described below:
(1) Word posterior probability: the posterior probability of the current word;
(2) Competing-word posterior probability difference: the difference between the posterior probabilities of the two best candidate words between two adjacent nodes in the confusion network;
(3) Language model score: the N-Gram language model score of the current word;
(4) Frame-averaged acoustic model score: the acoustic model score of the current word divided by the total number of feature frames of the word. For example, if acoustic features (e.g., MFCCs) are extracted with a 10-millisecond frame shift, 1 second of speech yields about 100 frames; a word such as "iFlytek" lasting 0.7 seconds in the voice file thus corresponds to 70 frames;
(5) Phoneme posterior probability: the average posterior probability of the phonemes corresponding to the current word;
(6) State frame variance: the variance of the per-state feature-frame counts corresponding to the current word;
(7) Word position coefficient: the position i of the current word in the sentence divided by the total number of words N in that sentence;
(8) Word length: the number of characters in the current word;
(9) Whether the current word is a stop word;
(10) Duration: how long the current word lasts;
(11) Number of competing words: the total number of words between two adjacent nodes in the confusion network;
(12) Short-time average energy: the short-time average energy of the segment of the voice file corresponding to the current word.
It should be noted that state-level information (the state is the minimum modeling unit of speech; a word generally contains multiple phonemes, and each phoneme contains multiple states) can be obtained by forced alignment (FA) of each word in the confusion network, which yields the posterior probability of each state; the posterior probability of a phoneme is then the average of the posterior probabilities of all its states.
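The state-to-phoneme averaging in this note can be sketched as follows; the nested-list alignment format is a hypothetical stand-in for the output of FA forced segmentation:

```python
def phoneme_posterior(state_posteriors):
    """Posterior of a phoneme = average of the posteriors of all
    states within the phoneme, per the forced-alignment note."""
    return sum(state_posteriors) / len(state_posteriors)

# hypothetical word with two phonemes, each a list of state posteriors
word_phonemes = [[0.9, 0.8, 0.85], [0.7, 0.75]]
per_phoneme = [phoneme_posterior(s) for s in word_phonemes]
# word-level phoneme posterior feature = average over the phonemes
word_ppp = sum(per_phoneme) / len(per_phoneme)
```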
Step 105: re-estimate the confidence of each word using the multi-knowledge-source features, and filter out meaningless words and sentences in the text file.
Specifically, a multi-dimensional feature vector may be generated for each word from the multi-knowledge-source features, and the confidence of each word computed with a pre-trained regression model (an SVM model is used as the example below) and the word's feature vector.
The following describes the process of generating a multi-dimensional feature vector (here, 18 dimensions) for each word from two knowledge-source features: the word posterior probability and the competing-word posterior probability difference.
For convenience, the features are listed in subscript order within the vector:
1) Dimensions 1 to 9: word posterior probabilities WPP(i-1), WPP²(i-1), WPP³(i-1), WPP(i), WPP²(i), WPP³(i), WPP(i+1), WPP²(i+1), WPP³(i+1), where i is the position of the current word in the sentence. The posterior probability WPP(i) of word i is defined as:

WPP(i) = α_t(i)·β_t(i) / Σ_{j∈Ω} α_t(j)·β_t(j)    (1)

p(i) = p_ac(i)·p_lm(i)    (2)

where α_t(i) denotes the forward probability of word i at time t and β_t(i) its backward probability, both computed with the standard forward-backward algorithm using the word score p(i); Ω denotes the set of all candidate words appearing at time t; p_ac(i) is the acoustic model score of word i and p_lm(i) its language model score.
2) Dimensions 10 to 18: competing-word posterior probability differences

ΔWPP(i) = WPP_onebest(i) - WPP_twobest(i)

giving the nine features ΔWPP(i-1), ΔWPP²(i-1), ΔWPP³(i-1), ΔWPP(i), ΔWPP²(i), ΔWPP³(i), ΔWPP(i+1), ΔWPP²(i+1), ΔWPP³(i+1), where i is the position of the current word in the sentence and the subscripts onebest and twobest denote the first-best and second-best candidates.
The score S_word of the multi-knowledge-source feature vector is then computed with an SVM model trained in advance:

S_word = w_1·x + b_1    (3)

where w_1 is the normal vector of the SVM classification plane, x is the input multi-knowledge-source feature vector, and b_1 is a bias constant; w_1 and b_1 are trained in advance on positive and negative word examples.
Since a standard SVM classifier does not output its result as a probability, and the embodiment of the present invention needs a probability-style confidence for each word, the SVM output must be transformed into a probabilistic score. An existing method can be used; one option is a sigmoid transformation of the SVM output:

WPP_word = 1 / (1 + exp(A·S_word + B))    (4)

where WPP_word is the re-estimated word confidence, and the transformation parameters A and B are obtained by training under a maximum-likelihood criterion.
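Formulas (3) and (4) can be sketched together as follows; the weight vector, bias, and transformation parameters are illustrative placeholders, since in the patent they are trained from word examples and by maximum likelihood:

```python
import math

def svm_score(w1, x, b1):
    """Raw SVM score S_word = w1 . x + b1, as in formula (3)."""
    return sum(wi * xi for wi, xi in zip(w1, x)) + b1

def sigmoid_confidence(s_word, A=-1.0, B=0.0):
    """Map the raw SVM score to a probability-style confidence via the
    sigmoid transform of formula (4); with A < 0, a larger score yields
    a larger confidence. A and B here are placeholder values."""
    return 1.0 / (1.0 + math.exp(A * s_word + B))

s = svm_score([0.5, -0.2], [1.0, 2.0], 0.1)  # 0.5 - 0.4 + 0.1 = 0.2
conf = sigmoid_confidence(s)
```

This is the standard Platt-style calibration of a margin score into (0, 1); the sign convention for A simply fixes whether large margins map to high or low confidence.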
Next, taking all 12 features above as an example, the process of generating a multi-dimensional feature vector (here, 32 dimensions) for each word is described.
For convenience, the features are listed in subscript order within the vector:
1) Dimensions 1 to 9: word posterior probabilities WPP(i-1), WPP²(i-1), WPP³(i-1), WPP(i), WPP²(i), WPP³(i), WPP(i+1), WPP²(i+1), WPP³(i+1), where i is the position of the current word in the sentence; the definition of WPP(i) is given in formulas (1) and (2) above.
2) Dimensions 10 to 18: competing-word posterior probability differences ΔWPP(i-1), ΔWPP²(i-1), ΔWPP³(i-1), ΔWPP(i), ΔWPP²(i), ΔWPP³(i), ΔWPP(i+1), ΔWPP²(i+1), ΔWPP³(i+1), with ΔWPP(i) = WPP_onebest(i) - WPP_twobest(i), where the subscripts onebest and twobest denote the first-best and second-best candidates.
3) Dimensions 19 to 21: N-Gram language model scores P_lm(i-1), P_lm(i), P_lm(i+1);
4) Dimensions 22 to 24: frame-averaged acoustic model scores P_ac(i-1)/N_{i-1}, P_ac(i)/N_i, P_ac(i+1)/N_{i+1}, where N_i denotes the number of speech frames corresponding to word i;
5) Dimension 25: phoneme posterior probability PPP_i of the word:

PPP_i = (1/N_phone) Σ_{j=1}^{N_phone} PPP(ph_j)    (5)

PPP(ph_j) = (1/(t_e - t_s + 1)) Σ_{t=t_s}^{t_e} p(ph_j | O_t)    (6)

The acoustic distribution of the phonemes is modeled with a deep neural network (e.g., an RNN) whose input is the acoustic features and whose output is the phoneme posterior probabilities; M denotes the output dimension of the network. For Chinese, M covers 40 monophones plus sil (silence) and sp (inter-word pause), 42 outputs in total. N_phone denotes the total number of phonemes corresponding to word i; p(ph_j | O_t) is the posterior probability that the phoneme is j when the current speech frame is O_t; t_s and t_e denote the start and end frames of the current phoneme to be re-estimated (obtained during the speech transcription process), while the start and end frames of the s-th state within the phoneme are obtained by state-level segmentation of the word.
6) Dimension 26: state frame variance σ_sframe:

σ_sframe = (1/N_s) Σ_{s=1}^{N_s} (F_s - μ_sframe)²    (7)

μ_sframe = (1/N_s) Σ_{s=1}^{N_s} F_s    (8)

where N_s denotes the number of states corresponding to the current word, F_s denotes the number of frames assigned to the s-th state of the current word (obtained by state-level segmentation of the word), and μ_sframe is the mean frame count over the N_s states.
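The state-frame-variance feature can be sketched directly from the per-state frame counts; the counts below are illustrative:

```python
def state_frame_variance(frame_counts):
    """Population variance of the per-state frame counts F_s of a word,
    matching the state-frame-variance feature."""
    n = len(frame_counts)
    mu = sum(frame_counts) / n           # mean frame count over states
    return sum((f - mu) ** 2 for f in frame_counts) / n

# hypothetical word with three states holding 10, 12, and 8 frames
var = state_frame_variance([10, 12, 8])
```

Intuitively, a very uneven spread of frames across a word's states (large variance) often signals a poor alignment, which is why it serves as a confidence feature.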
7) Dimension 27: word position coefficient i_loc/N_w, where i_loc denotes the position index of the current word in the sentence and N_w the total number of words in the sentence;
8) Dimension 28: word length, i.e., the number of characters in the current word;
9) Dimension 29: whether the current word is a stop word (1 if yes, 0 otherwise);
10) Dimension 30: the duration of the current word, in seconds;
11) Dimension 31: the total number of competing words for the current word, i.e., the number of arcs between the two adjacent nodes in the confusion network;
12) Dimension 32: the short-time average energy of the segment of the speech file corresponding to the current word.
The process of re-estimating the confidence of each word with the 32-dimensional feature vector generated from the multi-knowledge-source features follows the description of formulas (3) and (4) above and is not repeated here.
To filter out the meaningless words and sentences in the text file, the transcribed text may be parsed with dependency parsing, the parsing result converted into word vectors (e.g., one-hot vectors), the words in the transcript classified using these vectors as features together with a classifier (e.g., an SVM), and meaningless words (e.g., filler words) and sentences filtered out according to the classification result.
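A heavily simplified stand-in for this filtering step: the patent uses dependency parsing plus an SVM classifier, while the sketch below approximates the same idea with a hypothetical stop-list of filler words:

```python
# hypothetical filler-word list; the patent instead classifies words
# with dependency-parse features and an SVM
FILLERS = {"uh", "um", "er", "ah"}

def filter_meaningless(words):
    """Drop filler/meaningless tokens from a transcribed sentence."""
    return [w for w in words if w.lower() not in FILLERS]

cleaned = filter_meaningless(["Um", "retrieve", "the", "voice", "file"])
```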
It should be noted that confidence re-estimation and the filtering of meaningless words and sentences can be performed in either order: the confidence of each word may be re-estimated first and the text file filtered afterwards, or the filtering may be performed first and the re-estimation afterwards.
Step 106: calculate the relevance of each text file to the user interest model according to the confidence re-estimation results.
First, for each filtered text file, an existing Word Embedding technique is used to compute the word vector of each word, denoted V.
Then, taking the re-estimated confidence of each word as its weight, a weighted average of the word vectors of all words appearing in the text file gives the vector of the text file:
V_doc = (1/W) Σ_{i=1}^{N_word} WPP_i·V_i    (9)

W = Σ_{i=1}^{N_word} WPP_i    (10)

where N_word is the total number of words in the filtered text file, WPP_i denotes the confidence of the i-th word, V_i denotes the word vector of the i-th word, and V_doc denotes the vector of the filtered document.
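The confidence-weighted document vector can be sketched as follows, using toy two-dimensional word vectors; the computation matches the weighted average described above:

```python
def document_vector(word_vectors, confidences):
    """Confidence-weighted average of word vectors:
    V_doc = sum(WPP_i * V_i) / sum(WPP_i)."""
    total = sum(confidences)
    dim = len(word_vectors[0])
    doc = [0.0] * dim
    for vec, wpp in zip(word_vectors, confidences):
        for d in range(dim):
            doc[d] += wpp * vec[d]
    return [v / total for v in doc]

# two words: a high-confidence word dominates the document vector
vecs = [[1.0, 0.0], [0.0, 1.0]]
doc = document_vector(vecs, [0.9, 0.1])
```

Weighting by re-estimated confidence is what reduces the influence of likely transcription errors on the document representation, and hence on ranking.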
Finally, the relevance between the current text file and the user interest model (an SVM model, for example) is calculated:

S_doc = w_2·V_doc + b_2    (11)

where w_2 is the normal vector of the SVM classification plane and b_2 is a bias constant, both trained from a large amount of training data.
Furthermore, the relevance scores output by the SVM can be normalized so that the retrieved files can be ranked more intuitively.
And 107, displaying the retrieved voice file information according to the relevance.
Specifically, the voice file information whose relevance exceeds a set threshold may be displayed in descending order of relevance; alternatively, a set number of voice files may be displayed in descending order of relevance.
In addition, the file relevance scores may be divided by thresholds into different levels to obtain the importance level of each original voice file, for example high, medium, and low; the displayed voice file information is then presented to the user together with its level information.
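The display strategy and the level mapping just described can be sketched as follows; the keep-threshold and the level boundaries are illustrative values, not parameters from the embodiment.

```python
# Sort retrieved files by relevance (descending), keep those above a threshold,
# and attach a hypothetical high/medium/low importance level to each.
def rank_and_level(scores, keep_threshold=0.5,
                   level_bounds=((0.8, "high"), (0.6, "medium"))):
    kept = sorted(((name, s) for name, s in scores.items() if s > keep_threshold),
                  key=lambda pair: pair[1], reverse=True)
    ranked = []
    for name, score in kept:
        # First level whose lower bound the score reaches; otherwise "low".
        level = next((label for bound, label in level_bounds if score >= bound),
                     "low")
        ranked.append((name, score, level))
    return ranked

scores = {"call_a.wav": 0.92, "call_b.wav": 0.65, "call_c.wav": 0.40}
print(rank_and_level(scores))
```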
It should be noted that the displayed voice file information may include the subject name, abstract, link, and similar information of the voice file, which is not limited in this embodiment of the present invention.
To address the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the embodiment extracts multi-knowledge-source features of each word in the transcribed text file, re-estimates the confidence of each word using these features, filters meaningless words and sentences out of the text file, and calculates the relevance between each text file and the user interest model according to the confidence re-estimation result; the retrieved voice files are then displayed according to the relevance, which effectively reduces the influence of transcription errors on file ranking. The voice file ranking method of the embodiment of the present invention therefore not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval result.
Correspondingly, an embodiment of the present invention further provides a voice file retrieval system, as shown in fig. 2, which is a schematic structural diagram of the system.
In this embodiment, the system includes:
a model training module 201, configured to train a user interest model corresponding to the search keyword;
a voice file obtaining module 202, configured to obtain each voice file to be retrieved;
the voice transcription module 203 is used for performing voice transcription on the voice file to obtain a transcription result;
a text file generating module 204, configured to obtain a text file corresponding to the voice file according to the transcription result;
a feature obtaining module 205, configured to obtain multiple knowledge source features of each word in the text file;
a confidence reevaluation module 206, configured to reevaluate confidence of each word in the text file by using the multiple knowledge source features;
the filtering module 207 is used for filtering meaningless words and sentences in the text file;
a relevancy calculation module 208, configured to calculate a relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module 209 is configured to display the retrieved voice file information according to the relevance.
It should be noted that, in practical application, the search keyword may be one or more search keywords input by a user during search, or one or more search keywords collected from some specific scenario corpora in advance, which is not limited in this embodiment of the present invention.
The user interest model may be a regression model. When training it, the model training module 201 may use an existing word embedding technique to compute word-vector representations of the search keywords, and dynamically train the regression model in combination with the word vectors of words in the text to be searched that are unrelated to the search keywords; the trained model serves as the final user interest model. Accordingly, one specific structure of the model training module 201 may include the following units:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
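As a rough illustration of the three units above, a toy "interest model" can be trained on word vectors: vectors of keyword-related words are labeled 1 and unrelated words 0. The 2-dimensional vectors, the labels, and the use of a simple logistic-regression loop in place of a production regressor are all assumptions made for this sketch.

```python
import math

# Toy training loop standing in for the training unit: fit a logistic
# regression on labeled word vectors and use it as the interest model.
def train_interest_model(vectors, labels, lr=0.5, epochs=200):
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def score(model, vec):
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative 2-d "word vectors": first two related to the keyword, last two not.
X = [[1.0, 0.9], [0.9, 1.0], [-1.0, -0.8], [-0.9, -1.0]]
y = [1, 1, 0, 0]
model = train_interest_model(X, y)
print(score(model, [1.0, 1.0]), score(model, [-1.0, -1.0]))
```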
In the embodiment of the present invention, the transcription result is in a word-level confusion network format, which contains not only the best candidate word but also a plurality of competing candidate words. The confusion network stores the time position, acoustic model score, language model score, and original confidence of each word in the voice file. In addition, the multi-knowledge-source features include at least two of the following: the word posterior probability; the posterior probability difference of competing words; the language model score; the frame-averaged acoustic model score. Of course, to make the subsequent confidence re-estimation more accurate, the multi-knowledge-source features may further include any one or more of the following: the phoneme posterior probability and state-frame variance corresponding to each word; the word position coefficient; the word length; whether the word is a stop word; the duration; the number of competing words; the short-time average energy; and so on. These features have been described in detail above and are not repeated here.
Accordingly, the confidence re-estimation module 206 may generate a set of multidimensional feature vectors for each word using the multi-knowledge source features, and then calculate the confidence of each word using a pre-trained regression model (hereinafter, an SVM model is used as an example) and the multidimensional feature vectors of each word. One specific structure of the confidence reevaluation module 206 can include: the system comprises a multi-dimensional feature vector generating unit and a confidence coefficient calculating unit, wherein the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features; the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
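Putting the two units of module 206 together, a hedged sketch: each word's multi-knowledge-source features are packed into one feature vector, which a pre-trained regressor maps to a new confidence. A plain linear function with clamping stands in for the trained SVM regressor; the feature names and weights are illustrative assumptions.

```python
# Pack multi-knowledge-source features into one vector per word, then map it
# to a re-estimated confidence with a stand-in linear "regression model".
FEATURES = ("word_posterior", "competitor_posterior_diff",
            "language_model_score", "frame_avg_acoustic_score")

def feature_vector(word):
    # Multi-dimensional feature vector generating unit.
    return [word[name] for name in FEATURES]

def reestimate_confidence(vec, weights, bias):
    # Confidence calculation unit: linear score clamped into [0, 1].
    raw = sum(w * x for w, x in zip(weights, vec)) + bias
    return min(1.0, max(0.0, raw))

word = {"word_posterior": 0.7, "competitor_posterior_diff": 0.4,
        "language_model_score": 0.2, "frame_avg_acoustic_score": 0.3}
weights, bias = [0.6, 0.2, 0.1, 0.1], 0.0  # illustrative, not trained values
print(reestimate_confidence(feature_vector(word), weights, bias))
```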
Fig. 3 is a schematic structural diagram of a correlation calculation module in an embodiment of the present invention, where the correlation calculation module includes:
a word vector calculation unit 31, configured to calculate, for each text file, a word vector of each word in the text file;
the document vector calculation unit 32 is configured to use the confidence re-estimation result of each word as the weight of the word, and perform weighted average on word vectors of all words appearing in the text document to obtain a vector of the text document:
and the correlation calculation unit 33 is configured to calculate the correlation between the text file and the user interest model according to the vector of the text file.
The specific calculation process of each calculation unit can refer to the description in the foregoing embodiment of the method of the present invention, and is not described herein again.
The above display module 209 can display the retrieved voice file information according to the relevance. In practical application, the corresponding voice file information may be displayed in descending order of relevance: for example, all voice file information whose relevance exceeds a set threshold may be displayed, or a set number of voice files may be displayed. The voice file information may include the subject name, abstract, link, and similar information of the voice file, which is not limited in the embodiment of the present invention.
Fig. 4 is a schematic diagram of another structure of the voice file retrieval system according to the embodiment of the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the system further includes: a setting module 401 and a level determination module 402. The setting module 401 is configured to set correlation threshold values for different importance levels; the level determining module 402 is configured to determine an importance level of each speech file according to the relevance between each text file and the user interest model and the relevance threshold.
Accordingly, in this embodiment, the presentation module 209 is configured to present not only the retrieved voice file information, but also the importance level information of the voice file when presenting the voice file information.
To address the fact that a text file obtained by voice transcription contains a certain number of transcription errors, the voice file retrieval system provided by the embodiment of the present invention extracts multi-knowledge-source features of each word in the transcribed text file, re-estimates the confidence of each word using these features, filters meaningless words and sentences out of the text file, and calculates the relevance between each text file and the user interest model according to the confidence re-estimation result; the retrieved voice files are then displayed according to the relevance, which effectively reduces the influence of transcription errors on file ranking. The voice file retrieval system of the embodiment of the present invention therefore not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval result.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention is described herein using specific embodiments, but the above embodiments are only used to help understanding the method and system of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. A method for retrieving a voice file, comprising:
training a user interest model corresponding to the retrieval key words;
acquiring each voice file to be retrieved;
performing voice transcription on the voice file to obtain a transcription result;
obtaining a text file corresponding to the voice file and multi-knowledge-source characteristics of each word in the text file according to the transcription result;
performing confidence coefficient reevaluation on each word by using the multi-knowledge-source characteristics, and filtering out nonsense words and sentences in the text file, wherein the performing of the confidence coefficient reevaluation on each word means that the confidence coefficient of each word is recalculated;
calculating the correlation degree of each text file and the user interest model according to the confidence coefficient reestimation result;
and displaying the retrieved voice file information according to the relevance.
2. The method according to claim 1, wherein the search keyword is one or more search keywords input by a user during a search, or one or more keywords previously collected from some specific context corpus.
3. The method of claim 1, wherein training the user interest model corresponding to the search keyword comprises:
collecting the corpus containing the search keywords;
calculating word vectors of all words in the corpus;
and training a regression model by using the word vector, and taking the regression model as a user interest model.
4. The method of claim 1, wherein the transcription result is in a word-level confusion network format, wherein the confusion network stores the time position, acoustic model score, language model score and original confidence level of each word in the voice file;
the multi-knowledge-source features include at least two of the following features: a word posterior probability; a posterior probability difference of competing words; a language model score; a frame-averaged acoustic model score.
5. The method of claim 4, further comprising:
segmenting each word in the confusion network to obtain phoneme information corresponding to the word;
the multi-knowledge-source features further include any one or more of: the posterior probability of the phoneme and the variance of the state frame corresponding to each word; a word position coefficient; word length; whether it is a stop word; a duration; the number of competing words; short time average energy.
6. The method of claim 4 or 5, wherein the confidence reestimating the words in the text file comprises:
generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge-source features;
and calculating the confidence coefficient of each word by using the pre-trained regression model and the multi-dimensional feature vector of each word.
7. The method of claim 6, wherein said calculating the relevance of each text file to the user interest model based on the confidence reestimation result comprises:
for each text file, calculating a word vector of each word in the text file;
taking the confidence coefficient reestimation result of each word as the weight of the word, and carrying out weighted average on word vectors of all words appearing in the text file to obtain the vector of the text file:
and calculating the correlation degree of the text file and the user interest model according to the vector of the text file.
8. The method according to any one of claims 1 to 5, wherein the presenting the retrieved voice file information according to the relevance comprises:
sequentially displaying the voice file information with the correlation degree larger than a set threshold value according to the correlation degree from large to small; or
And displaying the voice file information with set number in turn from large to small according to the degree of correlation.
9. The method of claim 8, further comprising:
setting correlation threshold values aiming at different importance levels;
determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
and when the voice file information is displayed, displaying the importance level information of the voice file.
10. A voice document retrieval system, comprising:
the model training module is used for training a user interest model corresponding to the retrieval key words;
the voice file acquisition module is used for acquiring each voice file to be retrieved;
the voice transcription module is used for carrying out voice transcription on the voice file to obtain a transcription result;
the text file generating module is used for obtaining a text file corresponding to the voice file according to the transcription result;
the characteristic acquisition module is used for acquiring multi-knowledge-source characteristics of each word in the text file;
the confidence coefficient reestimation module is used for reestimating the confidence coefficient of each word by using the multi-knowledge-source characteristics, wherein the reestimation of the confidence coefficient of each word means that the confidence coefficient of each word is recalculated;
the filtering module is used for filtering meaningless words and sentences in the text file;
the relevancy calculation module is used for calculating the relevancy between each text file and the user interest model according to the confidence reestimation result;
and the display module is used for displaying the retrieved voice file information according to the relevancy.
11. The system of claim 10, wherein the model training module comprises:
the corpus collection unit is used for collecting the corpus containing the retrieval keywords;
the word vector calculation unit is used for calculating word vectors of all words in the corpus;
and the training unit is used for training a regression model by using the word vector, and taking the regression model as a user interest model.
12. The system of claim 10, wherein the transcription result is in a word-level confusion network format, wherein the confusion network stores the time position, acoustic model score, language model score, and original confidence of each word in the voice file; the multi-knowledge-source features include at least two of the following features: a word posterior probability; a posterior probability difference of competing words; a language model score; a frame-averaged acoustic model score;
the confidence reevaluation module includes:
the multi-dimensional feature vector generating unit is used for generating a group of multi-dimensional feature vectors for each word according to the multi-knowledge source features;
and the confidence coefficient calculation unit is used for calculating the confidence coefficient of each word by utilizing the pre-trained regression model and the multi-dimensional feature vector of each word.
13. The system of claim 10, wherein the correlation calculation module comprises:
the word vector calculation unit is used for calculating the word vector of each word in each text file;
the document vector calculation unit is used for taking the confidence coefficient reestimation result of each word as the weight of the word, and carrying out weighted average on word vectors of all words appearing in the text document to obtain the vector of the text document:
and the relevancy calculation unit is used for calculating the relevancy of the text file and the user interest model according to the vector of the text file.
14. The system according to any one of claims 10 to 13, wherein the presentation module is specifically configured to present, in order from the largest to the smallest in the correlation, the voice files with the correlation larger than the set threshold, or present, in order from the largest to the smallest in the correlation, the voice files with the set number.
15. The system of claim 14, further comprising:
the setting module is used for setting correlation threshold values aiming at different importance levels;
the level determining module is used for determining the importance level of each voice file according to the relevance of each text file and the user interest model and the relevance threshold;
the display module is further used for displaying the importance level information of the voice file when displaying the voice file information.
CN201510882391.9A 2015-11-30 2015-11-30 Voice file retrieval method and system Active CN105551485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510882391.9A CN105551485B (en) 2015-11-30 2015-11-30 Voice file retrieval method and system

Publications (2)

Publication Number Publication Date
CN105551485A CN105551485A (en) 2016-05-04
CN105551485B true CN105551485B (en) 2020-04-21

Family

ID=55830634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510882391.9A Active CN105551485B (en) 2015-11-30 2015-11-30 Voice file retrieval method and system

Country Status (1)

Country Link
CN (1) CN105551485B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2556023B (en) * 2016-08-15 2022-02-09 Intrasonics Sarl Audio matching
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN108615526B (en) * 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
CN109376224B (en) * 2018-10-24 2020-07-21 深圳市壹鸽科技有限公司 Corpus filtering method and apparatus
CN109708256B (en) * 2018-12-06 2020-07-03 珠海格力电器股份有限公司 Voice determination method and device, storage medium and air conditioner
CN111429912B (en) * 2020-03-17 2023-02-10 厦门快商通科技股份有限公司 Keyword detection method, system, mobile terminal and storage medium
CN111179939B (en) * 2020-04-13 2020-07-28 北京海天瑞声科技股份有限公司 Voice transcription method, voice transcription device and computer storage medium
CN113314108B (en) * 2021-06-16 2024-02-13 深圳前海微众银行股份有限公司 Method, apparatus, device, storage medium and program product for processing voice data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0651372A2 (en) * 1993-10-27 1995-05-03 AT&T Corp. Automatic speech recognition (ASR) processing using confidence measures
GB2364814A (en) * 2000-07-12 2002-02-06 Canon Kk Speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021856B (en) * 2006-10-11 2010-10-27 北京新岸线网络技术有限公司 Distributing speech searching system
CN101510222B (en) * 2009-02-20 2012-05-30 北京大学 Multilayer index voice document searching method
CN102023994B (en) * 2009-09-22 2013-05-22 株式会社理光 Device for retrieving voice file and method thereof
CN102314876B (en) * 2010-06-29 2013-04-10 株式会社理光 Speech retrieval method and system
CN103793515A (en) * 2014-02-11 2014-05-14 安徽科大讯飞信息科技股份有限公司 Service voice intelligent search and analysis system and method

Also Published As

Publication number Publication date
CN105551485A (en) 2016-05-04

Similar Documents

Publication Publication Date Title
CN105551485B (en) Voice file retrieval method and system
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN106328147B (en) Speech recognition method and device
TWI536364B (en) Automatic speech recognition method and system
US10515292B2 (en) Joint acoustic and visual processing
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
WO2018097091A1 (en) Model creation device, text search device, model creation method, text search method, data structure, and program
CN109331470B (en) Method, device, equipment and medium for processing answering game based on voice recognition
CN108538286A (en) A kind of method and computer of speech recognition
CN110164447B (en) Spoken language scoring method and device
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
CN102280106A (en) VWS method and apparatus used for mobile communication terminal
CN111341305A (en) Audio data labeling method, device and system
CN106446018B (en) Query information processing method and device based on artificial intelligence
JPWO2009081861A1 (en) Word category estimation device, word category estimation method, speech recognition device, speech recognition method, program, and recording medium
CN111329494B (en) Depression reference data acquisition method and device
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
US20210151036A1 (en) Detection of correctness of pronunciation
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN104347071B (en) Method and system for generating reference answers of spoken language test
CN105869622B (en) Chinese hot word detection method and device
JP2010257425A (en) Topic boundary detection device and computer program
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
Bharti et al. Automated speech to sign language conversion using Google API and NLP

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant