CN104199825A - Information inquiry method and system - Google Patents


Info

Publication number
CN104199825A
Authority
CN
China
Prior art keywords
template
compression
templates
word
checked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410352847.6A
Other languages
Chinese (zh)
Inventor
王东
王晓曦
赵芳
刘荣
游世学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGKE HUILIAN INFORMATION TECHNOLOGY Co Ltd
Tsinghua University
Original Assignee
BEIJING ZHONGKE HUILIAN INFORMATION TECHNOLOGY Co Ltd
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING ZHONGKE HUILIAN INFORMATION TECHNOLOGY Co Ltd and Tsinghua University
Priority to CN201410352847.6A
Publication of CN104199825A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The invention provides an information query method and system. The method comprises: recognizing an input speech signal as text symbols and outputting them, to obtain a query string; matching the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string, wherein the template set comprises a plurality of templates that are shared and merged by directed-graph compression to obtain one or more compressed templates; querying a knowledge base to obtain response information corresponding to the first template; and outputting the response information as speech and/or text. The method addresses the problem that, in traditional query methods based on template matching, the processing logic grows increasingly complex and query efficiency falls as the number and variety of system templates increase.

Description

Information query method and system
Technical field
The present application relates to the field of information technology, and in particular to an information query method and system.
Background
In recent years, with the development of natural language processing technology, intelligent question answering systems have attracted great attention. From the once-popular chat software 'Little Yellow Chicken' to the reply bots now common on major online platforms, intelligent question answering systems are applied in many fields. A high-quality question answering system resolves customers' common questions, reduces labor costs, and can provide continuous service 24 hours a day.
However, most question answering systems take typed text as their input, which is tedious and time-consuming. Text input becomes especially difficult on devices without a keyboard, such as mobile terminals (for example, mobile phones), and for people who have difficulty operating such devices, such as the elderly and the disabled.
Question answering systems based on voice input have therefore emerged. Voice-based intelligent question answering systems are quick and convenient and suit a wide range of devices and users. However, using voice as the input mode also introduces a new problem:
The convenience of voice input brings greater variability in what users say, so the number and variety of system templates must grow with that variability. In traditional search methods based on template matching, as the number and variety of system templates increase, the processing logic becomes more and more complex and search efficiency drops. Matching consumes a large amount of time, which lengthens the user's waiting time, degrades the user experience, and increases the processing load on devices and systems.
Summary of the invention
The present application provides an information query method and system to solve the problems that traditional search methods based on template matching suffer reduced search efficiency and long processing times, and the resulting heavy processing load on devices and systems.
The present application discloses an information query method, comprising:
recognizing an input speech signal as text symbols and outputting them, to obtain a query string;
matching the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string; wherein the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates;
querying a knowledge base to obtain response information corresponding to the first template;
outputting the response information as speech and/or text.
Optionally, the plurality of templates are shared and merged by directed-graph compression in the following manner to obtain one or more compressed templates:
collecting a plurality of sample data items, and splitting each sample data item into individual characters;
arranging the characters obtained from the split in graph-structure form, according to the semantic order of each sample data item, to obtain the plurality of templates; wherein the data structure of the plurality of templates is a graph structure;
sharing and merging the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates.
Optionally, sharing and merging the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates, comprises:
sharing and merging the identical characters and/or differing characters in the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates; wherein the data structure of the compressed template is a compressed directed graph;
wherein,
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar are identical, the identical character is merged in shared form;
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar differ, the differing characters are retained in split (branching) form.
Optionally, matching the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string, comprises:
splitting the query string into individual characters, and arranging the characters obtained from the split in graph-structure form;
obtaining a query set corresponding to the arranged query string, and a plurality of compressed-template sets corresponding to the plurality of compressed templates;
calculating the matching path between the query set and each of the compressed-template sets;
obtaining, from the plurality of compressed-template sets, the path whose matching path to the query set is the smallest;
determining the template indicated by the obtained minimum path as the first template.
Optionally, calculating the matching path between the query set and each of the compressed-template sets comprises:
defining a Token, the Token corresponding to a set v(i, j, h, s), where i and j are the states of v in set I and set J respectively; h is the history path traversed by v in set I and set J; and s is the matching distance of that history path; wherein set I is the set corresponding to a compressed template, and set J is the set corresponding to the query string;
adding a self-loop edge to each state of set I and set J;
performing a graph expansion search on set I and set J after the self-loop edges have been added, to obtain the accumulated search history and matching distance; and obtaining a distance metric;
summing the accumulated search history and matching distance and the distance metric to obtain the matching path.
Optionally, the distance metric comprises D(w1, w2), where D(w1, w2) denotes the distance metric between character w1 and character w2;
obtaining the distance metric comprises computing it with the following formula:
$$D(w_1, w_2) = \min_{r} \sum_{k} M\left(x_r^k, y_r^k\right)$$
where x is the phoneme string of w1; y is the phoneme string of w2; r is an alignment of x and y; $x_r^k$ and $y_r^k$ are the k-th phonemes of x and y under the alignment r; and $M(x_r^k, y_r^k)$ is the confusion-matrix entry for the k-th phonemes of x and y under the alignment r;
wherein the alignment comprises aligning the phoneme string x and the phoneme string y from head to tail.
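As a minimal sketch of the formula above, the minimum over alignments r can be computed with dynamic programming. The text does not say whether an alignment may contain gaps or how unlisted phoneme pairs are scored, so the gap symbol `GAP`, the default cost of 1.0 and the symmetric lookup are all assumptions made for illustration, as are the function and parameter names.

```python
from typing import Dict, Sequence, Tuple

GAP = "-"  # hypothetical gap symbol; gaps are an assumption, not stated in the text


def pair_cost(a: str, b: str, confusion: Dict[Tuple[str, str], float],
              default: float = 1.0) -> float:
    """Look up M(a, b); unlisted identical pairs cost 0, other unlisted pairs cost `default`."""
    if (a, b) in confusion:
        return confusion[(a, b)]
    if (b, a) in confusion:  # assume the confusion matrix is symmetric
        return confusion[(b, a)]
    return 0.0 if a == b else default


def phoneme_distance(x: Sequence[str], y: Sequence[str],
                     confusion: Dict[Tuple[str, str], float]) -> float:
    """Return min over head-to-tail alignments r of sum_k M(x_r^k, y_r^k)."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + pair_cost(x[i - 1], GAP, confusion)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + pair_cost(GAP, y[j - 1], confusion)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + pair_cost(x[i - 1], y[j - 1], confusion),  # align two phonemes
                d[i - 1][j] + pair_cost(x[i - 1], GAP, confusion),           # phoneme of x vs gap
                d[i][j - 1] + pair_cost(GAP, y[j - 1], confusion),           # gap vs phoneme of y
            )
    return d[n][m]


# Illustrative confusion costs (made-up values, not from the source).
confusion = {("s", "sh"): 0.2, ("n", "ng"): 0.3}
distance = phoneme_distance(["s", "i"], ["sh", "i"], confusion)
```

With a metric of this kind, two units whose pronunciations are easily confused receive a small distance instead of the infinite distance an exact comparison would give.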
Correspondingly, the present application also discloses an information query system, comprising:
a speech recognition module, configured to recognize an input speech signal as text symbols and output them, to obtain a query string;
a template matching module, configured to match the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string; wherein the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates;
an answer generation module, configured to query a knowledge base to obtain response information corresponding to the first template;
an output module, configured to output the response information as speech and/or text.
Optionally, the plurality of templates are shared and merged by directed-graph compression through the following modules to obtain one or more compressed templates:
a collection module, configured to collect a plurality of sample data items and split each sample data item into individual characters;
an arrangement module, configured to arrange the characters obtained from the split in graph-structure form, according to the semantic order of each sample data item, to obtain the plurality of templates; wherein the data structure of the plurality of templates is a graph structure;
a template acquisition module, configured to share and merge the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates.
Optionally, the template acquisition module is specifically configured to share and merge the identical characters and/or differing characters in the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates; wherein the data structure of the compressed template is a compressed directed graph;
wherein,
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar are identical, the identical character is merged in shared form;
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar differ, the differing characters are retained in split (branching) form.
Optionally, the template matching module comprises:
a data splitting module, configured to split the query string into individual characters and arrange the characters obtained from the split in graph-structure form;
a set acquisition module, configured to obtain a query set corresponding to the arranged query string, and a plurality of compressed-template sets corresponding to the plurality of compressed templates;
a calculation module, configured to calculate the matching path between the query set and each of the compressed-template sets;
a path acquisition module, configured to obtain, from the plurality of compressed-template sets, the path whose matching path to the query set is the smallest;
a determination module, configured to determine the template indicated by the obtained minimum path as the first template.
Compared with the prior art, the present application has the following advantages:
First, the voice-input-based information query method and system of the embodiments of the present application match the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string. Because the templates in the template set are shared and restructured by directed-graph compression, a large number of different templates are compressed and merged into a small number of compressed templates. This simplifies the logical processing performed during search and matching, effectively improves search efficiency, simplifies the search algorithm, and resolves the inefficiency and complexity of searching a large-scale template set. It also reduces the processing load on devices and systems.
Furthermore, the user inputs a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
Of course, a product implementing the present application need not achieve all of the above advantages at once.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of an information query method in Embodiment 1 of the present application;
Fig. 2 is a flow chart of the steps of an information query method in Embodiment 2 of the present application;
Fig. 3 is a schematic diagram of the character-based FST structure of template 1 in the embodiment shown in Fig. 2;
Fig. 4 is a schematic diagram of the character-based FST structure of template 2 in the embodiment shown in Fig. 2;
Fig. 5 is a schematic diagram of the character-based FST structure of template 3 in the embodiment shown in Fig. 2;
Fig. 6 is a schematic diagram of the FST structure of the compressed template obtained by character-based sharing and merging of template 1, template 2 and template 3 in the embodiment shown in Fig. 2;
Fig. 7 is a flow chart of the steps of an information query method in Embodiment 3 of the present application;
Fig. 8 is a schematic diagram of the system architecture of an intelligent voice question answering system in the embodiment shown in Fig. 7;
Fig. 9 is a structural block diagram of an information query system in Embodiment 4 of the present application;
Fig. 10 is a structural block diagram of an information query system in Embodiment 5 of the present application.
Detailed description
To make the above objects, features and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Referring to Fig. 1, a flow chart of the steps of an information query method in Embodiment 1 of the present application is shown. In this embodiment, the information query method comprises:
Step 102, recognizing the input speech signal as text symbols and outputting them, to obtain a query string.
In this embodiment, the input speech signal is recognized as text symbols through front-end processing (Front End Processing, FE) and search and decoding (Search and Decoding). When processing corpora such as broadcast news or telephone and conversation recordings, corresponding front-end preprocessing is also needed, for example: segmenting long speech into speech segments for input, speech/non-speech discrimination, wideband/narrowband discrimination, male/female voice discrimination, and removal of music segments.
Wherein,
In front-end processing, the basic task is to extract and normalize speech features. Commonly used features are MFCC (Mel Frequency Cepstrum Coefficient) features and PLP (Perceptual Linear Predictive) features. After feature extraction, some normalization is usually applied, for example mean normalization, which reduces channel effects, and variance normalization, which reduces the effect of additive noise. Front-end processing improves the accuracy of the acoustic model (Acoustic Model, AM) and its robustness to factors such as channel, speaker and additive noise.
In search and decoding, the trained acoustic model and language model (Language Model, LM), together with the pronunciation dictionary (Lexicon) that links the two models, are used to recognize the speech signal as text symbols for output.
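The mean and variance normalization mentioned for front-end processing can be sketched as follows. This is only an illustration in plain Python: the per-utterance scope, the function name `mean_variance_normalize` and the handling of zero variance are assumptions, and real systems would apply this to MFCC or PLP vectors produced by a feature extractor.

```python
import math
from typing import List


def mean_variance_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Normalize each feature dimension to zero mean and unit variance over all frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # A dimension with zero variance keeps a divisor of 1.0 to avoid division by zero.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
        for d in range(dim)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


# Two toy 2-dimensional feature frames (made-up values).
normalized = mean_variance_normalize([[1.0, 10.0], [3.0, 10.0]])
```

Subtracting the mean removes a constant offset per dimension, which is how mean normalization reduces channel effects; dividing by the standard deviation equalizes the dynamic range across dimensions.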
Step 104, matching the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string.
Preferably, the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates. The data structure of the plurality of templates may be a graph structure, that is, an FST structure (FST: Finite State Transducer). In this embodiment, the plurality of templates may be shared and merged by directed-graph compression according to the FST structure (that is, the FST structures are compressed by sharing their common parts) to obtain one or more compressed templates, and each compressed template may contain a plurality of templates. Because a compressed template is obtained by compressing data in FST structure, the data structure of the compressed template may also be an FST structure.
In this embodiment, an FST is an extension of the finite-state automaton (Finite State Machine, FSM) and can be used to express a finite state-transition process compactly. An FST can be expressed as a directed graph (A, R), where A is the set of nodes and R is the set of edges. Each edge in R can in turn be expressed as a five-tuple (s, e, w, t, c), where s and e are the start state and arrival state of the edge, w and t are the input element and output element, and c is the weight of the edge. An FST can be used to express a context-free grammar (Context-Free Grammar, CFG), so every template that conforms to a context-free grammar can be expressed as an FST.
Step 106, querying a knowledge base to obtain response information corresponding to the first template.
Step 108, outputting the response information as speech and/or text.
In summary, the information query method of this embodiment matches the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string. Because the templates in the template set are shared and restructured by directed-graph compression, a large number of different templates are compressed and merged into a small number of compressed templates. This simplifies the logical processing performed during search and matching, effectively improves search efficiency, simplifies the search algorithm, and resolves the inefficiency and complexity of searching a large-scale template set. It also reduces the processing load on devices and systems.
Furthermore, the user inputs a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
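The FST representation described under step 104, a directed graph whose edges are five-tuples (s, e, w, t, c), can be sketched as below. The names `Edge` and `linear_fst`, the zero edge weights, and the choice to output the same character that is read are assumptions for illustration; the example string is an assumed reading of a six-character template such as template 1 in Embodiment 2.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    s: int    # start state of the edge
    e: int    # arrival state of the edge
    w: str    # input element
    t: str    # output element
    c: float  # weight of the edge


def linear_fst(template: str) -> List[Edge]:
    """Build a chain FST with one edge per character: states 0..len(template)."""
    # Assumption: each edge outputs the character it reads and carries weight 0.
    return [Edge(i, i + 1, ch, ch, 0.0) for i, ch in enumerate(template)]


# Hypothetical six-character template; it yields nodes 0 to 6 and six edges.
edges = linear_fst("请问鲁迅生日")
```

A six-character template therefore yields seven states, which agrees with the node numbering "0, 1, 2, 3, 4, 5, 6" used for the figures in Embodiment 2.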
Embodiment 2
Referring to Fig. 2, a flow chart of the steps of an information query method in Embodiment 2 of the present application is shown. In this embodiment, the information query method may be carried out by a terminal device/apparatus.
The information query method of this embodiment comprises:
Step 202, the terminal device recognizes the input speech signal as text symbols and outputs them, to obtain a query string.
In this embodiment, the terminal device may either receive voice input and recognize the corresponding speech signal as text symbols for output, or directly receive text input.
Step 204, the terminal device matches the query string against each of a plurality of compressed templates in a template set according to a preset matching rule, to obtain a first template, among the compressed templates, that matches the query string.
Preferably, the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates.
The process of generating a character-based compressed template is described below with reference to Fig. 3, Fig. 4, Fig. 5 and Fig. 6. Fig. 3 is a schematic diagram of the character-based FST structure of template 1 in this embodiment; Fig. 4 is a schematic diagram of the character-based FST structure of template 2 in this embodiment; Fig. 5 is a schematic diagram of the character-based FST structure of template 3 in this embodiment; Fig. 6 is a schematic diagram of the FST structure of the compressed template obtained by character-based sharing and merging of template 1, template 2 and template 3 in this embodiment. Note that each of the numbers "0, 1, 2, 3, 4, 5, 6" in Fig. 3 (or Fig. 4 or Fig. 5) represents a node in the FST structure and, at the same time, a state in the set corresponding to that template.
Template 1, template 2 and template 3 conform to a context-free grammar:
<template 1> := 请问鲁迅生日 (May I ask, Lu Xun's birthday)
<template 2> := 请问杨过生日 (May I ask, Yang Guo's birthday)
<template 3> := 请问谁过生日 (May I ask, who is having a birthday)
Preferably, the plurality of templates may be shared and merged by directed-graph compression in the following manner to obtain one or more compressed templates:
First, the terminal device collects a plurality of sample data items and splits each sample data item into individual characters.
The terminal device collects template 1, template 2 and template 3 and splits each of them character by character. Template 1 is split into: 请-问-鲁-迅-生-日; template 2 is split into: 请-问-杨-过-生-日; template 3 is split into: 请-问-谁-过-生-日.
Then, the terminal device arranges the characters obtained from the split in graph-structure form, according to the semantic order of each sample data item, to obtain the plurality of templates. The data structure of the plurality of templates is a graph structure.
Preferably, the graph structure is an FST structure. Template 1, template 2 and template 3, after being split by character, are arranged in graph-structure form in their original semantic order, giving the FST-structured templates shown in Fig. 3, Fig. 4 and Fig. 5 respectively.
Finally, the terminal device shares and merges the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates.
Sharing and merging the templates that conform to a context-free grammar according to the shareable-substructure pattern of the graph structure may mean, for example, that in Fig. 3, Fig. 4 and Fig. 5 the characters "请, 请, 请" at node 0 are merged, the characters "问, 问, 问" at node 1 are merged, and the characters "鲁, 杨, 谁" at node 2 are merged. The other positions are merged in the same way as nodes 0, 1 and 2.
In this embodiment, the terminal device shares and merges the identical characters and/or differing characters in the plurality of templates that conform to a context-free grammar, according to the shareable-substructure pattern of the graph structure, to obtain the one or more compressed templates. The data structure of the compressed template is a compressed directed graph. Preferably, the compressed directed graph may be an FST structure. Wherein,
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar are identical, the identical character is merged in shared form;
when the characters at the same position of the respective graph structures of the templates that conform to a context-free grammar differ, the differing characters are retained in split (branching) form.
For example, referring to Fig. 3, Fig. 4 and Fig. 5, the terminal device processes the FST structures of template 1, template 2 and template 3 from left to right ("left" and "right" here refer to the orientation of the figures) and shares and merges, character by character, the templates that conform to a context-free grammar, to obtain one compressed template. That is, identical characters (such as 请, 问, 过, 生 and 日) are shared and merged, while differing characters are retained as branches in the manner of a decision tree (for example, the three characters 谁, 杨 and 鲁 are retained as three branches). This yields the single character-merged, FST-structured compressed template shown in Fig. 6.
As can be seen, in this embodiment, character-based FST merging combines a plurality of templates that conform to a context-free grammar into one large FST; determinization and minimization then let the shareable parts of the templates be shared fully, which greatly reduces space and computation costs and effectively improves the space and time efficiency of template matching. Moreover, what this embodiment implements is a character-based FST sharing method. Although the number of words in Chinese can grow without limit, the set of Chinese characters is relatively closed (the number of characters stays essentially unchanged over long periods). In other words, even when new words appear, any new word can be split into existing Chinese characters, so the templates can still be shared and merged. The character-based FST sharing method not only removes the dependence on a word segmentation system and simplifies the handling of out-of-vocabulary words, but also enables further sharing between templates and thus further compresses the search space.
Note that when templates are restructured, sharing and merging may also be performed in units of words. However, new words appear quickly, so if FST restructuring were word-based, the matching search would be constrained by the word segmentation system during template matching, which would greatly limit template sharing and cause efficiency problems. In this embodiment, FST restructuring is character-based, which effectively increases path sharing, reduces the size of the FST, and thereby improves search matching efficiency.
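The character-based sharing and merging described above can be sketched as follows, under stated assumptions: a prefix tree stands in for determinization, and a bottom-up merge of states with identical outgoing structure stands in for minimization. This is an illustrative acyclic-automaton construction, not the procedure of the application itself; the function names and the three template strings (an assumed reading of templates 1 to 3) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_trie(templates: List[str]) -> Tuple[List[Dict[str, int]], Set[int]]:
    """Share common prefixes: trans[state] maps a character to the next state."""
    trans: List[Dict[str, int]] = [{}]
    final: Set[int] = set()
    for tpl in templates:
        state = 0
        for ch in tpl:
            if ch not in trans[state]:
                trans.append({})
                trans[state][ch] = len(trans) - 1
            state = trans[state][ch]
        final.add(state)
    return trans, final


def minimize(trans: List[Dict[str, int]], final: Set[int]):
    """Share common suffixes by merging states whose outgoing structure is identical."""
    registry: Dict[tuple, int] = {}

    def visit(state: int) -> int:
        # Children are canonicalized first, so equal signatures mean equal suffix sets.
        sig = (state in final,
               tuple(sorted((ch, visit(nxt)) for ch, nxt in trans[state].items())))
        if sig not in registry:
            registry[sig] = len(registry)
        return registry[sig]

    start = visit(0)
    new_trans = {sid: dict(sig[1]) for sig, sid in registry.items()}
    new_final = {sid for sig, sid in registry.items() if sig[0]}
    return start, new_trans, new_final


def accepts(start: int, trans: Dict[int, Dict[str, int]], final: Set[int], text: str) -> bool:
    """Exact membership test against the compressed graph."""
    state = start
    for ch in text:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return state in final


templates = ["请问鲁迅生日", "请问杨过生日", "请问谁过生日"]  # assumed reading of templates 1-3
trie, trie_final = build_trie(templates)
start, graph, graph_final = minimize(trie, trie_final)
```

Three separate chain FSTs need 21 states in total; the prefix tree built here has 15, and after suffix sharing the compressed graph has 8, while it still accepts exactly the three original templates.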
In this embodiment, step 204 above may specifically comprise the following sub-steps:
Sub-step 2042, the terminal device splits the query string into individual characters and arranges the characters obtained from the split in graph-structure form.
Sub-step 2044, the terminal device obtains a query set corresponding to the arranged query string, and a plurality of compressed-template sets corresponding to the plurality of compressed templates.
Sub-step 2046, the terminal device calculates the matching path between the query set and each of the compressed-template sets.
Preferably, the terminal device may calculate the matching path between the query set and the compressed template sets according to the following procedure:
First step: the terminal device defines a Token, which corresponds to a tuple v(i, j, h, s), where i and j are the states of v in set I and set J respectively, h is the history path that v has traversed in set I and set J, and s is the matching distance of that history path. Set I is the set corresponding to a compressed template, and set J is the set corresponding to the query string.
In this embodiment, the Token may be a data structure that stores the search state; for example, a Token may store data such as the history matching path and the history matching distance.
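The Token described above can be pictured as a small record. The following is a minimal sketch, assuming a Python dataclass with the four fields i, j, h and s; the helper names `merge` and `top` are hypothetical and only illustrate the kind of bookkeeping a Token list needs (keeping the cheapest Token per state pair, and selecting the n cheapest Tokens).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    i: int      # current state in set I (compressed template)
    j: int      # current state in set J (query string)
    h: tuple    # history path: the edge pairs traversed so far
    s: float    # matching distance accumulated along h


def merge(tokens):
    """Keep only the Token with the smallest matching distance per state pair (i, j)."""
    best = {}
    for token in tokens:
        key = (token.i, token.j)
        if key not in best or token.s < best[key].s:
            best[key] = token
    return list(best.values())


def top(tokens, n):
    """Return the n Tokens with the smallest matching distance."""
    return sorted(tokens, key=lambda token: token.s)[:n]


active = [Token(1, 1, (), 2.0), Token(1, 1, (), 1.0), Token(2, 1, (), 5.0)]
merged = merge(active)
print(len(merged), top(merged, 1)[0].s)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: extending a path then always creates a new Token, so histories are never shared by accident.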
Second step: the terminal device adds a self-loop edge to each state of set I and set J.
In this embodiment, the set corresponding to template 1 shown in Fig. 3 can be represented as set I; that is, "0, 1, 2, 3, 4, 5, 6" in Fig. 3 correspond to the states of set I. Likewise, the states of set J can be denoted "0, 1, 2, 3, 4, 5, 6".
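As an illustration of sub-step 2042 and of the second step, the sketch below splits a query string into characters, chains them as states 0..n, and then adds an `<eps>` self-loop to every state. The function names, the edge layout (a dictionary from state to `(input_character, next_state)` pairs) and the sample string are assumptions made for this example, not details given in the application.

```python
EPS = "<eps>"  # empty input symbol, written as in the text


def linear_fst(query):
    """Split the query into characters and chain them as states 0..len(query).

    Returns (edges, start_state, final_state); edges maps each state to a list
    of (input_character, next_state) pairs.
    """
    edges = {state: [] for state in range(len(query) + 1)}
    for position, character in enumerate(query):
        edges[position].append((character, position + 1))
    return edges, 0, len(query)


def add_self_loops(edges):
    """Return a copy in which every state also has an <eps> edge back to itself."""
    return {state: out + [(EPS, state)] for state, out in edges.items()}


edges, start, final = linear_fst("请问谁过生日")  # six characters, so states 0..6
looped = add_self_loops(edges)
print(sorted(looped))
print(looped[0])
```

A six-character string yields the seven states 0 to 6, the same numbering the text uses for Fig. 3.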
Third step: the terminal device performs a graph expansion search on set I and set J after the self-loop edges have been added, obtaining the accumulated search history and matching distance, and obtains the distance metric.
In this embodiment, the distance metric D(w1, w2) describes the distance between character (or word) w1 and character (or word) w2. In a traditional exact matching algorithm, D(w1, w2) is defined as follows:
$$D(w_1, w_2) = \begin{cases} 0, & w_1 = w_2 \\ \infty, & w_1 \neq w_2 \end{cases}$$
That is, when two characters (or two words) w1 and w2 are identical, i.e. w1 == w2, the distance metric D between them is 0. When w1 and w2 differ (for words, when any character of w1 differs from w2), i.e. w1 ≠ w2, the distance metric between them is infinite.
With a traditional exact matching algorithm, a match succeeds only when the two characters are identical and fails when they differ. When the input is text, exact matching yields high template accuracy. When the input is speech, however, the recognition result often contains random errors, because the conversion from speech signal to text is affected by factors such as accent and noise, and an exact matching algorithm rarely finds a match in the FST templates. For example, the sentence "whom may I ask celebrates a birthday" may well be recognized by the speech recognition system as "may I ask whose birthday": although most characters are recognized correctly, the mismatch of a single character causes the whole search to fail. Such minor errors occur in almost every sentence in speech recognition, so a traditional FST search almost never returns a result.
In this embodiment, a certain degree of mismatch between the search string and the template is allowed during the search; in particular, pronunciation similarity is taken into account when computing the matching measure. Preferably, therefore, the distance metric D can be defined as an edit distance, as follows:
$$D(w_1, w_2) = \begin{cases} 0, & w_1 = w_2 \\ \text{cs}, & w_1 \neq w_2;\ w_1, w_2 \neq \text{<eps>} \\ \text{cd}, & w_1 = \text{<eps>},\ w_2 \neq \text{<eps>} \\ \text{ci}, & w_1 \neq \text{<eps>},\ w_2 = \text{<eps>} \end{cases}$$
Here cs, cd and ci denote the values of the distance metric when a substitution error, a deletion error and an insertion error, respectively, occurs between character w1 and character w2 (or between word w1 and word w2). These values can be chosen according to the actual situation, for example set to 12 or to another score suited to the application scenario. To keep the algorithm simple, edges whose input is empty (<eps>) are introduced, so that the different error types are all handled within the calculation of the distance metric D.
Taking characters w1 and w2 as an example:
When w1 and w2 are identical, i.e. w1 == w2, the distance metric D between them is 0.
When w1 and w2 differ and neither is empty, i.e. w1 ≠ w2 and w1, w2 ≠ <eps>, a substitution error has occurred between w1 and w2, and D takes the value cs.
When w1 is empty and w2 is not, i.e. w1 == <eps> and w2 ≠ <eps>, a deletion error has occurred between w1 and w2, and D takes the value cd.
When w1 is not empty and w2 is empty, i.e. w1 ≠ <eps> and w2 == <eps>, an insertion error has occurred between w1 and w2, and D takes the value ci.
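The four cases above translate directly into a small function. This is a minimal sketch; the function name and the default cost values are assumptions for illustration, since the text leaves cs, cd and ci to be chosen per application.

```python
EPS = "<eps>"  # empty input symbol, written as in the text


def edit_distance_metric(w1, w2, cs=1.0, cd=1.0, ci=1.0):
    """Distance metric D(w1, w2) with substitution, deletion and insertion costs."""
    if w1 == w2:
        return 0.0   # identical symbols
    if w1 == EPS:
        return cd    # w1 empty, w2 not empty: deletion error
    if w2 == EPS:
        return ci    # w1 not empty, w2 empty: insertion error
    return cs        # both non-empty and different: substitution error


print(edit_distance_metric("过", "过"))
print(edit_distance_metric("过", "果", cs=3.0))
print(edit_distance_metric(EPS, "过", cd=2.0))
print(edit_distance_metric("过", EPS, ci=1.5))
```

Because the empty symbol is handled inside the metric, a search routine can treat all three error types as ordinary edge pairs, which is the simplification the text attributes to the <eps> edges.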
As can be seen, this embodiment allows the distance metric D to be computed flexibly, which improves the accuracy of template matching. For example, TF-IDF (Term Frequency-Inverse Document Frequency, a weighting technique commonly used in information retrieval and data mining) can be incorporated into the calculation of D so that important domain-specific words receive a higher weight; or, based on the result of syntactic analysis, the weight of the main characters can be increased so that matching focuses on whether the key words (or characters) match. Importantly, these weights can be folded into the edge weights of the FST, so the search procedure itself needs no change. This embodiment exploits the flexibility of D to compensate for the specific error patterns of speech input. It introduces fuzzy matching into the FST search, allows a certain number of insertion, deletion and substitution errors, and uses a pronunciation confusion matrix to re-estimate the weights of matching errors, which addresses the fault-tolerance problem of the voice question-answering system. The fuzzy search method based on pronunciation similarity not only strengthens the fault tolerance of the system but also compensates for the specific error patterns of speech recognition, which helps the system understand the user's input and improves overall performance.
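One way to picture the weighting idea is shown below. It is a simplified sketch, not the application's method: it uses only the inverse-document-frequency half of TF-IDF, works on characters rather than words, and scales the substitution cost by `1 + weight`. The function names, the scaling rule and the three sample templates are all assumptions made for illustration.

```python
import math


def character_weights(templates):
    """IDF-style weight per character: log(N / number of templates containing it)."""
    total = len(templates)
    document_frequency = {}
    for template in templates:
        for character in set(template):
            document_frequency[character] = document_frequency.get(character, 0) + 1
    return {ch: math.log(total / df) for ch, df in document_frequency.items()}


def weighted_substitution_cost(template_character, weights, base_cost=1.0):
    """Scale the base cost so that rarer, more informative characters cost more to miss."""
    return base_cost * (1.0 + weights.get(template_character, 0.0))


templates = ["今天天气", "今天日期", "明天天气"]  # made-up sample templates
weights = character_weights(templates)
print(weights["天"], weights["日"])
print(weighted_substitution_cost("日", weights))
```

Since the result is just a number per template character, it could be stored as the weight of the corresponding FST edge ahead of time, which is consistent with the remark that the search procedure needs no change.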
Fourth step: the terminal device sums the accumulated search history, the matching distance and the distance metric to obtain the matching path.
In this embodiment, the goal of the FST fuzzy matching algorithm is to find the matching path with the smallest difference between I and J. The basic idea is to decompose the global path matching task into partial path matching tasks, store the result of each partial match in a data structure called a Token, and use Tokens to expand and match paths in the two FSTs I and J. Expansion and matching continue until some Token reaches a final state of both FSTs; the matching path recorded by that Token is the best matching result.
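The Token idea can be sketched as a best-first search over pairs of states. Note the substitution: this sketch uses a priority queue (Dijkstra-style search) rather than the Token list with merging and pruning that the application describes, and both FSTs are simple linear chains. The function names, unit costs and sample strings (including the misrecognized variant) are assumptions for illustration only.

```python
import heapq
from itertools import count

EPS = "<eps>"


def distance(w1, w2, cs=1.0, cd=1.0, ci=1.0):
    """Edit-distance style metric D(w1, w2); unit costs are assumed."""
    if w1 == w2:
        return 0.0
    if w1 == EPS:
        return cd
    if w2 == EPS:
        return ci
    return cs


def linear_fst(text):
    """Chain the characters of text as states 0..n, with an <eps> self-loop per state."""
    edges = {k: [(EPS, k)] for k in range(len(text) + 1)}
    for k, ch in enumerate(text):
        edges[k].append((ch, k + 1))
    return edges, {0}, {len(text)}


def fuzzy_match(fst_i, fst_j):
    """Return (matching distance s, history h) of the cheapest joint path through I and J."""
    edges_i, starts_i, finals_i = fst_i
    edges_j, starts_j, finals_j = fst_j
    order = count()  # tie-breaker so the heap never has to compare histories
    tokens = [(0.0, next(order), i, j, ()) for i in starts_i for j in starts_j]
    heapq.heapify(tokens)
    settled = set()
    while tokens:
        s, _, i, j, h = heapq.heappop(tokens)
        if (i, j) in settled:
            continue  # a Token at least as cheap already reached this state pair
        settled.add((i, j))
        if i in finals_i and j in finals_j:
            return s, list(h)
        for w1, next_i in edges_i[i]:
            for w2, next_j in edges_j[j]:
                if w1 == EPS and w2 == EPS:
                    continue  # both FSTs idle: no progress, skip
                step = distance(w1, w2)
                heapq.heappush(
                    tokens, (s + step, next(order), next_i, next_j, h + ((w1, w2),))
                )
    return float("inf"), []


template = linear_fst("请问谁过生日")
exact = fuzzy_match(template, linear_fst("请问谁过生日"))
one_wrong = fuzzy_match(template, linear_fst("请问谁果生日"))  # hypothetical misrecognition
one_missing = fuzzy_match(template, linear_fst("请问谁生日"))
print(exact[0], one_wrong[0], one_missing[0])
```

The self-loops are what let one FST stay in place while the other advances, so insertions and deletions appear as edge pairs in which one side is `<eps>`.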
Sub-step 2048: from the compressed template sets, the terminal device obtains the matching path with the smallest distance to the query set.
Sub-step 20410: the terminal device determines the template indicated by that minimal path to be the first template.
Step 206: the terminal device queries the knowledge base for the response information corresponding to the first template.
Step 208: the terminal device outputs the response information as speech and/or text.
In summary, in the information query method of this embodiment, the query string is matched against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string. Because the templates in the template set are shared and reconstructed by directed-graph compression, a large number of different templates are merged into one or a small number of compressed templates. This simplifies the logical operations performed during search and matching, improves search efficiency, simplifies the search algorithm, and addresses the inefficiency and complexity of large-scale template search. It also reduces the processing load on devices and systems.
Further, the user enters a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
Embodiment three
Building on the embodiments above, the steps of an information query method in this embodiment are described in detail below, using as an example an intelligent voice question-answering system based on fuzzy matching with Chinese-character finite state transducers (FSTs).
Fig. 7 shows a flow chart of the steps of an information query method in Embodiment three of the present application. The information query method specifically comprises:
Step 702: the voice question-answering system builds the character-based FST shared template set, obtaining the system template library.
In this embodiment, the information query method may be implemented by an intelligent voice question-answering system. Fig. 8 shows a schematic diagram of the system architecture of an intelligent voice question-answering system in Embodiment three of the present application.
Preferably, the voice question-answering system 800 that implements the information query method of this embodiment may comprise a speech recognition subsystem 802 and a question-answering subsystem 804. The question-answering subsystem 804 may comprise a template matching module 8042 and an answer generation module 8044, and may further comprise a system template library 8046 and a knowledge base 8048. The user speaks a question, which is passed as input to the speech recognition subsystem 802; the speech recognition subsystem 802 converts the speech input into text and passes it to the question-answering subsystem 804. Once the question-answering subsystem 804 has the text input, the template matching module 8042 matches it against the templates in the system template library 8046 to obtain the closest matching template; the answer generation module 8044 then searches the knowledge base 8048 for the answer to the question and outputs it.
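The data flow between the subsystems can be sketched as three pluggable stages. Everything below is a stand-in: the function names, the dictionary used as knowledge base, and the fake recognizer and matcher are hypothetical and only mirror the order of operations described for Fig. 8.

```python
def answer_question(speech, recognize, match_template, knowledge_base):
    """Speech input -> text -> closest template -> answer looked up in the knowledge base."""
    query_string = recognize(speech)          # role of speech recognition subsystem 802
    template = match_template(query_string)   # role of template matching module 8042
    return knowledge_base.get(template)       # role of answer generation module 8044


# Stand-in components, for illustration only.
knowledge_base = {"请问谁过生日": "answer stored for the birthday template"}


def fake_recognize(speech):
    return speech["transcript"]


def fake_match(query_string):
    return "请问谁过生日"  # pretend this is the closest template


print(answer_question({"transcript": "请问谁果生日"}, fake_recognize, fake_match, knowledge_base))
```

Passing the stages in as functions keeps the sketch independent of any particular recognizer or matcher.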
In this embodiment, before carrying out automatic question answering, the voice question-answering system needs to build the character-based FST shared template set, i.e. the system template library.
Preferably, multiple sample templates are first converted into FST templates, i.e. templates in FST structure. The converted FST templates are then divided into groups according to context-free grammar. Within each group, the FST templates that conform to the context-free grammar are split into characters and shared: if the characters at the same position are identical, the FST templates share that character; if the characters at the same position differ, the different characters are merged and compressed in the form of decision-tree branching. After identical characters have been shared and/or different characters merged, the result is a single character-based FST shared template (the shared template) that contains all the FST templates of the group conforming to the context-free grammar. Each group may correspond to, but is not limited to, one shared template. The shared templates of all groups are stored in the system template library, yielding the shared template set.
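The share-or-branch rule can be illustrated with a character-level prefix tree. This is a simplified stand-in, assuming nested dictionaries as the data structure: a prefix tree shares only common beginnings, whereas a determinized and minimized FST would also share common endings. The function names and sample templates are made up for the example.

```python
END = "<end>"  # marker key for a complete template; cannot collide with a single character


def build_shared_template(templates):
    """Merge templates character by character into one prefix tree."""
    root = {}
    for text in templates:
        node = root
        for character in text:
            # Same character at the same position: reuse the existing branch.
            # Different character: setdefault opens a new branch.
            node = node.setdefault(character, {})
        node[END] = text
    return root


def count_edges(node):
    """Number of character edges in the tree (the END markers are not counted)."""
    return sum(1 + count_edges(child) for key, child in node.items() if key != END)


templates = ["今天天气", "今天日期", "明天天气"]  # made-up sample templates
shared = build_shared_template(templates)
print(sum(len(t) for t in templates), count_edges(shared))
```

The two printed numbers compare the edge count without sharing to the edge count after sharing, which gives a rough feel for the space saving the text describes.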
Step 704: the voice question-answering system receives the user's speech input and recognizes the input speech signal as text output.
In this embodiment, the speech recognition subsystem of the voice question-answering system recognizes the input speech signal as text output; the output text is the query string.
Step 706: the voice question-answering system performs fuzzy matching between the query string and the shared template set in the system template library, obtaining the matching template.
In this embodiment, an FST fuzzy matching algorithm may be used to perform fuzzy matching between the query string and the shared template set in the system template library. Preferably, the question-answering subsystem of the voice question-answering system defines the template set as I, converts the query string into FST structure as well, and defines it as J. The goal of the FST fuzzy matching algorithm is to find the matching path with the smallest difference between I and J. The basic idea is to decompose the global path matching task into partial path matching tasks, store the result of each partial match in a data structure called a Token, and use Tokens to expand and match paths in the two FSTs I and J. Expansion and matching continue until some Token reaches a final state of both FSTs; the matching path recorded by that Token is the best matching result.
When the voice question-answering system performs fuzzy matching between the query string and the shared template set in the system template library, the FST fuzzy matching algorithm may specifically be as follows:
Definition:
T: the list of currently active Tokens
T.top(n): the set of the n Tokens in list T with the smallest matching distance
Merge(T): merge the Tokens in T; when several Tokens have the same state pair (i, j), keep the one with the smallest matching distance
Prune(T): prune the Tokens in T
M.S: the set of initial states of FST M
M.E: the set of final states of FST M
E(m, M): the list of edges leaving state m in FST M
St(m, e, M): the state reached from state m via edge e in FST M
e.w: the input character carried by edge e
e.c: the weight c of edge e
Eh(h, (e1, e2)): append the edge pair (e1, e2) to history h
D(w1, w2): the distance metric between two Chinese characters w1 and w2
<eps>: the empty input character
Algorithm:
Initialization:
(1) Add a self-loop edge to each state of I and J; its input and output characters are both <eps>
(2) for each (i ∈ I.S && j ∈ J.S):
T = T ∪ {v(i, j, {}, 0)}
Search procedure:
Preferably, in the voice question-answering system, although the output of speech recognition contains significant random errors, these errors are not arbitrary: in the vast majority of cases the erroneous pronunciation is similar to the correct one. This similarity can be used to compute the distance metric D in the standard FST fuzzy matching algorithm, assigning a smaller matching distance to mismatches with similar pronunciation and a larger matching distance to mismatches whose pronunciations differ more.
In this embodiment, the pronunciation similarity between characters or words is computed from the pronunciation similarity between phonemes. The phoneme is the smallest unit of speech, and each phoneme has its own pronunciation characteristics. The number of phonemes is also small: as shown in Table 1, standard Mandarin Chinese has 35 phonemes, which keeps the computation simple.
Table 1
There are obvious similarities between phonemes, and they can be represented by a confusion matrix M, whose (i, j)-th element M(i, j) gives the degree of confusion between phonemes i and j. To better reflect how the speech recognition system confuses similar pronunciations, the speech recognition result is expressed as a phoneme string and compared with the phoneme string of the standard pronunciation, which gives the probability that one phoneme is recognized as another, expressed as:
M(i,j)=P(i|j)=C(i|j)/C(j)
Here C(j) is the number of occurrences of phoneme j in the standard pronunciation, C(i|j) is the number of times phoneme j of the standard pronunciation is recognized as phoneme i in the speech recognition result, and P(i|j) is the probability that phoneme j is recognized as phoneme i. Based on M(i, j), the distance metric D(w1, w2) of two characters or words (w1, w2) can be computed. Let the phoneme string of w1 be x and the phoneme string of w2 be y; then
$$D(w_1, w_2) = \min_{r} \sum_{k} M\left(x_r^k, y_r^k\right)$$
Here r is an alignment of x and y, and $x_r^k$ and $y_r^k$ are the k-th phonemes of x and y under alignment r (empty phonemes are allowed). For each alignment r, the confusion-matrix terms of the aligned phoneme pairs are summed over k; the minimum of these sums over all alignments is the value of D(w1, w2), which is then used in the algorithm described above. The alignment r may be, but is not limited to, one in which phoneme string x and phoneme string y are aligned at the head and tail. For example:
When phoneme strings x and y contain the same number of units, e.g. x is t1t2t3 and y is T1T2T3, where t1, t2, t3, T1, T2 and T3 may each be a vowel unit or a consonant unit, the alignment is determined as follows: first the head and tail are aligned, i.e. t1 with T1 and t3 with T3; then t2 is aligned with T2.
When phoneme strings x and y contain different numbers of units, e.g. x is t4t5t6t8 and y is T4T5T6T7T8, where t4, t5, t6, t8, T4, T5, T6, T7 and T8 may each be a vowel unit or a consonant unit, the alignment is determined as follows: first the head and tail are aligned, i.e. t4 with T4 and t8 with T8. Then t5t6 and T5T6T7 can be aligned in several ways: if t5 is aligned with T5, then t6 is aligned with T6T7; if t5 is aligned with T5T6, then t6 is aligned with T7. The distance metric of each alignment is computed with the formula above, and the smallest value is taken as the final result.
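The two formulas above, M(i, j) = C(i|j)/C(j) and the minimum over alignments, can be sketched as follows. Several points are assumptions of this sketch rather than statements of the application: the minimum is found by standard dynamic programming with empty phonemes as gaps (not restricted to head-and-tail alignments), and the per-pair term is taken to be a cost derived from the matrix (0 for identical phonemes, a fixed gap cost, otherwise 1 − M), because the text does not say how confusion probabilities are turned into distances. All names and sample data are hypothetical.

```python
from collections import Counter

EMPTY = ""  # empty phoneme


def confusion_matrix(aligned_pairs):
    """Estimate M(i, j) = C(i|j) / C(j) from (recognized i, standard j) phoneme pairs."""
    joint, standard = Counter(), Counter()
    for i, j in aligned_pairs:
        joint[(i, j)] += 1
        standard[j] += 1
    return {(i, j): n / standard[j] for (i, j), n in joint.items()}


def make_term(matrix, gap=0.5):
    """Per-pair cost built from the matrix; this mapping is an assumption of the sketch."""
    def term(a, b):
        if a == b:
            return 0.0
        if a == EMPTY or b == EMPTY:
            return gap
        return 1.0 - matrix.get((a, b), 0.0)
    return term


def align_distance(x, y, term):
    """Minimum over alignments r of the sum over k of term(x_r^k, y_r^k)."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for a in range(1, rows):
        d[a][0] = d[a - 1][0] + term(x[a - 1], EMPTY)
    for b in range(1, cols):
        d[0][b] = d[0][b - 1] + term(EMPTY, y[b - 1])
    for a in range(1, rows):
        for b in range(1, cols):
            d[a][b] = min(
                d[a - 1][b - 1] + term(x[a - 1], y[b - 1]),  # pair two phonemes
                d[a - 1][b] + term(x[a - 1], EMPTY),         # pair x's phoneme with empty
                d[a][b - 1] + term(EMPTY, y[b - 1]),         # pair empty with y's phoneme
            )
    return d[-1][-1]


# Made-up counts: standard "n" was heard as "l" once in three occurrences.
pairs = [("n", "n"), ("l", "n"), ("n", "n"), ("l", "l")]
matrix = confusion_matrix(pairs)
term = make_term(matrix)
print(matrix[("l", "n")])
print(align_distance(["l", "an"], ["n", "an"], term))
```

With this cost mapping, a frequently confused pair such as "l" and "n" yields a smaller distance than an unrelated pair, matching the intent that similar-sounding mismatches receive a smaller matching distance.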
As can be seen, the fuzzy search method based on pronunciation similarity not only strengthens the fault tolerance of the system but also compensates for the specific error patterns of speech recognition, which helps the system understand the user's input and improves overall performance.
Step 708: the voice question-answering system searches the knowledge base for the answer corresponding to the matching template obtained.
In this embodiment, the question-answering subsystem of the voice question-answering system obtains the answer corresponding to the matching template by searching the knowledge base. The knowledge base is built in correspondence with the system template library and stores the answers to all the template questions in the system template library.
In summary, in the information query method of this embodiment, the query string is matched against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string. Because the templates in the template set are shared and reconstructed by directed-graph compression, a large number of different templates are merged into one or a small number of compressed templates. This simplifies the logical operations performed during search and matching, improves search efficiency, simplifies the search algorithm, and addresses the inefficiency and complexity of large-scale template search. It also reduces the processing load on devices and systems.
Further, the user enters a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
It should be noted that, for simplicity of description, the method embodiments above are each presented as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the order of actions described, because according to the application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the application. In the method embodiments above, the graph structure is the FST structure, and directed-graph compression means that templates in FST structure are compressed and shared based on the FST structure.
Embodiment four
Based on the description of the method embodiments above, the present application also provides a corresponding information query system embodiment that implements the content of those method embodiments.
Fig. 9 shows a structural block diagram of an information query system in Embodiment four of the present application. In this embodiment, the information query system comprises:
Speech recognition module 902, configured to recognize the input speech signal as text output, obtaining the query string.
Template matching module 904, configured to match the query string against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string.
The template set comprises multiple templates, which are shared and merged by directed-graph compression to obtain one or more compressed templates.
Answer generation module 906, configured to query the knowledge base for the response information corresponding to the first template.
Output module 908, configured to output the response information as speech and/or text.
In summary, in the information query system of this embodiment, the query string is matched against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string. Because the templates in the template set are shared and reconstructed by directed-graph compression, a large number of different templates are merged into one or a small number of compressed templates. This simplifies the logical operations performed during search and matching, improves search efficiency, simplifies the search algorithm, and addresses the inefficiency and complexity of large-scale template search. It also reduces the processing load on devices and systems.
Further, the user enters a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
Embodiment five
Fig. 10 shows a structural block diagram of an information query system in Embodiment five of the present application. In this embodiment, the information query system comprises:
Speech recognition module 1002, configured to recognize the input speech signal as text output, obtaining the query string.
Template matching module 1004, configured to match the query string against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string.
The template set comprises multiple templates, which are shared and merged by directed-graph compression to obtain one or more compressed templates. Preferably, the templates are shared and merged by directed-graph compression through the following modules:
Acquisition module, configured to collect multiple sample data items and split them into individual characters;
Arrangement module, configured to arrange the characters obtained from the split in graph-structure form according to the semantic order of each sample data item, obtaining the templates; the data structure of the templates is a graph structure;
Template acquisition module, configured to share and merge the templates that conform to the context-free grammar according to the shareable substructure pattern of the graph structure, obtaining the one or more compressed templates.
In this embodiment, the template acquisition module is specifically configured to share and merge, according to the shareable substructure pattern of the graph structure, the identical and/or different characters in the templates that conform to the context-free grammar, obtaining the one or more compressed templates; the data structure of a compressed template is a compressed directed graph;
Specifically,
when the characters at the same position of the graph structures of the templates conforming to the context-free grammar are identical, the identical character is merged in shared form;
when the characters at the same position of the graph structures of the templates conforming to the context-free grammar differ, the different characters are retained in split (branching) form.
Preferably, in this embodiment, the template matching module 1004 may comprise:
Data splitting module 10042, configured to split the query string into individual characters and arrange the resulting characters in graph-structure form.
Set acquisition module 10044, configured to obtain the query set corresponding to the arranged query string, and the compressed template sets corresponding to the compressed templates.
Computing module 10046, configured to calculate the matching path between the query set and each of the compressed template sets.
In this embodiment, the computing module 10046 may comprise:
Definition module, configured to define a Token, which corresponds to a tuple v(i, j, h, s).
Preferably, i and j are the states of v in set I and set J respectively; h is the history path that v has traversed in set I and set J, and s is the matching distance of that history path. Set I is the set corresponding to a compressed template, and set J is the set corresponding to the query string.
Adding module, configured to add a self-loop edge to each state of set I and set J.
Path calculation module, configured to perform a graph expansion search on set I and set J after the self-loop edges have been added, obtaining the accumulated search history and matching distance; to obtain the distance metric; and to sum the accumulated search history, the matching distance and the distance metric to obtain the matching path.
Preferably, the distance metric comprises D(w1, w2), where D(w1, w2) indicates the distance metric between character w1 and character w2;
Obtaining the distance metric comprises using the following formula:
$$D(w_1, w_2) = \min_{r} \sum_{k} M\left(x_r^k, y_r^k\right);$$
Preferably, x is the phoneme string of character w1; y is the phoneme string of character w2; r is an alignment of x and y; $x_r^k$ and $y_r^k$ are the k-th phonemes of x and y under alignment r; $M(x_r^k, y_r^k)$ denotes the confusion-matrix entry for the k-th phonemes of x and y under alignment r.
The alignment comprises: aligning phoneme string x and phoneme string y at the head and tail.
Path acquisition module 10048, configured to obtain, from the compressed template sets, the matching path with the smallest distance to the query set.
Determination module 100410, configured to determine the template indicated by that minimal path to be the first template.
Answer generation module 1006, configured to query the knowledge base for the response information corresponding to the first template.
Output module 1008, configured to output the response information as speech and/or text.
In summary, in the information query system of this embodiment, the query string is matched against the compressed templates of the template set according to a set matching rule, obtaining, among the compressed templates, a first template that matches the query string. Because the templates in the template set are shared and reconstructed by directed-graph compression, a large number of different templates are merged into one or a small number of compressed templates. This simplifies the logical operations performed during search and matching, improves search efficiency, simplifies the search algorithm, and addresses the inefficiency and complexity of large-scale template search. It also reduces the processing load on devices and systems.
Further, the user enters a question orally; after speech recognition and template matching, the matching result is passed to the knowledge base, from which the corresponding response information is retrieved and output. This reduces user operations, improves the user experience, and broadens the user base.
Since the system embodiments above are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding parts of the method embodiments.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another.
Those skilled in the art will readily appreciate that any combination of the embodiments above is feasible, so any such combination is also an embodiment of the present application; for reasons of space, this specification does not describe them one by one.
The information query method and system provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the application, and the description of the embodiments above is intended only to help in understanding the method of the application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the application. In summary, the content of this specification should not be construed as limiting the application.

Claims (10)

1. An information query method, characterized by comprising:
Recognizing an input voice signal as text and outputting the text, to obtain a character string to be queried;
Matching the character string to be queried against each of a plurality of compressed templates under a template set according to a set matching rule, to obtain a first template, the first template being the template in the compressed templates that matches the character string to be queried; wherein the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates;
Querying a knowledge base to obtain the response message corresponding to the first template;
Outputting the response message by voice and/or text.
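The four steps of claim 1 can be read as a simple pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the function names `answer_query`, `demo_recognize` and `demo_match` are hypothetical, the "recognizer" merely decodes bytes to text, and the matcher is a crude position-wise comparison standing in for the compressed-template matching described in the later claims.

```python
from typing import Callable, Dict, List


def answer_query(
    voice_signal: bytes,
    recognize: Callable[[bytes], str],
    match_template: Callable[[str, List[str]], str],
    templates: List[str],
    knowledge_base: Dict[str, str],
) -> str:
    """Run the four steps: recognize, match, look up, return the response."""
    query = recognize(voice_signal)                    # step 1: speech -> text
    first_template = match_template(query, templates)  # step 2: best template
    return knowledge_base[first_template]              # steps 3-4: response


def demo_recognize(voice_signal: bytes) -> str:
    # Stand-in for a real speech recognizer (assumption): treat the
    # "signal" as UTF-8 text.
    return voice_signal.decode("utf-8")


def demo_match(query: str, templates: List[str]) -> str:
    # Stand-in matcher (assumption): pick the template sharing the most
    # identical characters at identical positions with the query.
    def shared(template: str) -> int:
        return sum(1 for a, b in zip(query, template) if a == b)

    return max(templates, key=shared)
```

Passing the recognizer and matcher in as callables keeps the sketch independent of any particular speech engine or matching rule.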
2. the method for claim 1, is characterized in that, described a plurality of templates are shared merging according to digraph compression in the following manner, obtain one or more compression templates:
Gather a plurality of sample datas, described a plurality of sample data Yi Ziwei unit is carried out to Data Division;
According to described a plurality of sample datas semantic sequence separately, the word obtaining after splitting is arranged by graph structure form, obtain described a plurality of template; Wherein, the data structure of described a plurality of templates is graph structure;
According to graph structure can shared Sub tactic pattern, to meeting a plurality of templates of context-free grammar, share merging respectively, obtain described one or more compression template.
3. method as claimed in claim 2, is characterized in that, described according to graph structure can shared Sub tactic pattern, to meeting a plurality of templates of context-free grammar, share merging respectively, obtain described one or more compression template, comprising:
According to graph structure can shared Sub tactic pattern, respectively the same word and/or the different word that meet in a plurality of templates of context-free grammar are shared to merging, obtain described one or more compression template; Wherein, the data structure of described compression template is digraph compression;
Wherein,
When the described a plurality of templates that meet context-free grammar, when word at the same position place of each self-corresponding graph structure is identical, to share form, merge same word;
When the described a plurality of templates that meet context-free grammar, when word at the same position place of each self-corresponding graph structure is different, with split form, retain different words.
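One simple way to picture the share-or-split rule of claims 2 and 3 is a prefix-sharing directed graph: identical words at the same position reuse one edge, and differing words branch. The sketch below is only an illustration under that assumption; the names `Node`, `build_compressed_template` and `count_edges` are hypothetical, and the claims do not state that the compression is limited to shared prefixes.

```python
from typing import Dict, List


class Node:
    """A state in the compressed directed graph."""

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}   # word -> next state
        self.template_ids: List[int] = []    # templates that end at this state


def build_compressed_template(templates: List[str]) -> Node:
    """Merge templates word by word (here: one character per word)."""
    root = Node()
    for template_id, template in enumerate(templates):
        node = root
        for word in template:
            # Identical word at the same position: reuse the shared edge.
            # Different word: setdefault creates a new branch (split form).
            node = node.edges.setdefault(word, Node())
        node.template_ids.append(template_id)
    return root


def count_edges(node: Node) -> int:
    """Number of edges in the merged graph, as a rough size measure."""
    return sum(1 + count_edges(child) for child in node.edges.values())
```

For the two templates "abcd" and "abxd", the leading "ab" is stored once and the graph branches at the third word, so the merged graph holds 6 edges instead of the 8 words stored separately.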
4. the method for claim 1, it is characterized in that, described described character string to be checked is mated with a plurality of compression templates under template set respectively according to setting matched rule, the first template in the compression template that obtains matching with described character string to be checked, comprising:
Described character string to be checked Yi Ziwei unit is carried out to Data Division, and the word obtaining after splitting is arranged by graph structure form;
Obtain respectively set to be checked corresponding to character string to be checked after arrangement, and, a plurality of compression template set that described a plurality of compression templates are corresponding;
Calculate respectively the path of mating between described set to be checked and the set of described a plurality of compression template;
From the set of described a plurality of compression template, obtain one with described set to be checked between mate the path of path minimum;
The template of the minimum path indication of obtaining is defined as to described the first template.
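The selection step of claim 4 amounts to scoring every candidate and keeping the minimum. The sketch below illustrates this with plain Levenshtein edit distance over words as the score; this is a substitute chosen for brevity, not the matching path defined in claims 5 and 6, and the names `edit_distance` and `select_first_template` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        cur = [i]
        for j, word_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                       # delete word_a
                cur[j - 1] + 1,                    # insert word_b
                prev[j - 1] + (word_a != word_b),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def select_first_template(query: str, templates: List[str]) -> str:
    """Split query and templates into single-character words and return
    the template with the smallest distance to the query."""
    query_words = list(query)
    return min(templates, key=lambda t: edit_distance(query_words, list(t)))
```

Scoring each template independently ignores the sharing between templates; a search over the merged graph avoids repeating work on common parts.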
5. The method of claim 4, characterized in that the calculating of the matching path between the set to be queried and each of the plurality of compressed-template sets comprises:
Defining a token, the token corresponding to a set v(i, j, h, s), wherein i and j are the states of v in set I and set J, respectively; h is the history path that v has traversed in set I and set J, and s is the matching distance of the history path; wherein set I is the set corresponding to a compressed template, and set J is the set corresponding to the character string to be queried;
Adding a self-loop edge to each state of set I and set J;
Performing a graph expansion search on set I and set J after the self-loop edges have been added, to obtain the accumulated search history and matching distance; and obtaining a distance metric;
Summing the accumulated search history, the matching distance and the distance metric, to obtain the matching path.
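The token-based search of claim 5 can be sketched as a best-first expansion over pairs of states. The code below is one possible reading, not the claimed algorithm: the names `token_search` and `loop_cost`, the fixed self-loop cost, and the 0/1 word distance are all assumptions. A token carries (s, i, j, h); a self-loop on J lets the template advance while the query stays in place, and a self-loop on I lets the query advance while the template stays in place.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

# Template graph I: state -> list of (word, next_state).
Graph = Dict[int, List[Tuple[str, int]]]


def token_search(
    graph: Graph,
    start: int,
    finals: Set[int],
    query: str,
    loop_cost: float = 1.0,
) -> Optional[Tuple[float, List[str]]]:
    """Return (matching distance, template words) of the cheapest path."""
    heap: List[Tuple[float, int, int, Tuple[str, ...]]] = [(0.0, start, 0, ())]
    settled: Dict[Tuple[int, int], float] = {}
    while heap:
        s, i, j, h = heapq.heappop(heap)
        if i in finals and j == len(query):
            return s, list(h)             # first goal token popped is cheapest
        if settled.get((i, j), float("inf")) <= s:
            continue                      # state pair already expanded
        settled[(i, j)] = s
        for word, nxt in graph.get(i, []):
            if j < len(query):
                # Advance in both I and J; 0/1 cost stands in for D(w1, w2).
                d = 0.0 if word == query[j] else 1.0
                heapq.heappush(heap, (s + d, nxt, j + 1, h + (word,)))
            # Self-loop on J: template advances, query stays.
            heapq.heappush(heap, (s + loop_cost, nxt, j, h + (word,)))
        if j < len(query):
            # Self-loop on I: query advances, template stays.
            heapq.heappush(heap, (s + loop_cost, i, j + 1, h))
    return None
```

Because all costs are non-negative, the first token popped in a final state with the query fully consumed has the minimum accumulated distance.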
6. The method of claim 5, characterized in that the distance metric comprises D(w_1, w_2), wherein D(w_1, w_2) indicates the distance metric between word w_1 and word w_2;
The obtaining of the distance metric comprises obtaining the distance metric using the following formula:
$$D(w_1, w_2) = \min_{r} \sum_{k} M\left(x_r^k, y_r^k\right);$$
Wherein x is the phoneme string of word w_1; y is the phoneme string of word w_2; r is an alignment of x and y; $x_r^k$ and $y_r^k$ are the k-th phonemes of x and y under alignment r; and $M(x_r^k, y_r^k)$ is the confusion-matrix entry for the k-th phonemes of x and y under alignment r;
Wherein the alignment comprises: aligning the phoneme string x and the phoneme string y from beginning to end.
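The minimum over alignments in the formula of claim 6 can be computed by dynamic programming rather than by enumerating every alignment r. The sketch below assumes that an alignment may pair a phoneme with a gap symbol and that the confusion matrix also assigns a cost to such pairs; the claim does not specify either point. The names `phoneme_distance`, `toy_confusion` and `GAP`, and all cost values, are illustrative only.

```python
from typing import Callable, List, Sequence

GAP = "-"  # hypothetical gap symbol used when phoneme strings differ in length


def phoneme_distance(
    x: Sequence[str],
    y: Sequence[str],
    confusion: Callable[[str, str], float],
) -> float:
    """min over alignments r of sum_k M(x_r^k, y_r^k), by dynamic programming."""
    n, m = len(x), len(y)
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        dp[a][0] = dp[a - 1][0] + confusion(x[a - 1], GAP)
    for b in range(1, m + 1):
        dp[0][b] = dp[0][b - 1] + confusion(GAP, y[b - 1])
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            dp[a][b] = min(
                dp[a - 1][b - 1] + confusion(x[a - 1], y[b - 1]),  # pair phonemes
                dp[a - 1][b] + confusion(x[a - 1], GAP),           # x phoneme vs gap
                dp[a][b - 1] + confusion(GAP, y[b - 1]),           # gap vs y phoneme
            )
    return dp[n][m]


def toy_confusion(p: str, q: str) -> float:
    """Made-up confusion costs: easily confused phonemes are cheap to swap."""
    if p == q:
        return 0.0
    if {p, q} in ({"n", "l"}, {"s", "sh"}):
        return 0.2
    return 1.0
```

With these toy costs, two words whose phoneme strings differ only in an easily confused initial (for example "n" versus "l") receive a small distance, which is how a confusion matrix can make matching tolerant of recognition errors.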
7. An information query system, characterized by comprising:
A speech recognition module, configured to recognize an input voice signal as text and output the text, to obtain a character string to be queried;
A template matching module, configured to match the character string to be queried against each of a plurality of compressed templates under a template set according to a set matching rule, to obtain a first template, the first template being the template in the compressed templates that matches the character string to be queried; wherein the template set comprises a plurality of templates, and the plurality of templates are shared and merged by directed-graph compression to obtain one or more compressed templates;
An answer generation module, configured to query a knowledge base to obtain the response message corresponding to the first template;
An output module, configured to output the response message by voice and/or text.
8. The system of claim 7, characterized in that the plurality of templates are shared and merged by directed-graph compression through the following modules, to obtain the one or more compressed templates:
An acquisition module, configured to collect a plurality of sample data and split the plurality of sample data into individual words (single characters);
An arrangement module, configured to arrange the words obtained after splitting in graph-structure form according to the semantic order of each of the sample data, to obtain the plurality of templates; wherein the data structure of the plurality of templates is a graph structure;
A template acquisition module, configured to share and merge, according to the shareable-substructure pattern of the graph structure, the plurality of templates that conform to a context-free grammar, to obtain the one or more compressed templates.
9. The system of claim 8, characterized in that
The template acquisition module is specifically configured to share and merge, according to the shareable-substructure pattern of the graph structure, the identical words and/or different words in the plurality of templates that conform to the context-free grammar, to obtain the one or more compressed templates; wherein the data structure of the compressed templates is a compressed directed graph;
Wherein,
When the words at the same position of the respective graph structures of the plurality of templates that conform to the context-free grammar are identical, the identical words are merged in shared form;
When the words at the same position of the respective graph structures of the plurality of templates that conform to the context-free grammar are different, the different words are retained in split form.
10. The system of claim 7, characterized in that the template matching module comprises:
A data splitting module, configured to split the character string to be queried into individual words (single characters) and arrange the words obtained after splitting in graph-structure form;
A set acquisition module, configured to obtain the set to be queried corresponding to the arranged character string to be queried, and the plurality of compressed-template sets corresponding to the plurality of compressed templates;
A computing module, configured to calculate the matching path between the set to be queried and each of the plurality of compressed-template sets;
A path acquisition module, configured to obtain, from the plurality of compressed-template sets, the path whose matching path with the set to be queried is smallest;
A determination module, configured to determine the template indicated by the obtained minimum path as the first template.
CN201410352847.6A 2014-07-23 2014-07-23 Information inquiry method and system Pending CN104199825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410352847.6A CN104199825A (en) 2014-07-23 2014-07-23 Information inquiry method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410352847.6A CN104199825A (en) 2014-07-23 2014-07-23 Information inquiry method and system

Publications (1)

Publication Number Publication Date
CN104199825A true CN104199825A (en) 2014-12-10

Family

ID=52085118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410352847.6A Pending CN104199825A (en) 2014-07-23 2014-07-23 Information inquiry method and system

Country Status (1)

Country Link
CN (1) CN104199825A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5706398A (en) * 1995-05-03 1998-01-06 Assefa; Eskinder Method and apparatus for compressing and decompressing voice signals, that includes a predetermined set of syllabic sounds capable of representing all possible syllabic sounds
EP1324213A2 (en) * 2001-12-05 2003-07-02 Microsoft Corporation Grammar authoring system
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN101464896A (en) * 2009-01-23 2009-06-24 安徽科大讯飞信息科技股份有限公司 Voice fuzzy retrieval method and apparatus
CN102968409A (en) * 2012-11-23 2013-03-13 海信集团有限公司 Intelligent human-machine interaction semantic analysis method and interaction system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
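李伟 (Li Wei): "基于内容的汉语语音检索技术研究与系统实现" [Research and System Implementation of Content-Based Chinese Speech Retrieval Technology], 《中国博士学位论文全文数据库 信息科技辑(月刊)》 [China Doctoral Dissertations Full-text Database, Information Science and Technology (monthly)] *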

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107430616A (en) * 2015-03-13 2017-12-01 微软技术许可有限责任公司 The interactive mode of speech polling re-forms
CN107430616B (en) * 2015-03-13 2020-12-29 微软技术许可有限责任公司 Interactive reformulation of voice queries
CN107430618A (en) * 2015-03-20 2017-12-01 谷歌公司 Realize the system and method interacted with host computer device progress user speech
CN106547785A (en) * 2015-09-22 2017-03-29 阿里巴巴集团控股有限公司 Information getting method and system in knowledge base
CN106547785B (en) * 2015-09-22 2020-08-04 阿里巴巴集团控股有限公司 Method and system for acquiring information in knowledge base
JP2018525717A (en) * 2016-01-12 2018-09-06 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Search processing method and device
US10713302B2 (en) 2016-01-12 2020-07-14 Tencent Technology (Shenzhen) Company Limited Search processing method and device
CN106959976A (en) * 2016-01-12 2017-07-18 腾讯科技(深圳)有限公司 A kind of search processing method and device
WO2017121355A1 (en) * 2016-01-12 2017-07-20 腾讯科技(深圳)有限公司 Search processing method and device
CN106959976B (en) * 2016-01-12 2020-08-14 腾讯科技(深圳)有限公司 Search processing method and device
CN106471502A (en) * 2016-06-29 2017-03-01 深圳狗尾草智能科技有限公司 Intension recognizing method based on water conservancy diversion and system
WO2018000279A1 (en) * 2016-06-29 2018-01-04 深圳狗尾草智能科技有限公司 Diversion-based intention recognition method and system
CN109344237B (en) * 2016-08-23 2020-11-17 上海智臻智能网络科技股份有限公司 Information processing method and device for man-machine interaction
WO2018103490A1 (en) * 2016-12-06 2018-06-14 Authpaper Limited A method and system for compressing data
CN109661779A (en) * 2016-12-06 2019-04-19 奥斯佩普尔有限公司 Method and system for compressed data
CN109661779B (en) * 2016-12-06 2023-12-26 奥斯佩普尔有限公司 Method and system for compressing data
CN109885733B (en) * 2019-01-18 2020-09-15 清华大学 Graph data compression method and device for target spanning tree query
CN109885733A (en) * 2019-01-18 2019-06-14 清华大学 The diagram data compression method and device of tree query are generated for target
CN111464707A (en) * 2020-03-30 2020-07-28 中国建设银行股份有限公司 Outbound call processing method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141210