CN113674734B - Information query method and system based on voice recognition, equipment and storage medium - Google Patents
Info
- Publication number
- CN113674734B CN113674734B CN202110971706.2A CN202110971706A CN113674734B CN 113674734 B CN113674734 B CN 113674734B CN 202110971706 A CN202110971706 A CN 202110971706A CN 113674734 B CN113674734 B CN 113674734B
- Authority
- CN
- China
- Prior art keywords
- decoding
- results
- recognition
- text
- outputting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses an information query method, system, computer device, and computer-readable storage medium based on speech recognition. The technical scheme of the method comprises the following steps: an encoding step, in which audio data are input, feature values are extracted with a Transformer encoder, and a two-dimensional feature sequence is output; a decoding step, in which streaming speech decoding and recognition is first performed with a decoder combining a Transformer and an N-gram on the basis of the two-dimensional feature sequence, and M first-pass text-ranking results are screened and output, after which non-streaming speech decoding and recognition is performed with a Conformer model on the basis of the two-dimensional feature sequence and the M first-pass results, and N second-pass text-ranking results are output, where N and M are positive integers greater than or equal to 1 and N ≤ M; and a weighting step, in which the text-ranking results output by the decoding step are weighted against a hotword dictionary and the best query result is output. The invention improves speech recognition accuracy by adding language models and a hotword weighting function.
Description
Technical Field
The present disclosure relates to the field of information query, and in particular, to a method, a system, a computer device, and a computer readable storage medium for information query based on speech recognition.
Background
At present, with the development of speech recognition technology, speech recognition enjoys high user acceptance: it is convenient to operate and does not touch on user privacy, so applications based on speech recognition are easy to popularize.
By usage scenario, speech recognition can be classified into streaming and non-streaming recognition. Non-streaming speech recognition (offline recognition) performs recognition after the user has finished speaking a sentence or a segment of speech, while streaming speech recognition performs recognition while the user is still speaking. Because of its low latency, streaming speech recognition has a wide range of industrial applications, such as dictation and transcription.
After being proposed in the field of natural language processing, the Transformer model has been extended to many fields such as computer vision and speech. The Transformer model achieves good accuracy in streaming speech recognition scenarios.
N-gram is a language model commonly used in large-vocabulary continuous speech recognition; it is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text, byte by byte, producing a sequence of byte fragments of length N.
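As a minimal illustration of the sliding-window idea (character-level here, whereas the description above operates on bytes of text):

```python
def ngrams(text: str, n: int):
    """Slide a window of size n over the text, producing all fragments of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

A bigram model, for example, would estimate the probability of each fragment's last symbol given the preceding one from the counts of these fragments in a training corpus.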
Currently, the unified streaming/non-streaming speech recognition model used in the prior art usually consists of a shared encoder, a CTC decoder, and an attention decoder, where the shared encoder comprises multiple Transformer or Conformer layers, the CTC decoder comprises one fully connected layer and one softmax layer, and the attention decoder comprises multiple Transformer layers.
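For illustration, the CTC decoder branch named above (one fully connected layer plus a softmax layer) can be sketched with NumPy; the shapes and weight names are assumptions for the sketch, not taken from the patent:

```python
import numpy as np

def ctc_head(encoder_out: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One fully connected layer followed by a per-frame softmax over the
    vocabulary (including the CTC blank symbol)."""
    logits = encoder_out @ W + b                             # shape (T, vocab_size)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)                 # per-frame distributions
```

Each row of the result is a probability distribution over output symbols for one encoder frame, which is what a CTC prefix beam search consumes.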
However, such a unified streaming/non-streaming model does not handle domain-specific terms or speech in different contexts well; at the same time, prior-art speech recognition methods estimate the probability of uncommon words poorly and cannot add weight to specific terms, so the recognition accuracy of proper nouns suffers.
The invention provides a solution for information query based on speech recognition: an algorithm model combining streaming and non-streaming recognition, together with two language models applied during decoding and hotword weighting applied after decoding, to increase the recognition rate of specific proper nouns and domain-specific phrasing.
Disclosure of Invention
The embodiment of the application provides an information query solution based on speech recognition, which realizes an algorithm model combining streaming and non-streaming speech recognition, together with two language models applied during decoding and hotword weighting applied after decoding, so as to increase the recognition rate of specific proper nouns and domain-specific phrasing.
In a first aspect, an embodiment of the present application provides an information query method based on speech recognition, adopting mixed streaming and non-streaming speech recognition. The method includes:
an encoding step: inputting audio data, extracting feature values with a Transformer encoder, and outputting a two-dimensional feature sequence;
a decoding step: performing streaming speech decoding and recognition with a decoder combining a Transformer and an N-gram on the basis of the two-dimensional feature sequence, then screening and outputting M first-pass text-ranking results; performing non-streaming speech decoding and recognition with a Conformer model on the basis of the two-dimensional feature sequence and the M first-pass results, and outputting N second-pass text-ranking results, where N and M are positive integers greater than or equal to 1 and N ≤ M;
an assignment weighting step: among the text-ranking results output by the decoding step, performing weighting against a hotword dictionary and outputting the best query result.
In some embodiments, the decoding step includes:
a Transformer and n-gram combined decoding step: a Transformer model based on a corpus of everyday terms and an n-gram model based on a domain-specific corpus jointly perform decoding and recognition; a prefix beam search is used to screen and rank the decoding results, and the M first-pass text-ranking results are output.
In some embodiments, the decoding step further includes:
a Conformer decoding step: on the basis of the two-dimensional feature sequence and the M first-pass text-ranking results, a Conformer model decodes and recognizes the combined speech and context and outputs the N second-pass text-ranking results, where N is a positive integer greater than or equal to 1 and N ≤ M.
In some embodiments, the assigning a weight step includes:
a traversing step: matching the text-ranking results output by the decoding step against the hotword dictionary one by one;
a weighting step: if a hotword matches, a weighted score is added to the matched text; finally, the item with the highest score is selected as the best output result.
In a second aspect, an embodiment of the present application provides an information query system based on speech recognition, adopting mixed streaming and non-streaming speech recognition and using any of the above information query methods. The system includes:
an encoding module: inputs audio data, extracts feature values with a Transformer encoder, and outputs a two-dimensional feature sequence;
a decoding module: performs streaming speech decoding and recognition with a decoder combining a Transformer and an N-gram on the basis of the two-dimensional feature sequence, then screens and outputs M first-pass text-ranking results; performs non-streaming speech decoding and recognition with a Conformer model on the basis of the two-dimensional feature sequence and the M first-pass results, and outputs N second-pass text-ranking results, where N and M are positive integers greater than or equal to 1 and N ≤ M;
an assignment weighting module: among the text-ranking results output by the decoding module, performs weighting against the hotword dictionary and outputs the best query result.
In some embodiments, the decoding module includes:
a Transformer and n-gram combined decoding module: a Transformer model based on a corpus of everyday terms and an n-gram model based on a domain-specific corpus jointly perform decoding and recognition; a prefix beam search screens and ranks the decoding results and outputs the M first-pass text-ranking results.
In some embodiments, the decoding module further includes:
a Conformer decoding module: on the basis of the two-dimensional feature sequence and the M first-pass text-ranking results, a Conformer model decodes and recognizes the combined speech and context and outputs the N second-pass text-ranking results, where N is a positive integer greater than or equal to 1 and N ≤ M.
In some embodiments, the assignment weighting module includes:
a traversing module: matches the text-ranking results output by the decoding module against the hotword dictionary one by one;
a weighting module: if a hotword matches, a weighted score is added to the matched text; finally, the item with the highest score is selected as the best output result.
In a third aspect, embodiments of the present application provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the speech recognition based information query method according to the first aspect as described above when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition based information query method as described in the first aspect above.
Compared with the related prior art, the method has the following outstanding beneficial effects:
1. the model adopts a combined streaming/non-streaming speech recognition algorithm model;
2. in the invention, a language model based on a Transformer and an n-gram is added in the prefix beam search; in the preliminary screening stage the language model is trained with a corpus rich in domain-specific text, increasing the recall rate of specific nouns and phrasing;
3. the invention uses a Transformer model, which improves probability estimation for unseen words, and an n-gram model, which makes word weights in the decoding process more controllable, for example by increasing the weight of specific words in the n-gram, thereby improving the recognition accuracy of specific nouns;
4. the invention adds language models and a hotword weighting module during and after decoding to improve recognition accuracy in specialized speech recognition domains.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a method for querying information based on speech recognition according to the present invention;
FIG. 2 is a schematic diagram of a decoding step according to the present invention;
FIG. 3 is a schematic diagram of the assignment weighting steps of the present invention;
FIG. 4 is a schematic diagram of an embodiment of the method of the present invention;
FIG. 5 is a schematic diagram of an information query system based on speech recognition according to the present invention;
fig. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present application.
In the above figures:
- 100: information query system
- 10: encoding module; 20: decoding module; 30: assignment weighting module
- 81: processor; 82: memory; 83: communication interface; 80: bus
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art without creative effort, based on the embodiments provided herein, fall within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The present application relates to a method, system, device, and computer-readable storage medium for information query based on speech recognition, and aims to provide a speech recognition model combining streaming and non-streaming recognition. The streaming speech recognition model can return recognition results in real time while the audio stream is being processed, so that the query device can return query results in real time; compared with the streaming model, the non-streaming model's recognition results are more accurate and are used to correct the streaming results. The overall algorithm model comprises an encoder module, a decoder module, and a hotword weighting module.
In the encoder stage, a Transformer is used; the Transformer encoder is more effective at capturing long-range dependencies in long sequences. The input of the encoder stage is speech data, and the output is an extracted two-dimensional feature sequence whose first dimension T is the number of speech frames.
In the decoder stage, two language models are added: a Transformer-based language model and an n-gram language model. The Transformer recognizes feature values on the basis of an everyday general dictionary, enhancing the recognition accuracy of everyday expressions; the n-gram language model is trained with a domain-specific corpus, increasing the recognition accuracy of specific technical terms, and a text result is output. To increase recall in the decoder stage, a decoding procedure based on prefix beam search is added. The Transformer uses a huge corpus during training and has the advantage of high recognition accuracy for everyday expressions; the n-gram uses a large amount of specialized corpus during training and is more sensitive to specific technical terms.
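One common way to combine two language models with the acoustic score inside a prefix beam search is shallow fusion. The sketch below illustrates the idea under assumed interpolation weights alpha and beta; the patent does not specify how the three scores are combined, so this is an assumption, not the authors' exact formula.

```python
def fused_score(am_logprob: float,
                transformer_lm_logprob: float,
                ngram_lm_logprob: float,
                alpha: float = 0.3, beta: float = 0.2) -> float:
    """Shallow fusion: acoustic log-probability plus weighted log-probabilities
    from the Transformer LM and the n-gram LM (alpha/beta are assumed weights)."""
    return am_logprob + alpha * transformer_lm_logprob + beta * ngram_lm_logprob

def prune(hypotheses, m: int):
    """Keep the M best (text, score) hypotheses, as in the first-pass screening."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:m]
```

During the beam search, each candidate prefix would be scored with `fused_score` and the beam pruned to M entries with `prune` at every step.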
Furthermore, the algorithm model of the invention also designs a Conformer speech model that combines speech features and context and continuously screens out the n best decoding paths (n ≥ 1).
The input of the decoder is the feature sequence output by the encoder, and the output is the n best decoding results.
After the decoder, a hotword weighting module is designed: a speech hotword dictionary is configured, the n best decoding paths are weighted and scored according to the hotwords they contain, and the item with the highest score is selected as the recognition result of the model. The input of this module is the n best decoded sentences, and the output is the single best decoded sentence, i.e. the final output result.
FIG. 1 shows the information query method based on speech recognition of the present invention, FIG. 2 shows the decoding step, and FIG. 3 shows the assignment weighting step. As shown in FIGS. 1-3, this embodiment provides an information query method based on speech recognition, adopting mixed streaming and non-streaming speech recognition. The method includes:
encoding step S10: inputting audio data, extracting feature values with a Transformer encoder, and outputting a two-dimensional feature sequence;
decoding step S20: performing streaming speech decoding and recognition with a decoder combining a Transformer and an N-gram on the basis of the two-dimensional feature sequence, then screening and outputting M first-pass text-ranking results; performing non-streaming speech decoding and recognition with a Conformer model on the basis of the two-dimensional feature sequence and the M first-pass results, and outputting N second-pass text-ranking results, where N and M are positive integers greater than or equal to 1 and N ≤ M;
assignment weighting step S30: among the text-ranking results output by the decoding step, performing weighting against the hotword dictionary and outputting the best query result.
Wherein, the decoding step S20 includes:
the Transformer and n-gram combined decoding step S21: a Transformer model based on a corpus of everyday terms and an n-gram model based on a domain-specific corpus jointly perform decoding and recognition; a prefix beam search screens and ranks the decoding results and outputs the M first-pass text-ranking results.
Further, the decoding step S20 further includes:
Conformer decoding step S22: on the basis of the two-dimensional feature sequence and the M first-pass text-ranking results, a Conformer model decodes and recognizes the combined speech and context and outputs the N second-pass text-ranking results, where N is a positive integer greater than or equal to 1 and N ≤ M.
In some embodiments, the assigning weighting step S30 includes:
traversing step S31: matching the text-ranking results output by the decoding step against the hotword dictionary one by one;
weighting step S32: if a hotword matches, a weighted score is added to the matched text; finally, the item with the highest score is selected as the best output result.
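Putting steps S10-S32 together, the method can be sketched as a three-stage pipeline. The encoder/decoder callables, the beam sizes m and n, and the fixed hotword bonus of 1.0 are placeholders standing in for the models described above, not details taken from the patent:

```python
from typing import Callable, List, Set, Tuple

Hyp = Tuple[str, float]  # (decoded text, score)

def speech_query(audio,
                 encode: Callable,          # S10: Transformer encoder
                 stream_decode: Callable,   # S21: Transformer + n-gram first pass
                 rescore: Callable,         # S22: Conformer second pass
                 hotwords: Set[str],
                 m: int = 10, n: int = 5) -> str:
    feats = encode(audio)                           # two-dimensional feature sequence
    first: List[Hyp] = stream_decode(feats)[:m]     # M first-pass ranking results
    second: List[Hyp] = rescore(feats, first)[:n]   # N second-pass results (n <= m)
    # S31/S32: hotword traversal and weighting; pick the highest-scoring item
    return max(second,
               key=lambda h: h[1] + sum(1.0 for w in hotwords if w in h[0]))[0]
```

Note how a hotword match can overturn the decoder's ranking: a lower-scored candidate containing more hotwords can win the final selection.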
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings:
In a specific embodiment of the invention, a railway passenger-rules query device based on speech recognition is realized, supporting passengers in making voice queries about railway passenger rules. A voice command can replace many cumbersome steps: with the voice query device, most questions can be answered without passengers typing text to search or calling the 12306 customer service line, greatly shortening query time. The device not only simplifies the passenger-rules query process but also relieves the pressure on 12306 manual customer service, an innovation beneficial to both sides.
The trip-guide column of the railway 12306 official website comprises four modules, namely common questions, travel notices, relevant regulations, and railway insurance, basically covering all railway passenger rules. Although each module is classified, indexing by category presents certain difficulties for passengers unfamiliar with the classification. The official website provides a search box, but after entering keywords the corresponding question must still be selected in a drop-down box to obtain an answer. The whole operation takes a long time, and the website is most convenient to use at a computer, so it is not the best choice for travelers on the go. The railway 12306 App also contains the above information but provides no search box, so passengers can only browse by category and may well fail to find the category of their question, which causes much inconvenience. Many passengers choose to dial 12306 for manual service, but because manual service capacity is limited they often have to queue, may not get through at all, and cannot obtain answers immediately.
Aiming at the large number of railway proper nouns, the invention designs a speech recognition algorithm that improves recognition of railway-specific terms; three improvements are added at two places in the algorithm model, namely two language models in the decoding process and hotword weighting after decoding, to increase the recognition rate of railway proper nouns and railway-specific phrasing. The algorithm model combines streaming and non-streaming recognition. Streaming: the speech recognition system outputs recognition results in real time while the user speaks. Non-streaming: after a segment of speech is finished, the system outputs the recognition result.
A language model and a hotword weighting module are added during and after decoding to improve recognition accuracy in the railway speech recognition domain.
Fig. 4 is a schematic flow chart of a method embodiment of the present invention, and as shown in fig. 4, the embodiment of the present invention proposes:
1. A language model based on the combination of a Transformer and an n-gram is added in the prefix beam search, and the language model is trained with a corpus in which railway text occupies a larger share, so that the recall rate of railway proper nouns and domain-specific utterances is increased at the preliminary screening stage. The advantage is that using a Transformer model increases the accuracy of probability estimates for words that have not been seen, while using an n-gram model makes the word weights in the decoding process more controllable: for example, increasing the weights of specific railway terms written into the n-gram improves the recognition accuracy of railway proper nouns. The Transformer is trained on a huge general corpus and therefore recognizes everyday expressions with high accuracy; the n-gram is trained on a large railway-specific corpus and is therefore more sensitive to railway-specific terminology.
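Combining the two language models in this way amounts to interpolating their log-probabilities when scoring candidate words (often called shallow fusion). The following toy sketch, in which the vocabularies, probabilities, and interpolation weight are all illustrative assumptions, shows how a railway term that is rare in the general corpus still receives a competitive fused score thanks to the domain n-gram:

```python
import math

# Toy stand-ins for the two language models described above. The general
# table plays the Transformer's role (everyday expressions); the railway
# table plays the n-gram's role (domain-specific corpus with boosted terms).
GENERAL_LM = {"ticket": 0.01, "refund": 0.005, "the": 0.05}
RAILWAY_NGRAM = {"refund": 0.05, "waiting hall": 0.04}


def lm_score(table, word, floor=1e-6):
    """Log-probability lookup with a floor probability for unseen words."""
    return math.log(table.get(word, floor))


def fused_score(word, lam=0.5):
    """Log-linear interpolation (shallow fusion) of the two LM scores."""
    return lam * lm_score(GENERAL_LM, word) + (1 - lam) * lm_score(RAILWAY_NGRAM, word)
```

Under these assumed tables, domain words such as "refund" or "waiting hall" out-score a general word that the railway n-gram has never seen, which is the effect the patent attributes to the combined model.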
2. When decoding finishes, during the sorting of the beams, weighting based on a special dictionary is performed. The weighting process is:
1) Traverse the text results of the top-N ranked sentences;
2) Search whether each result contains text from the hot word dictionary;
3) If a result contains text from the dictionary, increase that sentence's score;
4) Sum all scores for each sentence. The more dictionary hot words a sentence contains, the higher its score, so sentences containing railway-specific terms tend to be recognized as better.
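Steps 1) to 4) above can be sketched as a simple N-best re-ranking routine (a sketch under the assumptions that each hot word counts once per hypothesis and that the bonus value is a tunable constant; both are illustrative choices, not values from the patent):

```python
def hotword_rescore(nbest, hotwords, bonus=0.5):
    """Re-rank an N-best list of (text, score) pairs: every hot word found
    in a hypothesis adds `bonus` to its score, then the list is re-sorted."""
    def boosted(item):
        text, score = item
        hits = sum(1 for w in hotwords if w in text)  # substring match, as for hot word lookup
        return (text, score + bonus * hits)
    return sorted((boosted(it) for it in nbest), key=lambda p: p[1], reverse=True)
```

For example, with hot words {"ticket", "gate"}, a hypothesis containing both terms gains twice the bonus and can overtake an acoustically higher-scoring hypothesis that contains neither.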
The embodiment of the present application further provides an information query system based on voice recognition, which adopts mixed streaming and non-streaming voice recognition and adopts any one of the above information query methods based on voice recognition. Fig. 5 is a schematic diagram of the information query system based on voice recognition according to the present invention. As shown in Fig. 5, the system 100 includes:
encoding module 10: inputting audio data, extracting feature values by using a Transformer encoder, and outputting a two-dimensional feature value sequence;
decoding module 20: performing streaming voice decoding and recognition by adopting a decoder combining a Transformer and an N-gram based on the two-dimensional feature value sequence, then screening and outputting M results of a first text ranking; performing non-streaming voice decoding and recognition by adopting a Transformer model based on the two-dimensional feature value sequence and the M results of the first text ranking, and outputting N results of a second text ranking, wherein N and M are positive integers greater than or equal to 1, and N is less than or equal to M;
assignment weighting module 30: among the text ranking results output by the decoding module 20, performing assignment weighting based on the hot word dictionary and outputting the best query result.
The decoding module 20 includes:
a Transformer and n-gram combined decoding module: the Transformer model, based on an everyday-expression corpus, combined with the n-gram model, based on a domain-specific corpus, performs decoding recognition; prefix beam search is adopted to screen and sort the decoding recognition results, and the M results of the first text ranking are output.
Further, the decoding module 20 further includes:
a Conformer decoding module: based on the two-dimensional feature value sequence and the M results of the first text ranking, a Conformer model decodes and recognizes the combined speech and context, and outputs the N results of the second text ranking, wherein N is a positive integer greater than or equal to 1 and less than or equal to M.
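The interaction between the streaming first pass and the non-streaming second pass can be sketched as a generic two-pass decode (the function names and the representation of candidates as (text, score) pairs are assumptions of this sketch):

```python
def two_pass_decode(features, first_pass, second_pass_score, m=10, n=5):
    """First (streaming) pass proposes candidates; the top M survive screening.
    The second (non-streaming) pass rescores those M candidates with access to
    the full utterance, and the top N rescored results are returned."""
    candidates = first_pass(features)                                   # [(text, score)]
    top_m = sorted(candidates, key=lambda p: p[1], reverse=True)[:m]
    rescored = [(text, second_pass_score(features, text)) for text, _ in top_m]
    return sorted(rescored, key=lambda p: p[1], reverse=True)[:n]
```

Because the second pass sees the whole feature sequence and the first-pass text together, it can promote a candidate that the streaming pass ranked lower, which is the point of keeping M greater than or equal to N.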
In some embodiments, the assignment weighting module 30 includes:
a traversal module: performing hot word traversal matching, one by one, between the text ranking results output by the decoding module and the hot word dictionary;
a weighting module: if a hot word is matched successfully, adding a weighted score to the successfully matched text, and finally selecting the item with the highest score as the best output result.
An embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the voice recognition-based information query method according to the first aspect when executing the computer program.
An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the speech recognition-based information query method as described in the first aspect above.
In addition, the voice recognition-based information query method of the embodiment of the present application described in connection with fig. 1 may be implemented by a computer device. Fig. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present application.
The computer device may include a processor 81 and a memory 82 storing computer program instructions.
In particular, the processor 81 may include a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Memory 82 may include mass storage for data or instructions. By way of example, and not limitation, memory 82 may comprise a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), an Electrically Alterable ROM (EAROM), or FLASH memory, or a combination of two or more of these. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode DRAM (FPM DRAM), an Extended Data Out DRAM (EDO DRAM), a Synchronous DRAM (SDRAM), or the like, as appropriate.
Memory 82 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 81.
The processor 81 implements any of the information query methods based on voice recognition of the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected to each other through the bus 80 and perform communication with each other.
The communication interface 83 is used to implement communications between various modules, devices, units, and/or units in embodiments of the present application. The communication interface 83 may also enable communication with other components such as: and the external equipment, the image/data acquisition equipment, the database, the external storage, the image/data processing workstation and the like are used for data communication.
Bus 80 includes hardware, software, or both, coupling components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
The computer device may implement the information query method based on voice recognition described in connection with Fig. 1.
Compared with the prior art, the present application adopts an algorithm model combining streaming and non-streaming speech recognition; a language model based on the combination of a Transformer and an n-gram is added in the prefix beam search, and the language model is trained with a corpus in which railway text occupies a larger share, so as to increase the recall rate of railway proper nouns and domain-specific utterances at the preliminary screening stage; using a Transformer model increases the accuracy of probability estimates for unseen words, while using an n-gram model makes the word weights in the decoding process more controllable, for example by increasing the weights of specific railway terms written into the n-gram, thereby improving the recognition accuracy of railway proper nouns; and a language model and a hot word weighting module are added during and after decoding, so as to improve recognition accuracy in the railway speech recognition field.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples merely represent several embodiments of the present application, and although they are described in some detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.
Claims (10)
1. An information query method based on voice recognition, adopting mixed streaming and non-streaming voice recognition, characterized in that the method comprises the following steps:
encoding: inputting audio data, extracting feature values by using a Transformer encoder, and outputting a two-dimensional feature value sequence;
decoding: performing streaming voice decoding and recognition by adopting a decoder combining a Transformer and an N-gram based on the two-dimensional feature value sequence, then screening and outputting M results of a first text ranking; performing non-streaming voice decoding and recognition by adopting a Transformer model based on the two-dimensional feature value sequence and the M results of the first text ranking, and outputting N results of a second text ranking, wherein N and M are positive integers greater than or equal to 1, and N is less than or equal to M;
assignment weighting: among the text ranking results output by the decoding step, performing assignment weighting based on a hot word dictionary, and outputting the best query result.
2. The voice recognition based information query method of claim 1, wherein the decoding step comprises:
a Transformer and n-gram combined decoding step: the Transformer model, based on an everyday-expression corpus, combined with the n-gram model, based on a domain-specific corpus, performs decoding recognition; prefix beam search is adopted to screen and sort the decoding recognition results, and the M results of the first text ranking are output.
3. The voice recognition based information query method of claim 2, wherein the decoding step further comprises:
a Conformer decoding step: based on the two-dimensional feature value sequence and the M results of the first text ranking, adopting a Conformer model to decode and recognize the combined speech and context, and outputting the N results of the second text ranking, wherein N is a positive integer greater than or equal to 1 and less than or equal to M.
4. The voice recognition based information query method of claim 1, wherein the assignment weighting step comprises:
traversing: performing hot word traversal matching, one by one, between the text ranking results output by the decoding step and the hot word dictionary;
weighting: if a hot word is matched successfully, adding a weighted score to the successfully matched text, and finally selecting the item with the highest score as the best output result.
5. An information query system based on voice recognition, adopting mixed streaming and non-streaming voice recognition and adopting the information query method based on voice recognition according to any one of claims 1 to 4, the system comprising:
an encoding module: inputting audio data, extracting feature values by using a Transformer encoder, and outputting a two-dimensional feature value sequence;
a decoding module: performing streaming voice decoding and recognition by adopting a decoder combining a Transformer and an N-gram based on the two-dimensional feature value sequence, then screening and outputting M results of a first text ranking; performing non-streaming voice decoding and recognition by adopting a Transformer model based on the two-dimensional feature value sequence and the M results of the first text ranking, and outputting N results of a second text ranking, wherein N and M are positive integers greater than or equal to 1, and N is less than or equal to M;
an assignment weighting module: among the text ranking results output by the decoding module, performing assignment weighting based on the hot word dictionary, and outputting the best query result.
6. The speech recognition based information query system of claim 5, wherein the decoding module comprises:
a Transformer and n-gram combined decoding module: the Transformer model, based on an everyday-expression corpus, combined with the n-gram model, based on a domain-specific corpus, performs decoding recognition; prefix beam search is adopted to screen and sort the decoding recognition results, and the M results of the first text ranking are output.
7. The speech recognition based information query system of claim 6, wherein the decoding module further comprises:
a Conformer decoding module: based on the two-dimensional feature value sequence and the M results of the first text ranking, adopting a Conformer model to decode and recognize the combined speech and context, and outputting the N results of the second text ranking, wherein N is a positive integer greater than or equal to 1 and less than or equal to M.
8. The speech recognition based information query system of claim 5, wherein the assignment weighting module comprises:
a traversal module: performing hot word traversal matching, one by one, between the text ranking results output by the decoding module and the hot word dictionary;
a weighting module: if a hot word is matched successfully, adding a weighted score to the successfully matched text, and finally selecting the item with the highest score as the best output result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech recognition based information query method according to any of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the speech recognition based information query method as claimed in any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971706.2A CN113674734B (en) | 2021-08-24 | 2021-08-24 | Information query method and system based on voice recognition, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113674734A CN113674734A (en) | 2021-11-19 |
CN113674734B true CN113674734B (en) | 2023-08-01 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116052725B (en) * | 2023-03-31 | 2023-06-23 | 四川大学华西医院 | Fine granularity borborygmus recognition method and device based on deep neural network |
CN117437909B (en) * | 2023-12-20 | 2024-03-05 | 慧言科技(天津)有限公司 | Speech recognition model construction method based on hotword feature vector self-attention mechanism |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11055900B1 (en) * | 2020-02-28 | 2021-07-06 | Weta Digital Limited | Computer-generated image processing including volumetric scene reconstruction to replace a designated region |
CN113223534A (en) * | 2021-04-27 | 2021-08-06 | 西北工业大学 | Self-organizing microphone array voice recognition channel selection method |
CN113257248A (en) * | 2021-06-18 | 2021-08-13 | 中国科学院自动化研究所 | Streaming and non-streaming mixed voice recognition system and streaming voice recognition method |
Non-Patent Citations (3)
Title |
---|
Analysis of the Novel Transformer Module Combination for Scene Text Recognition; Yeon-Gyu Kim et al.; IEEE; full text *
Conformer: Convolution-augmented transformer for speech recognition; Anmol Gulati et al.; arXiv:2005.08100v1 [eess.AS]; full text *
Research on end-to-end speech translation based on adversarial training; He Wenlong et al.; Signal Processing; Vol. 37, No. 05; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||