CN113299283B - Speech recognition method, system, apparatus and medium


Info

Publication number
CN113299283B
Authority
CN
China
Prior art keywords
corpus
field
training
domain
sentence
Prior art date
Legal status
Active
Application number
CN202110470132.0A
Other languages
Chinese (zh)
Other versions
CN113299283A (en)
Inventor
白蒙蒙 (Bai Mengmeng)
Current Assignee
Shanghai Qiyue Information Technology Co Ltd
Original Assignee
Shanghai Qiyue Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Qiyue Information Technology Co Ltd
Priority to CN202110470132.0A
Publication of CN113299283A
Application granted
Publication of CN113299283B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/26 Speech to text systems


Abstract

The invention relates to the field of speech recognition and provides a speech recognition method, system, apparatus and medium. It addresses defects of existing speech recognition such as wasted computing resources, the inability to hot-switch among multiple models, and the inability of single-domain models to adapt to long-dialog recognition, and aims at the technical problem of how to provide speech recognition services for different domains through a deep-learning-based dynamic language model driven by domain information. To this end, the method of the invention incorporates corpus domain information into the prediction process of the constructed speech recognition model, thereby providing a speech recognition service suited to effective hot switching and long dialogs across multiple domains. It improves the performance of existing speech recognition services, effectively reduces resource waste, correctly recognizes long dialogs that cross different domains, and realizes hot switching of recognition; it is simple to implement, easy to operate, low-cost and efficient.

Description

Speech recognition method, system, device and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, system, apparatus, and medium.
Background
In speech recognition, the main process is to extract the acoustic features of a segment of speech with an acoustic model, and then translate those acoustic features into the corresponding text with a language model. Because of homophones, different language models are trained for different application scenarios so that recognition can adapt to the specific domain. As shown in fig. 1: the speech to be recognized is converted into the units corresponding to the acoustic model (pinyin and tone number in the example), and the acoustic features are extracted through the acoustic model; for example, the acoustic output "jian3 yi4 gong1 zuo4" may be translated as "simple work" (简易工作) by one language model, while an epidemic-prevention language model may translate it as "quarantine work" (检疫工作). Thus, in order to provide support in many domains, numerous language models must be trained and numerous servers used to provide service support. Furthermore, in the manner shown in fig. 2, front-end judgment logic is required to determine which specific back-end server port (i.e., which of the models for the different domains/scenarios) should be called for service, and the back end outputs the accurate translation result matching that domain/scenario.
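As a toy illustration of the homophone problem (the lookup tables below are hypothetical stand-ins, not anything from the patent), the same acoustic-model units can decode to different text depending on which domain's language model is consulted:

```python
# Toy sketch: the same pinyin+tone units decode differently per domain.
# The dict "language models" here are hypothetical stand-ins.
GENERAL_LM = {"jian3 yi4 gong1 zuo4": "简易工作"}    # "simple work"
EPIDEMIC_LM = {"jian3 yi4 gong1 zuo4": "检疫工作"}   # "quarantine work"

def translate(units: str, domain_lm: dict) -> str:
    """Return the text a given domain's language model assigns to the units."""
    return domain_lm.get(units, "<unk>")

print(translate("jian3 yi4 gong1 zuo4", GENERAL_LM))   # 简易工作 (simple work)
print(translate("jian3 yi4 gong1 zuo4", EPIDEMIC_LM))  # 检疫工作 (quarantine work)
```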
The above prior-art approach has many drawbacks, for example: 1. high service cost: a service engine needs at least a 2C4G configuration, and one domain's language model may need more than one server once concurrency is considered; 2. wasted computing resources: call volumes differ across domains, some small and some large, so the computing resources deployed for a rarely-called language model are largely wasted; 3. high labor cost: for example, when a language model serving a cold domain is to be switched over to serve a hot domain, manual maintenance is required; 4. a single-domain model often cannot adapt to long-dialog recognition tasks.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a speech recognition method, system, apparatus and medium, in order to solve the technical problem of how a deep-learning-based dynamic language model can adapt to recognition services in different domains. Further, it can solve the problem of providing recognition services for different domains through a deep-learning-based dynamic language model, according to given domain information and/or domain information judged from the dialog so far, thereby reducing excessive server deployment. Further, it can solve the technical problems of reducing wasted computing resources, reducing labor cost, and effectively providing recognition results of different domains for long-dialog recognition.
In order to solve the above technical problem, a first aspect of the present invention provides a speech recognition method, including: according to a speech recognition service request, obtaining the acoustic features of the speech to be recognized and the domain information corresponding to the acoustic features; and switching, with a deep-learning-based dynamic language model, to the domain corresponding to the domain information to recognize the acoustic features, so as to determine the text recognition result corresponding to the speech to be recognized.
Preferably, obtaining, according to the speech recognition service request, the acoustic features of the speech to be recognized and the domain information corresponding to the acoustic features specifically includes: in the service request, converting the speech to be recognized into the units corresponding to an acoustic model so as to construct the corpus of the current sentence, as the acoustic features; and judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora, as the domain information corresponding to the acoustic features.
Preferably, the speech recognition service request includes: a service request for long-dialog recognition and/or a service request for single-sentence recognition; judging, according to the domain classification of corpora, the domain to which the corpus of the current sentence belongs, as the domain information corresponding to the acoustic features, specifically includes: when the speech recognition service request is a one-time request for single-sentence recognition, judging the domain to which the corpus of the current sentence belongs directly according to the domain classification of corpora; and/or, when the speech recognition service request is a request for long-dialog recognition, extracting a first feature from the corpus of the current sentence according to the recognition information of the previous sentence, fusing the first feature with the features extracted directly from the corpus of the current sentence, and then judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora; wherein the previous sentence is the sentence before the current sentence.
Preferably, the domain classification of corpora includes: setting labels corresponding to the corpora of each domain to identify the domain classification of the corpora; and/or, the recognition information of the previous sentence includes: the state vector of the previous sentence that has already occurred and been recognized; and/or, judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora specifically includes: judging, by the trained corpus domain classifier, according to the domain classification of corpora.
Preferably, training the corpus domain classifier specifically includes: constructing training data, including: in each domain, forming the corpora into a number of dialogs; randomly selecting several dialogs from several different domains and splicing them together to form a training sample; converting all the corpora in each training sample into the units corresponding to the acoustic model as the input of the corpus classifier, taking the texts of all the corpora in each training sample as the output of the corpus classifier, and labeling the junction points between different domains in each training sample as change points; and training the pre-constructed corpus domain classifier, specifically including: performing first-feature extraction on the input according to the state at the current time t, and calculating a loss function for judging whether the corpus domain has changed, wherein the state at the current time t is the state vector returned at time t-1, which represents the state vector of the sentence at time t-1 that has already occurred and been recognized; performing feature extraction on the input, fusing with the first feature, and calculating a loss function for judging the domain to which the corpus in the input belongs; and determining whether a predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target, and if so, ending the training of the corpus domain classifier.
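A minimal sketch of this training-data construction, assuming a hypothetical data layout in which dialogs are grouped by domain and each dialog is a list of (units, text) pairs; the splicing and change-point labeling follow the description above:

```python
import random

def build_classifier_sample(dialogs_by_domain: dict, n_domains: int = 3):
    """Splice dialogs from several randomly chosen domains into one sample."""
    chosen = random.sample(list(dialogs_by_domain), k=n_domains)
    inputs, targets, change_labels = [], [], []
    for i, domain in enumerate(chosen):
        dialog = random.choice(dialogs_by_domain[domain])
        for j, (units, text) in enumerate(dialog):
            inputs.append(units)      # acoustic-model units: classifier input
            targets.append(text)      # corpus text: classifier output
            # label the junction point where the domain changes
            change_labels.append(1 if (i > 0 and j == 0) else 0)
    return inputs, targets, change_labels
```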
Preferably, the deep-learning-based dynamic language model specifically includes: a neural network model built with an embedded layer (Embedding) to represent the language model structure corresponding to each domain; and/or, switching, with the deep-learning-based dynamic language model, to the domain corresponding to the domain information to recognize the acoustic features so as to determine the text recognition result corresponding to the speech to be recognized specifically includes: converting the domain information and then performing feature coding, and fusing the feature code with the features extracted from the acoustic features; and, according to the fused features, switching the deep-learning-based dynamic language model to the language algorithm corresponding to the domain indicated by the feature code and performing feature decoding on the fused features, so as to obtain the predicted text recognition result corresponding to the speech to be recognized.
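A minimal sketch, assuming PyTorch, of the idea that one Embedding-conditioned model can stand in for per-domain language models; the layer sizes, the GRU decoder, and fusion by concatenation are illustrative assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class DynamicLanguageModel(nn.Module):
    def __init__(self, n_units, n_domains, n_chars, dim=256):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)      # acoustic-model units
        self.domain_emb = nn.Embedding(n_domains, dim)  # domain feature coding
        self.decoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_chars)              # character logits

    def forward(self, unit_ids, domain_id):
        # fuse the domain code with the features extracted from the units
        u = self.unit_emb(unit_ids)                     # (B, T, dim)
        d = self.domain_emb(domain_id)                  # (B, dim)
        d = d.unsqueeze(1).expand(-1, u.size(1), -1)    # broadcast over time
        h, _ = self.decoder(torch.cat([u, d], dim=-1))  # feature decoding
        return self.out(h)                              # (B, T, n_chars)
```

Switching domains is then just feeding a different domain_id; no second model is loaded.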
Preferably, the method further includes: constructing training data to train the dynamic language model; wherein constructing the training data includes: randomly selecting individual sentence corpora in each domain and taking each sentence corpus as a training sample; converting all the corpora in each training sample into the units corresponding to the acoustic model as the input of the dynamic language model, taking the texts of all the corpora in each training sample as the output of the dynamic language model, and labeling the domain to which the corpus of each training sample belongs; and wherein training the dynamic language model includes: performing information conversion on the labeled domain of each training sample and then performing feature coding; extracting features from the input used to train the dynamic language model and fusing them with the feature code; switching the dynamic language model to the language algorithm corresponding to the domain indicated by the feature code in the fused features and performing feature decoding on the fused features; calculating a loss function from the output used to train the dynamic language model and the feature-decoded text; and determining whether a predetermined number of training iterations has been reached and/or whether the loss function has reached its convergence target, and if so, ending the training of the dynamic language model.
In order to solve the above technical problem, a second aspect of the present invention provides a speech recognition system, including: a judging device arranged at the front end, and a language recognition service device connected with the judging device and arranged at the back end; the judging device is used for judging, in one service request, the domain information corresponding to the acoustic features of the speech to be recognized according to the domain classification of corpora; and the language recognition service device is used for recognizing the acoustic features after switching, through a deep-learning-based dynamic language model, to the domain corresponding to the domain information, so as to determine the text recognition result corresponding to the speech to be recognized.
Preferably, the system further includes: in the one service request, converting the speech to be recognized into the units corresponding to an acoustic model so as to construct the corpus of the current sentence, as the acoustic features; and judging the domain information corresponding to the acoustic features of the speech to be recognized according to the domain classification of corpora specifically includes: judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora, as the domain information.
Preferably, the speech recognition service request includes: a service request for long-dialog recognition and/or a service request for single-sentence recognition; the judging device is specifically configured to: when the speech recognition service request is a one-time request for single-sentence recognition, classify directly according to the set domain classification of corpora and output the domain corresponding to the corpus of the current sentence, as the domain information, to the language recognition service device; and, when the speech recognition service request is a request for long-dialog recognition: first, perform first-feature extraction on the corpus of the current sentence according to the recognition information of the previous sentence, and output the first feature and the judgment result of whether the corpus domain has changed, wherein the previous sentence is the sentence before the current sentence; and, after fusing the received first feature with the features extracted from the corpus of the current sentence, judge the domain to which the corpus of the current sentence belongs according to the set domain classification of corpora and output it, as the domain information, to the language recognition service device.
Preferably, the domain classification of corpora includes: setting labels corresponding to the corpora of each domain to identify the domain classification of the corpora; and/or, the recognition information of the previous sentence includes: the state vector of the previous sentence that has already occurred and been recognized; and/or, the judging device specifically further: judges the domain to which the corpus belongs, with the trained corpus domain classifier, according to the domain classification of corpora.
Preferably, training the corpus domain classifier specifically includes: constructing training data, including: in each domain, forming the corpora into a number of dialogs; randomly selecting several dialogs from several different domains and splicing them together to form a training sample; converting all the corpora in each training sample into the units corresponding to the acoustic model as the input of the corpus classifier, taking the texts of all the corpora in each training sample as the output of the corpus classifier, and labeling the junction points between different domains in each training sample as change points; wherein the constructed corpus domain classifier includes a corpus domain change judgment module and a corpus domain classification module, the two modules being connected; and training the corpus domain classifier with the training data, including: in the corpus domain change judgment module, performing first-feature extraction on the input according to the state at the current time t, and calculating a loss function for judging whether the corpus domain has changed, wherein the state at the current time t is the state vector returned at time t-1, which represents the state vector of the sentence at time t-1 that has already occurred and been recognized; in the corpus domain classification module, performing feature extraction on the input, fusing with the first feature, and calculating a loss function for judging the domain to which the corpus in the input belongs; and determining whether a predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target, and if so, ending the training of the corpus domain classifier.
Preferably, the deep-learning-based dynamic language model specifically includes: a neural network model built with an embedded layer (Embedding) to represent the language model structure corresponding to each domain; and/or, the language recognition service device includes: a domain information decoding module and a language model module, connected with each other; the domain information decoding module is used for converting the domain information, then performing feature coding, and providing the feature code to the language model module; the language model module is used for fusing the feature code with the features extracted from the acoustic features, switching, through the trained deep-learning-based dynamic language model and according to the fused features, to the language algorithm corresponding to the domain indicated by the feature code, and performing feature decoding on the fused features, so as to obtain the predicted text recognition result corresponding to the speech to be recognized.
Preferably, the system further includes: constructing training data to train the dynamic language model; wherein constructing the training data specifically includes: randomly selecting individual sentence corpora in each domain and taking each sentence corpus as a training sample; converting all the corpora in each training sample into the units corresponding to the acoustic model as the input of the dynamic language model, taking the texts of all the corpora in each training sample as the output of the dynamic language model, and labeling the domain to which the corpus of each training sample belongs; and wherein training the dynamic language model specifically includes: in the domain information decoding module, performing information conversion on the labeled domain of each training sample and then performing feature coding; in the language model module, extracting features from the input used to train the dynamic language model and fusing them with the feature code, switching the dynamic language model to the language algorithm corresponding to the domain indicated by the feature code in the fused features and performing feature decoding on the fused features, and calculating a loss function from the output used to train the dynamic language model and the feature-decoded text; and determining whether a predetermined number of training iterations has been reached and/or whether the loss function has reached its convergence target, and if so, ending the training of the dynamic language model.
In order to solve the above technical problem, a third aspect of the present invention proposes an electronic device, which comprises a processor and a memory storing computer-executable instructions that, when executed, cause the processor to perform the method proposed by the first aspect.
In order to solve the above technical problem, a fourth aspect of the present invention proposes a computer-readable storage medium storing one or more programs which, when executed by a processor, implement the method proposed by the first aspect.
According to the embodiments of the invention, the dynamic recognition mode in which front-end judgment logic controls a back-end deep-learning-based dynamic language model according to domain information can adapt to recognition services in different domains, simplifies and reduces the number of models, lowers the configuration requirements on servers and service engines, and allows a small number of models, or a single model, to match the actual concurrency. Furthermore, combining the dynamic model with domain information means that only one model is needed to provide all services with hot switching of the recognition service, entering directly into the computation of the speech recognition service of the matched domain; that is, a neural network model with an embedded-layer (Embedding) structure performs the multi-language-model switching and the recognition computation, which reduces the need for manual maintenance and lowers labor cost. Furthermore, controlling the dynamic language model by combining front-end judgment logic with domain information copes effectively with long-dialog recognition, especially continuous recognition across multiple domains within one long dialog: the domain of the currently recognized speech is judged from the dialog that has occurred, and the corresponding recognition results of the different domains are provided, thereby effectively realizing cross-domain recognition.
Drawings
In order to make the technical problems solved, technical means adopted and technical effects achieved by the present invention clearer, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It is to be noted, however, that the drawings described below are only drawings of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive effort.
FIG. 1 is a diagram illustrating an example of the prior-art process in which an acoustic model extracts acoustic features and different language models translate them into output text;
FIG. 2 is a block diagram of an example of a prior art deployment of speech recognition services for speech recognition;
FIG. 3 is a detailed flow diagram of one embodiment of a speech recognition method according to the present invention;
FIG. 4 is a block diagram of one embodiment of a speech recognition system according to the present invention;
FIG. 5 is a block diagram of one embodiment of an architecture for a speech recognition service deployment in accordance with the speech recognition scheme of the present invention;
FIG. 6 is a diagram illustrating the long-dialog recognition effect of one embodiment of a speech recognition scheme according to the present invention;
FIG. 7 is a schematic diagram illustrating one embodiment of a speech recognition scheme in accordance with the present invention;
FIG. 8 is a schematic diagram of one embodiment of the models used by the component modules of the speech recognition scheme of the present invention;
FIG. 9 is a schematic diagram of one embodiment of model training for component modules in accordance with the speech recognition scheme of the present invention;
FIGS. 10, 11 are schematic illustrations of one embodiment of an application service of a speech recognition scheme according to the present invention;
FIG. 12 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 13 is a schematic diagram of one logical illustrative embodiment of a computer readable medium in accordance with the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In the description of the specific embodiments, the details of construction, performance, effects, or other characteristics are set forth in order to provide a thorough understanding of the embodiments for one skilled in the art. However, it is not excluded that a person skilled in the art may carry out the invention in a specific case in a solution that does not contain the above-mentioned structures, properties, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and repetitive description thereof may therefore be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
In the conventional method, a number of language models for realizing speech recognition are trained, for example 8, 10 or 15; with, say, 10 models, 10 ports are opened for service, and the front end selects a port by domain. In one embodiment of the speech recognition method of the present invention, when the speech recognition service is deployed, the front end provides the domain judgment and the back end provides a dynamic language model based on the Embedding structure, which is hot-switched to the corresponding language algorithm/model according to the domain to complete the translation, i.e., the recognition service. In this embodiment, the domain information provided or judged at the front end may specifically come from a corpus domain classifier, and the computation in the back-end recognition service may specifically be performed by the deep-learning-based dynamic language model switching, according to the domain information, to the language algorithm/model/method of the corresponding, matching domain. The deep learning of this dynamic language model uses an embedded layer (Embedding) to represent the model structures of the various domains; that is, the dynamic language model as a whole is a multi-classifier with an Embedding structure, and there is no need to provide many different, separate language models selected through different ports.
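The contrast between the two deployments can be sketched in plain Python (the server names, ports and routing helpers below are made up for illustration): the conventional route keeps one served model per domain and port, while the proposed route keeps a single dynamic model and hot-switches by domain id.

```python
# Conventional: the front end routes each request to a per-domain server/port.
DOMAIN_PORTS = {0: "lm-general:9000", 1: "lm-epidemic:9001"}  # one model each

def conventional_route(domain_id: int, units: str) -> str:
    return f"forward '{units}' to {DOMAIN_PORTS[domain_id]}"

# Proposed: one dynamic model serves all domains; the domain id selects the
# embedding/algorithm inside the single model (hot switching, no extra server).
def dynamic_route(domain_id: int, units: str, dynamic_lm) -> str:
    return dynamic_lm(units, domain_id)
```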
[ Example 1 ]
The following describes a speech recognition implementation process of the present invention with reference to a main flowchart of an embodiment of a speech recognition method according to the technical solution of the present invention shown in fig. 3.
Step S110: according to the speech recognition service request, obtain the acoustic features of the speech to be recognized and the domain information corresponding to those acoustic features.
In one embodiment, in a speech recognition service request, the speech to be recognized may be converted into the corpus of the current sentence in the units corresponding to the acoustic model, i.e., the acoustic features, and the features (for example, feature vectors) of the corpus of the current sentence may be extracted; the domain to which the corpus of the current sentence belongs is then judged according to the domain classification of corpora, as the domain information corresponding to the acoustic features.
Here, the result obtained by the acoustic model is the corpus of a sentence, which is an ordered, context-bearing combination of a number of acoustic features; extracting features from this input result, i.e., from the constructed corpus of the sentence, still amounts to extracting acoustic features, and the extracted features may also take the form of feature vectors, which is convenient for the processing of the subsequent language algorithm.
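For concreteness, a tiny sketch (the unit inventory is hypothetical) of how a sentence becomes a corpus in acoustic-model units, an ordered id sequence that the subsequent language algorithm can consume:

```python
# Hypothetical unit inventory: pinyin+tone strings mapped to integer ids.
UNIT_IDS = {"jian3": 0, "yi4": 1, "gong1": 2, "zuo4": 3}

def corpus_to_ids(units: str) -> list:
    """Convert a space-separated unit string into an ordered id sequence."""
    return [UNIT_IDS[u] for u in units.split()]

print(corpus_to_ids("jian3 yi4 gong1 zuo4"))  # [0, 1, 2, 3]
```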
Specifically, the speech recognition service request (abbreviated as service request) at least includes: a service request for long-dialog recognition, and/or a service request for single-sentence recognition.
Judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora, as the domain information corresponding to the acoustic features, may be handled differently for the two kinds of service request, for example: when the speech recognition service request is a one-time request for single-sentence recognition, the domain of the corpus of the current sentence is judged directly according to the domain classification of corpora; and/or, when the speech recognition service request is a request for long-dialog recognition, a first feature is extracted from the acoustic features, i.e., the corpus of the current sentence, according to the recognition information of the previous sentence, the first feature is fused with the acoustic features, and the domain to which the corpus of the current sentence belongs is then judged according to the domain classification of corpora; wherein the previous sentence is the sentence before the current sentence.
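A sketch of the two request paths; `classifier` is a placeholder for the corpus domain classifier described below, and its method names are assumptions:

```python
def judge_domain(classifier, corpus, prev_state=None):
    """Return (domain, new_state); prev_state is None for single-sentence requests."""
    if prev_state is None:
        # one-time single-sentence request: classify the corpus directly
        return classifier.classify(corpus), None
    # long-dialog request: extract the history-aware first feature from the
    # previous sentence's recognition state, fuse it with the current corpus
    # features, then classify
    first_feature, new_state = classifier.extract_first_feature(corpus, prev_state)
    domain = classifier.classify_fused(first_feature, corpus)
    return domain, new_state
```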
Further, the domain classification of corpora may at least include: setting labels corresponding to the corpora of each domain to identify the domain classification of the corpora.
Further, the identification information of the previous sentence may include: the state vector of the previous sentence that has already occurred and been identified.
Further, judging the domain to which the corpus of the current sentence belongs according to the domain classification of corpora may specifically include: judging with the trained corpus domain classifier according to the domain classification of corpora.
In an embodiment, the trained corpus domain classifier may be obtained by training the pre-constructed corpus domain classifier with specific training data/training samples/training objects.
Preferably, training data may be constructed first, for example: in each corpus domain, the corpora of that domain are formed into a number of dialogs; dialogs from several different domains are randomly selected and spliced together to form a training sample (training object). Further, all the corpora are converted into the units corresponding to the acoustic model, namely the recognition results (acoustic features) of the acoustic model, and used as the input of the corpus classifier; the texts of all the corpora are used as the output of the corpus classifier; and the junction points between different domains in each training sample are labeled as change points.
Preferably, the corpus domain classifier is trained with the aforementioned training data, for example: according to the state at the current time t, a first feature (e.g., a feature vector) is extracted from the input I1 (as shown in figs. 7 to 9), and a loss function for judging whether the corpus domain has changed is calculated, where the state at the current time t is the state vector returned at time t-1, representing the state vector of the dialog that has already occurred; the features (feature vectors) of the input acoustic features are extracted in combination with the state vector, fused with the first feature, and a loss function is calculated for judging the domain to which the corpus in the input (the corpus of the current sentence to be recognized, i.e., the acoustic features) belongs; and it is determined whether the predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target, and if so, the training of the corpus domain classifier ends.
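A minimal sketch of such a classifier (whole 1, per the figures), assuming PyTorch; mean-pooled embeddings, a GRUCell for the state vector h, and concatenation for the feature fusion are assumptions made to keep the example small:

```python
import torch
import torch.nn as nn

class CorpusDomainClassifier(nn.Module):
    def __init__(self, n_units, n_domains, dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_units, dim)
        self.state_cell = nn.GRUCell(dim, dim)   # carries h from sentence to sentence
        self.change_head = nn.Linear(dim, 2)     # loss 1: did the domain change?
        self.domain_head = nn.Linear(2 * dim, n_domains)  # loss 2: which domain?

    def forward(self, unit_ids, h):
        feat = self.emb(unit_ids).mean(dim=1)    # pooled corpus feature (I22)
        first_feat = self.state_cell(feat, h)    # history-aware first feature (O12)
        change_logits = self.change_head(first_feat)      # O11
        fused = torch.cat([first_feat, feat], dim=-1)     # feature fusion ("+")
        domain_logits = self.domain_head(fused)           # O2
        return change_logits, domain_logits, first_feat   # first_feat is the next h

def train_step(model, unit_ids, h, change_target, domain_target, opt):
    change_logits, domain_logits, new_h = model(unit_ids, h)
    # the two losses named in the text: change judgment + domain classification
    loss = (nn.functional.cross_entropy(change_logits, change_target)
            + nn.functional.cross_entropy(domain_logits, domain_target))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item(), new_h.detach()  # detach: truncate backprop across sentences
```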
An application example is described with reference to the speech recognition service deployment pattern shown in FIG. 5. In this example, to solve the problems of high service cost, serious waste of computing resources, and the difficulty of multi-domain cross recognition in continuous long-dialog recognition, a method based on a deep-learning dynamic language model is provided: the front end provides domain information to the model, and the back-end model provides recognition services for different domains according to that domain information. For the back end to switch to the recognition service of the corresponding domain, the domain information of the corpus is therefore important; that is, the front end acquires/judges the domain information automatically or manually, and automatic acquisition/judgment in particular can adapt more effectively to continuous recognition across multiple domains in a long dialog.
A specific example: first, the input speech to be recognized is split into a corpus according to the units corresponding to the acoustic model used for extracting acoustic features; in other words, the corpus, namely the result (acoustic features) extracted by the acoustic model, is constructed for language recognition.
The acoustic model may be any of various known acoustic models of the prior art, such as acoustic models built with hidden Markov models (HMM, DNN-HMM, and the like); it is mainly used to extract (e.g., calculate frame by frame) the acoustic features of the speech, i.e., to construct, in a specific structure, the corpus of a sentence of the speech to be recognized. In this way, the corpus of each sentence of speech to be recognized in the service request is in the units corresponding to the matching acoustic model.
In long-dialog recognition, the domain to which the current speech to be recognized belongs, i.e., the domain of the corpus converted from that speech, may be determined from the dialog that has already occurred. This makes it possible to provide results for different domains within one long dialog, i.e., to distinguish the different domains crossed in a long dialog. In the long-dialog example shown in fig. 6, one service contains several sentences of dialog. The front part ("Hello, I want to ask about a sign-off" … "your recent shop activity") consists of dialog in domain A, related to trading goods; the rear part ("your recent epidemic prevention situation" … "your recent quarantine work situation") consists of dialog in domain B, related to epidemic prevention and control. Within one service, the recognition of each sentence is combined with the recognition judgment of the previous sentence to determine its domain: when the speech or corpus of the current sentence is to be recognized, the domain to which it belongs is judged first, and that judgment uses the recognition information of the sentence before the current one. For example, if the previous sentence is "your recent shop activity" and the current sentence is "your recent epidemic prevention situation", then, according to the scheme of the present invention, it is judged that the domain of the current sentence has changed (it is no longer domain A) and, further, that it has changed to domain B.
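Reusing the CorpusDomainClassifier sketch above, the FIG. 6 scenario runs sentence by sentence with the state vector carried forward (random ids stand in for real unit sequences):

```python
import torch

model = CorpusDomainClassifier(n_units=500, n_domains=2)  # from the sketch above
h = torch.zeros(1, 128)                                   # initial state vector
sentences = [torch.randint(0, 500, (1, 12)) for _ in range(4)]  # dummy dialog
for units in sentences:
    change_logits, domain_logits, h = model(units, h)
    changed = change_logits.argmax(-1).item()  # 1 at the A-to-B junction (ideally)
    domain = domain_logits.argmax(-1).item()   # 0 = domain A, 1 = domain B
    print(changed, domain)
```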
The following describes the processing of a received speech service request with reference to the specific implementation, shown in figs. 7, 8 and 9, of an embodiment that constructs and trains the corpus domain classifier and the language model applied by the solution of the present invention. In one embodiment, the speech to be recognized is recognized in the form of one sentence of dialog at a time: a speech service request for a single-sentence dialog can be recognized directly, and a long dialog of several sentences can be recognized sentence by sentence.
Further, when a service request is received, each sentence of the speech to be recognized is converted into the corpus in the units of the acoustic model used, i.e., the acoustic features of the speech to be recognized are extracted through the acoustic model. For the corpus of the current sentence to be recognized, its domain classification can then be determined, i.e., it is judged in which domain the current sentence of speech is a dialog, for example domain A or domain B, and the features (e.g., feature vectors) of the acoustic features can also be extracted. The labels corresponding to the corpora of each domain are preset, for example: the label of domain 1 is 0, the label of domain 2 is 1, and so on. Setting labels in this way distinguishes the different domains, which facilitates switching the subsequent language model to the corresponding domain and completing the final recognition through the corresponding language model/algorithm.
In one embodiment, for a single-sentence speech recognition service, the domain of the corpus of the sentence may be judged directly (for example, classified directly by the corpus domain classifier) so as to provide the domain information for the recognition of the subsequent language algorithm; for a single sentence, additionally input domain information may even be introduced so that the recognition of the subsequent language algorithm is entered directly.
In another embodiment, for a long-dialog speech recognition service, the constructed corpus domain classifier is combined with the recognition of the sentence preceding the current sentence to judge the domain to which the corpus of the current sentence belongs.
In one embodiment, the constructed corpus area classifier is used after being trained by specific training data, for example, deployed on the line after training.
One example of training the corpus domain classifier is to train it as a whole, for example training the whole (hereinafter, whole 1) formed by combining the corpus domain change judgment and the corpus domain classification.
Specifically, one example of constructing training data: the corpora of each domain are formed into a number of dialogs, and dialogs from several domains are randomly selected and spliced together as a training object (training sample). All the corpora are converted into the units corresponding to the acoustic model as the input, the texts of the corpora are used as the output result, and the junction points between different domains are labeled as change points.
Further, taking as an example the structural framework, shown in fig. 7, of an embodiment of the scheme of the present invention in application, the training of whole 1 is specifically described: whole 1 comprises at least a corpus domain change judgment module and a corpus domain classification module, and the two modules/parts are connected first; that is, the two parts are connected to form whole 1 by connecting the intermediate output O12 of the corpus domain change judgment part to the input I21 of the corpus domain classification module, after which whole 1 is trained.
Specifically: the input h of the corpus domain change judgment part represents the state of the feature extractor at time t; it is the state vector of the dialog that has already occurred, obtained as the state vector returned by the corpus domain change judgment part at time t-1. In other words, at each time step the state vector corresponding to the current time is returned and passed to the next time step for use in prediction. I denotes input, O denotes output, δ denotes classification judgment, and + denotes feature fusion.
Here, I1 is the input corpus of the current sentence of dialog to be recognized, i.e., the constructed corpus of the current sentence (the acoustic features), conforming to the units of the acoustic model used for extracting acoustic features. O11 is the output of the classification judgment, and O12 is the output of feature extraction on the corpus of the current sentence according to the state h, i.e., the first feature. The intermediate output O12 serves as the input I21 (the first feature), and this input is fused directly with I22, the acoustic features, i.e., the features extracted from the corpus of the current sentence; in other words, the first feature vector and the feature vector of the extracted corpus are fused, and classification judgment then yields the output O2.
During training, for example as shown in figs. 8 and 9, whole 1 is trained with the constructed training data: spliced and change-labeled samples whose input is the corpora around the domain junction points and whose output is the corresponding texts, over which the loss functions are calculated. The loss function may consist of two parts, at least including: a loss function for judging whether the domain to which the corpus belongs has changed, and a loss function for judging the domain to which the corpus in the input belongs. For the former, the features of the corpus of the current sentence are extracted in combination with the state vector h and the loss function judging whether the domain has changed is calculated; for the latter, the extracted corpus features are fused with the first feature and the loss function for judging the domain to which the corpus in the input belongs is calculated.
In training, following the above example, the output O11 of the corpus domain change judgment part of whole 1 is used to calculate the loss function for judging whether the domain of the corpus has changed, and the output O2 of whole 1 during training, i.e., the output of the classification judgment part that fuses the first feature with the acoustic features, is used to calculate the loss function for judging the domain to which the corpus in the input belongs. Both loss functions are used when judging the domain of the corpus to be recognized and when training whole 1, and they determine the prediction quality of the classifier/classification model. The loss function loss can generally adopt L1, L2, cross entropy, and the like, as used in deep learning.
In one embodiment, during training, whether to end training may be determined according to whether a predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target: if so, the training of the corpus domain classifier ends, yielding the trained classifier (whole 1); if not, iteration/training continues until the predetermined number of iterations or the convergence target is reached. Further, when the trained classifier is actually applied to speech recognition, its output is the judged domain corresponding to the corpus, i.e., O2, and O2 is fed, as the domain information I3 of the domain information decoding module (shown in figs. 7 to 9), into that module for information conversion and feature coding, so that the domain information is used in the recognition process of the subsequent language model.
And step S120, switching to a domain corresponding to the domain information based on the dynamic language model of deep learning to recognize the acoustic features so as to determine a text recognition result corresponding to the speech to be recognized.
In one embodiment, when the model is built, the deep-learning-based dynamic language model may use an embedded layer (Embedding) to represent, within one neural network model, the language model structure corresponding to each domain. A neural network model with an Embedding structure can switch to the corresponding language algorithm for the recognition operation according to different feature codes, so only one model needs to be provided. Deployed as shown in fig. 5, this one model corresponds to the different language models through different codes: the front end judges the domain to which the corpus of the speech to be recognized belongs and provides the judged domain information to the single back-end model, which switches to the corresponding code and executes the corresponding algorithm, thereby outputting the recognition result. Further, any of various known neural network models using the Embedding structure may be adopted, which will not be detailed here.
In one embodiment, switching to the corresponding domain for recognition with the deep-learning-based dynamic language model may specifically be: converting the domain information and then performing feature coding, and fusing/combining the feature code (e.g., a feature vector) with the result obtained from the acoustic model, namely the acoustic features, i.e., the features extracted from the corpus of the current sentence; and switching the deep-learning-based dynamic language model to the language model corresponding to the domain indicated by the feature code, performing feature decoding on the fused features, taking the result as the text recognition result corresponding to the speech to be recognized, and returning the corresponding recognition information.
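Tying this together with the DynamicLanguageModel sketch above; greedy argmax decoding is an assumption, since the patent does not fix a decoding strategy:

```python
import torch

lm = DynamicLanguageModel(n_units=500, n_domains=2, n_chars=4000)
unit_ids = torch.randint(0, 500, (1, 10))  # corpus of the current sentence (I42)
domain_id = torch.tensor([1])              # judged domain information (I3 -> I41)
logits = lm(unit_ids, domain_id)           # fuse domain code with unit features
char_ids = logits.argmax(dim=-1)           # feature decoding -> text ids (O4)
```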
In one embodiment, the built language model needs to be trained before being deployed. Reference may be made to the examples, shown in figs. 7 to 9, of the specific deployment of the judgment and language recognition services and of training the corpus domain classifier and the language model. Specifically, training data for the language model may be prepared in advance: the corpora of each domain constitute a number of dialogs; individual sentence corpora are randomly selected in each domain, and each sentence corpus is taken as a training object/sample. Then, as in the training of the classifier, all the corpora are converted into the units corresponding to the acoustic model, i.e., the corpus (acoustic features) of a sentence is constructed and used as the model input during training, with the text of the corpus as the model output/result; and the domain to which each sentence of corpus belongs is labeled, so that the domain of the corpus of each training sample is annotated.
Further, in a specific example, the language model as a whole may be divided into a domain information decoding part and a language model (language algorithm) part that performs the specific operations, such as the domain information decoding module and the language model module shown in figs. 7 to 9. In the domain information decoding module, I3 is the input corresponding domain information, and O3 is the output after information conversion and feature coding. The language model module has two inputs, I41 and I42, which are fused and then feature-decoded; that is, after the operation it outputs O4, the recognition result of the acoustic features received from the acoustic model, e.g., the text of the corpus. Further, the two parts/modules can be connected to form the deep-learning-based dynamic language model as a whole, e.g., whole 2, which is then trained. The connection is, for example, as shown in fig. 8 or 9: the output O3 serves as the input I41, i.e., the domain feature code is provided to the dynamic language model as the basis or condition for switching to the language algorithm corresponding to that code when performing operations such as decoding. Further, during training, whole 2 may be trained with the training data constructed for the language model. Specifically: the labeled domain of each training sample for training the dynamic language model is information-converted and then feature-coded (i.e., the feature vector of the corresponding domain is obtained); the feature code is fused/combined with the training input I42 of the neural network model and feature decoding is then performed, i.e., text recognition is carried out after switching to the corresponding domain; and a loss function loss is calculated from the output of the neural network model and the feature-decoded text to determine the model's prediction quality. It is further determined whether a predetermined number of training iterations has been reached and/or whether the loss function has reached its convergence target, and if so, the training of the neural network model ends.
The description continues with reference to the schematic diagrams, shown in figs. 7 to 9, of the classifier and language model structures, the model/module connections, and the model training in an application example of the solution of the invention. Whole 2 is mainly the deep-learning-based dynamic language model part, i.e., the domain information decoding module and the language model module; whole 2 is formed after these modules are connected, with the output O3 of the domain information decoding module serving as the language model input I41. The input domain information I3 is information-converted, and the extracted feature code is input into the language model, where it serves as the domain fusion condition of the corresponding corpus, so that decoding is completed and the text corresponding to the corpus is recognized with the corresponding language algorithm. The input I42 of the language model is in the units corresponding to the acoustic model, i.e., the result (acoustic features) obtained from the acoustic model; the language model part extracts the feature vector of these acoustic features and combines/fuses it with the input domain, i.e., judges in which domain the feature vector of the corpus of the current sentence should be feature-decoded, thereby outputting O4, the text result predicted to correspond to the current sentence. During the training of whole 2, each sentence of corpus in the training data or training samples is converted into the units corresponding to the acoustic model and used as input, the information of each domain is provided as well, the language model extracts the features (vectors) of the corpus, the text corresponding to the corpus in the training sample is used as the output result, and the loss function loss calculated from the output O4 in combination with the labeled domain drives the iterative training that determines the model's prediction quality. Further, as with the training of the classifier described above, the loss function loss may be any loss function known from training models. Further, during training, whether to end training can be determined according to whether the predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target: if so, the training ends, yielding the trained dynamic language model (whole 2); if not, iteration/training continues until the predetermined number of iterations or the convergence target is reached.
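A sketch of one way to train whole 2 under the DynamicLanguageModel sketch above; the Adam optimizer, the batching, and the cross-entropy target layout are assumptions (the text only requires a standard deep-learning loss such as cross entropy):

```python
import torch
import torch.nn as nn

def train_whole2(lm, samples, n_epochs=10, lr=1e-3):
    """samples: iterable of (unit_ids, domain_id, char_targets) tensor tuples."""
    opt = torch.optim.Adam(lm.parameters(), lr=lr)
    for _ in range(n_epochs):
        for unit_ids, domain_id, char_targets in samples:
            logits = lm(unit_ids, domain_id)          # feature decoding (O4)
            loss = nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)),  # (B*T, n_chars)
                char_targets.reshape(-1))             # labeled corpus text ids
            opt.zero_grad()
            loss.backward()
            opt.step()
    return lm
```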
Further, when the trained whole 2 is actually applied to speech recognition, its output is the text O4 determined by decoding the corpus, and O4 is the result of the recognition service deployed online; that is, the text result of the corresponding long dialog or short dialog (e.g., a single sentence) is output according to the speech recognition service request, with text prediction completed during decoding/recognition by using the domain information in combination with the language algorithm of that domain.
[ Example 2 ]
The implementation of the present invention will be further explained with reference to the block diagram, shown in fig. 4, of an embodiment of the speech recognition system according to the present invention, and to the schematic illustrations, shown in figs. 5 to 9, of the recognition service deployment principle, the long-dialog example, the module construction and model building, the model/module connections, and the model training applied in an embodiment of the technical solution of the present invention.
In one embodiment, the system may include: a judging device 410 arranged at the front end, and a language recognition service device 420 connected with the judging device 410 and arranged at the back end. The judging device 410 is configured to determine, for a service request, the domain to which the corpus of the current sentence (converted from the speech to be recognized) belongs, according to the set domain classification of the corpus, and to use that domain as the domain information corresponding to the corpus of the current sentence. The language recognition service device 420 is configured to recognize the corpus of the current sentence after switching a deep-learning-based dynamic language model to the domain corresponding to the domain information, so as to determine the text recognition result corresponding to the speech to be recognized.
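For orientation, the following minimal Python sketch shows how the two devices might be wired together. The class and method names, and the interfaces of the judging and recognition callables, are illustrative assumptions rather than the patent's actual implementation:

```python
class SpeechRecognitionSystem:
    """Minimal sketch of the front-end/back-end split described above;
    `judge` and `recognizer` stand for the judging device 410 and the
    language recognition service device 420 (interfaces are assumptions)."""
    def __init__(self, judge, recognizer):
        self.judge = judge            # front end: domain judgment
        self.recognizer = recognizer  # back end: dynamic language model

    def handle_request(self, units, prev_state=None):
        # units: the corpus of the current sentence in acoustic-model units
        domain_id, changed, state = self.judge(units, prev_state)
        text = self.recognizer(units, domain_id)   # switch to judged domain
        return text, state                         # state feeds the next sentence
```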
Here, the result obtained by the acoustic model is the corpus of a sentence, i.e. an ordered combination of several acoustic features with context. Feature extraction on this input result, i.e. on the constructed corpus of the sentence, still operates on the actually extracted acoustic features; further, the extracted acoustic features may take the form of feature vectors to facilitate processing by the subsequent language algorithm.
Specifically, the corpus of the current sentence converted from the speech to be recognized is a unit corresponding to the acoustic model, that is, acoustic features organized in sentence form; feature vectors are then extracted from the corpus of the current sentence, i.e. several main feature vectors are obtained from the acoustic features produced by the acoustic model.
Further, the speech recognition service request includes: a service request for long-dialog recognition, and/or a service request for short-dialog/single-sentence recognition.
Further, the determining device 410 may be configured as follows (an illustrative routing sketch is given after this paragraph). When the speech recognition service request is a one-off request for single-sentence recognition, it directly classifies the corpus according to the set domain classification and outputs the domain to which the corpus of the current sentence belongs as the domain information to the language recognition service device 420. When the speech recognition service request is a request for long-dialog recognition, it extracts a first feature from the corpus of the current sentence according to the recognition information of the previous sentence (i.e. a feature vector obtained under the state vector of the previous sentence) and outputs both the first feature and a judgment of whether the corpus domain has changed; here the previous sentence is the sentence before the current sentence (time t-1 immediately preceding time t). It then fuses/combines the received first feature with the feature extracted from the acoustic features and, according to the set domain classification of the corpus, determines the domain to which the corpus of the current sentence belongs and outputs it as the domain information to the language recognition service device 420.
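As one illustrative reading of this routing logic, the sketch below distinguishes the two request types. The classifier interface, returning a change logit, domain logits, and a new state vector, is an assumption matching the module sketch given under Example 3 later in this document:

```python
def judge_domain(classifier, units, prev_state=None):
    """Illustrative routing of the judging device; `classifier` is assumed
    callable as classifier(units, state) ->
    (change_logit, domain_logits, new_state)."""
    if prev_state is None:
        # one-off single-sentence request: classify the corpus directly
        _, domain_logits, new_state = classifier(units, None)
        changed = False
    else:
        # long-dialog request: the first feature is extracted under the
        # previous sentence's state vector and fused with features of the
        # current corpus before domain classification
        change_logit, domain_logits, new_state = classifier(units, prev_state)
        changed = change_logit.item() > 0.0
    return int(domain_logits.argmax()), changed, new_state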
Further, the domain classification of the corpus may include: setting a label corresponding to the corpora of each domain, which identifies the domain classification of the corpus.
Further, the identification information of the previous sentence may include: the state vector of the previous sentence that has already occurred and been identified.
Further, the feature extracted from the corpus of the current sentence is a feature vector extracted from the acoustic features produced by the acoustic model.
Further, the determining device 410 may judge the domain to which the corpus belongs by using a pre-constructed, trained corpus domain classifier.
In one embodiment, training the corpus domain classifier specifically includes constructing training data, for example as follows: in each domain, the corpora are formed into several dialogs; several dialogs from several different domains are randomly selected and spliced together to form one training sample; all corpora are converted into units corresponding to the acoustic model, i.e. acoustic-model recognition results (acoustic features), to serve as the input of the corpus classifier; the texts of all corpora serve as the output of the corpus classifier; and the joint points between different domains within each training sample are given change labels. The constructed corpus domain classifier comprises a corpus domain change judgment module and a corpus domain classification module, with the two modules connected.
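A minimal sketch of this training-data construction might look as follows, assuming each dialog has already been converted by the acoustic model into (acoustic units, text) sentence pairs; all names and the exact sample layout are illustrative:

```python
import random

def build_classifier_samples(domain_dialogues, n_samples, k_dialogues):
    """domain_dialogues: {domain_id: [dialogue, ...]}; each dialogue is a list
    of (acoustic_units, text) sentence pairs produced by the acoustic model."""
    samples = []
    for _ in range(n_samples):
        # randomly pick dialogs from several different domains and splice them
        picked = [(d, random.choice(domain_dialogues[d]))
                  for d in random.sample(list(domain_dialogues), k_dialogues)]
        units_seq, texts, domain_ids, change_marks, prev = [], [], [], [], None
        for domain_id, dialogue in picked:
            for acoustic_units, text in dialogue:
                units_seq.append(acoustic_units)   # input of the corpus classifier
                texts.append(text)                 # output of the corpus classifier
                domain_ids.append(domain_id)       # per-sentence domain label
                # change label: True only at the joint point between two domains
                change_marks.append(prev is not None and prev != domain_id)
                prev = domain_id
        samples.append((units_seq, texts, domain_ids, change_marks))
    return samples
```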
Further, the corpus domain classifier may be trained using this training data, for example as follows: in the corpus domain change judgment module, first-feature extraction is performed on the input (the corpus of the current sentence, i.e. the result obtained by the acoustic model) according to the state at the current time t, and a loss function is calculated for judging whether the corpus domain has changed; the state at time t is the state vector returned at time t-1, representing the dialog that has already occurred. In the corpus domain classification module, features are extracted from the input and fused/combined with the first feature output by the corpus domain change judgment module, and a loss function is calculated for judging the domain to which the corpus in the input belongs. Whether the predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target is then determined, and if so, training of the corpus domain classifier ends.
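A correspondingly minimal training loop, assuming a classifier module of the kind sketched under Example 3 below (callable as model(units, state) -> (change_logit, domain_logits, new_state), with an init_state() helper, where units are (1, T, feat_dim) tensors) and using cross entropy as the loss; hyperparameters and the convergence test are placeholders:

```python
import torch
import torch.nn.functional as F

def train_whole1(model, samples, epochs=10, lr=1e-3, target_loss=1e-3):
    """Iteratively train the whole 1 on samples from build_classifier_samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # predetermined training count
        total = 0.0
        for units_seq, _texts, domain_ids, change_marks in samples:
            state = model.init_state()           # fresh state per spliced dialog
            for units, dom, chg in zip(units_seq, domain_ids, change_marks):
                change_logit, domain_logits, state = model(units, state)
                # loss 1: has the corpus domain changed at this sentence?
                loss_chg = F.binary_cross_entropy_with_logits(
                    change_logit.view(1), torch.tensor([float(chg)]))
                # loss 2: which domain does this sentence's corpus belong to?
                loss_dom = F.cross_entropy(domain_logits.unsqueeze(0),
                                           torch.tensor([dom]))
                loss = loss_chg + loss_dom
                opt.zero_grad()
                loss.backward()
                opt.step()
                state = state.detach()           # pass the state vector onward
                total += loss.item()
        if total / max(len(samples), 1) < target_loss:   # convergence target
            break
```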
In one embodiment, the deep-learning-based dynamic language model may include: a constructed neural network model that uses an embedding layer (Embedding) to represent the language model structure corresponding to each domain.
Further, the language recognition service device 420 may include: a domain information decoding module, which converts the domain information and then performs feature encoding; and a language model module, which fuses the feature code with the features extracted from the acoustic features, switches the deep-learning-based dynamic language model to the language model corresponding to that domain for computation (e.g. feature decoding), takes the result as the text prediction/recognition result corresponding to the speech to be recognized, and returns the corresponding recognition information.
Further, the method may also comprise constructing training data to train the neural network model, as illustrated in the sketch below. Constructing the training data comprises: randomly selecting sentence corpora in each domain and taking each sentence corpus as one training sample; converting all corpora into units corresponding to the acoustic model to serve as input to the neural network model, i.e. acoustic features organized in sentence form (the corpus of one sentence); using the texts of all corpora in each training sample as the output of the neural network model; and labeling the domain to which each sentence corpus belongs.
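A sketch of this data construction, under the same assumptions as before (sentences already converted to acoustic units) plus a hypothetical tokenize function that maps text to a tensor of token ids:

```python
import random

def build_lm_samples(domain_corpora, tokenize):
    """domain_corpora: {domain_id: [(acoustic_units, text), ...]};
    `tokenize` is a hypothetical text-to-token-id function.
    One training sample per sentence corpus, labeled with its domain."""
    samples = []
    for domain_id, sentences in domain_corpora.items():
        for acoustic_units, text in sentences:
            samples.append({"input": acoustic_units,       # acoustic-model units
                            "target_ids": tokenize(text),  # text of the corpus
                            "domain": domain_id})          # labeled domain
    random.shuffle(samples)                                # random selection/order
    return samples
```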
Further, training the neural network model may include: in the domain information decoding module, performing information conversion on the labeled domain of each sentence corpus used for training and then performing feature encoding; in the language model module, fusing the feature code with the training input, then performing feature decoding, and calculating a loss function between the predicted output and the text used for feature decoding, so as to determine the model's prediction condition and adjust the model parameters; and determining whether the predetermined number of training iterations has been reached and/or whether the loss function has reached its convergence target, and if so, ending the training of the neural network model.
An example of an application and implementation is described with reference to the speech recognition service deployment pattern shown in fig. 5. In this example, in order to address the problems of high service cost, serious waste of computing resources, and the difficulty of achieving multi-domain cross recognition during continuous long-dialog recognition, a deep-learning-based dynamic language model approach is provided: the front end supplies domain information to the model, and the back-end model provides recognition services for different domains according to that domain information. For the back end to switch its recognition service to the corresponding domain, the domain information of the corpus is therefore essential; the front end may acquire/judge this domain information automatically or manually, with automatic judgment being particularly effective for continuous recognition of long dialogs that cross multiple domains.
Specifically, for example: first, the input speech to be recognized is split into corpora according to the unit corresponding to the acoustic model used for extracting acoustic features, i.e. corpora forming sentences are constructed, and features are extracted from the results obtained by the acoustic model for subsequent language recognition.
The acoustic model may be any of various known acoustic models built with the prior art, such as models based on hidden Markov models (HMM, DNN-HMM, etc.), and is mainly used to extract (e.g. compute frame by frame) the acoustic features of the speech, i.e. to construct the corpus of each sentence of the speech to be recognized in a specific structure. Thus, the corpus of each sentence of speech to be recognized in the service request takes the form of units matching the corresponding acoustic model.
In long-dialog recognition, the domain to which the current speech to be recognized belongs, i.e. the domain of the corpus converted from that speech, may be determined according to the dialog that has already occurred. This makes it possible to produce results for different domains within one long dialog, i.e. to distinguish the different domains that cross within it. In the example of long-dialog recognition shown in fig. 6, one service contains several sentences of dialog. The front part ("Hello, I want to ask about a listing" … "your store's recent activity") consists of dialog in domain A, relating to trading goods; the rear part ("your recent epidemic-prevention situation" … "your recent quarantine work situation") consists of dialog in domain B, relating to epidemic prevention and control. Within one service, the recognition of each sentence is combined with the recognition judgment of the previous sentence to determine its domain: when the speech or corpus of the current sentence is to be recognized, the domain to which it belongs is judged first, and this judgment needs to incorporate the recognition information of the sentence preceding the current one. For example, if the previous sentence is "your store's recent activity" and the current sentence is "your recent epidemic-prevention situation", then according to the scheme of the present invention it is determined that the domain of the current sentence has changed (it is no longer domain A) and, further, that it has changed to domain B.
The following describes the implementation process of receiving a speech service request with reference to the specific embodiment of constructing and training the corpus domain classifier and the language model shown in figs. 7, 8 and 9. In one embodiment, the speech to be recognized is recognized sentence by sentence: a single-sentence speech service request can be recognized directly, and recognition of a long dialog of several sentences can be realized one sentence at a time.
Further, when a service request is received, each sentence of the speech to be recognized is converted into a corpus in the units of the acoustic model used, i.e. the acoustic features of the speech to be recognized are extracted by the acoustic model. For the corpus of a current sentence to be recognized, its domain classification can be determined, i.e. it is determined to which domain the current dialog belongs, for example domain A or domain B, and feature extraction (e.g. feature vectors) may also be performed on the acoustic features. Labels corresponding to the corpora of each domain are preset, for example: domain I has label 0, domain II has label 1, and so on, as illustrated below; distinguishing the domains in this way facilitates switching the subsequent language model to the corresponding domain and completing the final recognition through the corresponding language model/algorithm.
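For instance, the preset label table might simply be a dictionary; the domain names here are placeholders:

```python
# Illustrative domain label table: domain I -> 0, domain II -> 1, and so on.
DOMAIN_LABELS = {"domain I": 0, "domain II": 1, "domain III": 2}
ID_TO_DOMAIN = {v: k for k, v in DOMAIN_LABELS.items()}   # reverse lookup
```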
In one embodiment, for a single-sentence speech recognition service, the domain of the sentence's corpus can be judged directly (for example, classified directly by the corpus domain classifier) to provide domain information for the recognition by the subsequent language algorithm; for a single sentence, externally supplied domain information may even be input directly, passing straight into the recognition by the subsequent language algorithm.
In another embodiment, for a long-dialog speech recognition service, the constructed corpus domain classifier is combined with the recognition of the sentence preceding the current one to determine the domain to which the corpus of the current sentence belongs.
In one embodiment, the constructed corpus domain classifier is used after being trained on specific training data; for example, once trained, the classifier is deployed online.
An example of training the corpus domain classifier is to train it as a whole, e.g. to train the whole formed by combining the corpus domain change judgment and the corpus domain classification (hereinafter, the whole 1).
Specifically, one example of constructing the training data: the corpora in each domain form several dialogs, and dialogs from several domains are randomly selected and spliced together as one training object or training sample. All corpora are converted into units corresponding to the acoustic model as input, the texts of the corpora are used as the output results, and the joint points between different domains are given change labels.
Further, taking the structural framework of an embodiment of the scheme of the present invention shown in fig. 7 as an example, training of the whole 1 is described specifically. The whole 1 comprises at least a corpus domain change judgment module and a corpus domain classification module; the two modules/parts are first connected, i.e. the intermediate output O_12 of the corpus domain change judgment part is connected to the input I_21 of the corpus domain classification module, forming the whole 1, and training of the whole 1 is then performed.
Specifically: the input h of the corpus domain change judgment part represents the state of the feature extractor at time t; it is a state vector representing the dialog that has already occurred, obtained as the state vector returned by the corpus domain change judgment part at time t-1. That is, at each time step the state vector corresponding to the current time is returned and passed to the next time step for use in prediction. I denotes input, O denotes output, δ denotes classification judgment, and + denotes feature fusion.
Here, I_1 is the input corpus of the current sentence to be recognized, i.e. the acoustic features that construct the corpus of the current dialog, conforming to the unit corresponding to the acoustic model used for extracting acoustic features. O_11 is the output of the classification judgment, while O_12 is the output of the feature extraction performed on the corpus of the current sentence under the state h, i.e. the first feature. The intermediate output O_12 is fed in as input I_21 (the first feature) and is directly fused with I_22, i.e. the feature extracted from the acoustic features of the corpus of the current sentence; that is, the first feature vector is fused with the feature vector extracted from the corpus, classification judgment is then performed, and the output O_2 is obtained.
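One way to realize the whole 1 is sketched below in PyTorch; the GRU feature extractors, the layer sizes, and the concatenation used for the '+' fusion are all assumptions made for illustration, not the patent's prescribed architecture:

```python
import torch
import torch.nn as nn

class CorpusDomainClassifier(nn.Module):
    """Sketch of the whole 1: a change-judgment part carrying the state h
    across sentences, and a classification part fusing the first feature
    with features of the current corpus. All sizes are assumptions."""
    def __init__(self, feat_dim=80, hidden=256, n_domains=4):
        super().__init__()
        self.extractor = nn.GRU(feat_dim, hidden, batch_first=True)  # keeps h
        self.change_head = nn.Linear(hidden, 1)   # delta: has the domain changed?
        self.corpus_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.domain_head = nn.Linear(2 * hidden, n_domains)

    def init_state(self):
        return None                               # GRU starts from a zero state

    def forward(self, units, h=None):
        # units: (1, T, feat_dim) acoustic features of the sentence (I_1/I_22)
        _, h_new = self.extractor(units, h)       # conditioned on state h
        first_feature = h_new[-1]                 # O_12, shape (1, hidden)
        change_logit = self.change_head(first_feature).squeeze()    # O_11
        _, enc = self.corpus_encoder(units)       # features of the current corpus
        fused = torch.cat([first_feature, enc[-1]], dim=-1)         # '+' fusion
        domain_logits = self.domain_head(fused).squeeze(0)          # O_2
        return change_logit, domain_logits, h_new
```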
During training, for example as shown in figs. 8 and 9, the whole 1 is trained using the training data constructed as described above: the spliced corpora with change labels at the domain joint points serve as input, the texts corresponding to the corpora serve as output, and the loss function is calculated accordingly. The loss function may, for example, have two parts, at least including: a loss function for judging whether the domain to which the corpus belongs has changed, and a loss function for judging the domain to which the corpus in the input belongs. Features of the corpus of the current sentence are extracted in combination with the state vector h, and the loss function for judging whether the domain has changed is then computed; after the features of the corpus are extracted, they are fused with the first feature, and the loss function for judging the domain to which the corpus in the input belongs is computed.
In training, based on the above example, the output O_11 of the corpus domain change judgment part of the whole 1 is used to compute the loss function for judging whether the domain of the corpus has changed, and the output O_2 of the whole 1 during training, i.e. the classification judgment performed on the fusion of the first feature and the acoustic features, is used to compute the loss function for judging the domain to which the corpus in the input belongs. Both loss functions are used when determining the domain of the corpus to be recognized and when training the whole 1, and determine the prediction condition of the classifier/classification model. The loss function may generally adopt L1, L2, cross entropy, and the like, as used in deep learning.
In one embodiment, during the training process, whether to end training may be determined according to whether the predetermined number of training iterations has been reached and/or whether each loss function has reached its convergence target: if so, training of the corpus domain classifier ends and a trained classifier (the whole 1) is obtained; if not, iteration/training continues until the predetermined number of iterations or the convergence target is reached. Further, when the trained classifier is actually applied to speech recognition, its output is the judged domain corresponding to the corpus, i.e. O_2, and this O_2 is used as the domain information I_3 of the domain information decoding module (shown in figs. 7 to 9), input there for information conversion and feature encoding, so that the domain information is used in the recognition process of the subsequent language model.
In one embodiment, the deep-learning-based dynamic language model may, when the model is built, use an embedding layer (Embedding) to represent in one neural network model the language model structure corresponding to each domain. A neural network model with an Embedding structure can switch to the corresponding language algorithm for the recognition operation according to different feature codes, so that only one model needs to be provided. In the deployment shown in fig. 5, this one model corresponds to different language models through different codes: the front end judges the domain to which the corpus of the speech to be recognized belongs, the judged domain information is provided to the single back-end model, and the model switches to the corresponding code to execute the corresponding algorithm and output the recognition result. Various known neural network models using the Embedding architecture may be adopted; details are not repeated here.
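A minimal sketch of such an embedding-conditioned model, with the vocabulary size, dimensions, and GRU encoder chosen purely for illustration:

```python
import torch
import torch.nn as nn

class DynamicLanguageModel(nn.Module):
    """Sketch of the whole 2: a domain-information decoding part that turns
    the domain label into a feature code via an Embedding layer, and a
    language model part that fuses the code with acoustic features before
    decoding. Sizes and layer choices are illustrative assumptions."""
    def __init__(self, n_domains=4, feat_dim=80, hidden=256, vocab=5000):
        super().__init__()
        self.domain_embed = nn.Embedding(n_domains, hidden)  # feature code O_3
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(2 * hidden, vocab)          # per-frame decoding

    def forward(self, units, domain_id):
        # units: (1, T, feat_dim) acoustic features (I_42); domain_id: int (I_3)
        code = self.domain_embed(torch.tensor([domain_id]))  # (1, hidden), I_41
        enc, _ = self.encoder(units)                         # (1, T, hidden)
        fused = torch.cat(
            [enc, code.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1)
        return self.decoder(fused)                           # token logits O_4
```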
In one embodiment, switching the deep-learning-based dynamic language model to the corresponding domain for recognition may specifically be: the domain information is converted and then feature-encoded, and the feature code is fused with the acoustic features, so that the dynamic language model switches to the language model/algorithm corresponding to that domain's feature code, computes on the feature vector extracted from the corresponding input (the acoustic features), obtains the feature decoding as the text prediction/recognition result corresponding to the speech to be recognized, and returns the corresponding recognition information.
In one embodiment, the built language model needs to be trained before deployment. Reference may be made to the examples of the judgment and language recognition service deployment and of training the corpus domain classifier and the language model shown in figs. 7 to 9. Specifically, training data for the language model may be prepared in advance: for example, the corpora in each domain form several dialogs, sentence corpora are randomly selected in each domain, and each sentence corpus is taken as one training object/sample. All corpora are then converted into units corresponding to the acoustic model, as in the classifier training, i.e. the corpus or acoustic features of one sentence are constructed and used as the model input during training, with the text of the corpus as the model output/result; and the domain of each sentence corpus is labeled, so that the domain of each training sample is annotated.
Further, in a specific example, the language model as a whole may be divided into a domain information decoding part and a language model (language algorithm) part that performs the specific operations, such as the domain information decoding module and the language model module shown in figs. 7 to 9. In the domain information decoding module, I_3 is the input domain information, and O_3 is output after information conversion and feature encoding. The language model has two inputs, I_41 and I_42; after fusion, feature decoding is performed, i.e. O_4 is output after the operation, which is the recognition result corresponding to the acoustic features input from the acoustic model, e.g. the text of the corpus. The two parts/modules can be connected to form the deep-learning-based dynamic language model as a whole (the whole 2), which is then trained. The connection, as shown in fig. 8 or 9, takes the output O_3 as the input I_41; that is, the domain feature code is provided to the dynamic language model so that it can switch to the language algorithm corresponding to that code, serving as the basis or condition for operations such as decoding. During training, the whole 2 may be trained with the constructed language-model training data. Specifically: the labeled domain of each training sample is information-converted and then feature-encoded (i.e. the feature vector of the corresponding domain is obtained); the feature code is fused/combined with the input I_42 of the neural network model and feature decoding is then performed, i.e. the model switches to the corresponding domain for text recognition; and a loss function is calculated between the output of the neural network model and the text of the feature decoding to determine the model's prediction condition. It is further determined whether the predetermined number of training iterations has been reached and/or whether the loss function has reached its convergence target, and if so, training of the neural network model ends.
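Training the whole 2 might then look as follows, assuming the samples produced by build_lm_samples above and frame-aligned token targets (a real system might instead use CTC or attention-based alignment, which the patent does not specify):

```python
import torch
import torch.nn.functional as F

def train_whole2(model, samples, epochs=10, lr=1e-3, target_loss=1e-3):
    """Train the whole 2 on the language-model training data built above;
    each sample holds acoustic units, target token ids, and a domain label."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                          # predetermined training count
        total = 0.0
        for s in samples:
            logits = model(s["input"], s["domain"])  # (1, T, vocab)
            # loss between decoded features and the labeled text tokens;
            # assumes s["target_ids"] is a length-T LongTensor
            loss = F.cross_entropy(logits.squeeze(0), s["target_ids"])
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if total / max(len(samples), 1) < target_loss:   # convergence target
            break
```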
[ example 3 ]
The process of constructing the models, deploying the speech recognition service, and performing online recognition in an actual application scenario according to the technical solution of the present invention is further described below with reference to the schematic diagrams, shown in figs. 7 to 11, of model construction, online deployment, and speech recognition in a specific application. This is only one specific application example and does not limit the implementation of the present invention.
Firstly, labels corresponding to the corpora of each domain are set, for example: 'domain I' is 0, 'domain II' is 1, and so on.
Secondly, a corpus domain change judgment module, a corpus domain classification module, a domain information decoding module, and a language model with domain input are established, as shown in figs. 7 to 9. Here h represents the state of the feature extractor at time t; it is a state vector representing the dialog that has already occurred, obtained as the state vector returned by the module at time t-1, i.e. at each time step the state vector corresponding to the current time is returned and passed to the next time step for use in prediction. In the illustrated example, δ represents the classification judgment, + represents feature fusion, I represents input, and O represents output.
Example 1: according to the condition h (the state vector of the previous sentence), the corpus domain change judgment module performs feature extraction on the input I_1, outputs the extracted feature (e.g. a feature vector) as the first feature O_12, and from this first feature O_12 determines whether the domain of the current sentence has changed relative to the previous sentence, outputting the judgment result O_11. Here I_1 may be the corpus of a sentence converted or constructed from the speech to be recognized (the speech in the speech service/request), e.g. the corpus of the current sentence; the input corpus takes the form of units corresponding to the acoustic model, i.e. the result (acoustic features) obtained by the acoustic model, and feature extraction is then performed on it (still acoustic features, possibly in feature-vector form). For example, when the acoustic model recognizes the speech, the corpus I_1 of the sentence is constructed in sentence form (acoustic features with an ordered grammatical-structure context); I_1 serves as input, the acoustic features of the corpus represented by I_1 are extracted, and these may take the form of feature vectors, which facilitates switching and computation in the language algorithms of the neural network model constructed with Embedding. Similarly, I_22 and I_42 follow the same form as I_1.
Example 2: the corpus domain classification module has two inputs, I_21 and I_22. I_22, like I_1, is the input corpus of the current sentence, in units corresponding to the acoustic model, while I_21 is the additionally extracted feature of the corpus (the first feature). Both are input together: features are extracted from I_22, the corpus of the current sentence of the speech to be recognized, and then fused with I_21, so as to determine to which domain the feature extracted from input I_22 should actually belong, i.e. corpus domain classification, and the judged/predicted domain O_2 of the current sentence's corpus is output.
Example 3: the input I_3 of the domain information decoding module is the domain information; after information conversion and feature encoding, the code O_3 of the feature corresponding to the respective domain is output, indicating that domain.
Example 4: the language model module extracts features from the input I_42, i.e. the corpus of the current sentence, combines or fuses them with the input domain code I_41, then performs feature decoding and outputs the decoded result O_4, i.e. the sentence/words (various text predictions) corresponding to the corpus of the current sentence. Because the domain code (vector) can determine, or switch to, the corresponding recognition algorithm within the whole neural network model constructed with the embedding layer (the deep-learning-based dynamic language model), the code of a domain corresponds to that domain's mapping algorithm; for example, words are mapped into the vocabulary, i.e. to the corresponding high-dimensional vectors, and the positions of the corresponding words are found. The embedding layer can handle everything that can be converted into vectors: dense vector representations achieve dimensionality reduction, so each word is replaced by a lookup of its vector's index in the embedding matrix. This simplifies the model and lets it switch quickly to the corresponding vocabulary for mapping or neural-network prediction, obtaining the words (the text corresponding to the corpus of the sentence). The model using the embedding layer and its prediction method may be an existing neural network model and prediction method, and are not described in detail.
In the foregoing examples 1 to 4, when the corpus domain classifier (examples 1 and 2) and the language model (examples 3 and 4) need to be trained, the training data is constructed first.
For example, the corpus domain classification training data: in each domain the corpora form several dialogs, and dialogs from several domains are randomly selected and spliced together as one training object/sample. All corpora are converted into units corresponding to the acoustic model as input, the texts of the corpora serve as the results, and the joint points between different domains are given change labels.
As another example, the language model training data: each sentence corpus is used as a training object/sample, all corpora are converted into units corresponding to the acoustic model as input, the texts are used as the results, and the domain to which each sentence corpus belongs is labeled.
Thirdly, the modules are connected as shown in figs. 8 and 9. The intermediate output O_12 of the corpus domain change judgment module is connected to the input I_21 of the corpus domain classification module, and the combination is trained as a whole, referred to as the whole 1 (i.e. serving as the corpus domain classifier). The output O_3 of the domain information decoding module is connected to the language model input I_41, and the combination is trained as a whole, referred to as the whole 2 (i.e. serving as the deep-learning-based dynamic language model).
Fourth, the models are trained, as in fig. 9: the whole 1 is trained using the constructed corpus domain classification training data, and the whole 2 is trained using the constructed language model training data.
Example 5: when training the whole 1, i.e. the corpus domain classifier, the input is mainly the corpus of the corresponding training sample. O_12 is output based on the condition, i.e. it is the first feature of the current sentence's corpus extracted under the state vector of time t-1; from this first feature it is also judged whether its domain has changed relative to the previous sentence, and O_11 is output. The O_11 part determines the prediction condition by computing a loss function (i.e. comparing the predicted output with the labels in the training sample to determine the model/classifier's prediction condition and adjust the model parameters). O_12 is then taken as the input I_21 of the corpus domain classification module; the feature vector extracted from the acoustic features I_22 obtained directly from the corpus of the current sentence is fused or combined with the input I_21 (the first feature extracted under the state vector from the acoustic features), after which the domain corresponding to the corpus of the current sentence is judged and O_2 is output. For the output O_2, likewise, the model prediction condition is determined by computing a loss function and the model parameters are adjusted accordingly. This training process loops continuously, adjusting the classifier's parameters according to the two losses, until a preset condition for ending model training, such as the number of iterations, is reached.
Example 6: when training the whole 2, i.e. the language model, the inputs are mainly the corpus and the domain information of the corresponding training sample. The output O_3, obtained by converting the domain information and feature-encoding it, provides the code to the specific input I_41 of the language model; the feature vector extracted directly from the corpus I_42 of the current sentence is combined with I_41, i.e. the domain code (feature vector), thereby determining the switch to the correspondingly mapped vocabulary (the language algorithm), completing feature decoding, and outputting O_4. For this O_4, the model prediction condition is determined during training by computing a loss function against the text actually corresponding to the sample. This training process loops continuously, adjusting the parameters of the model (e.g. the neural network model) according to the loss, until a preset condition for ending model training, such as the number of iterations, is reached.
Fifth, the service is provided. After the models (the whole 1 and the whole 2) are trained, they are deployed online for the speech recognition service. Following the principle shown in fig. 5, the front end of the service completes the domain judgment (the judging device) and provides the domain information to the single back-end model (the recognition service device); that model is the deep-learning-based dynamic language model, which can determine the corresponding language algorithm from the domain code and predict the text result of the speech to be recognized for the corresponding service request.
Taking the long-dialog scene recognition shown in fig. 6 as an example, and with reference to fig. 10: upon a service request, the result obtained by the acoustic model is sent to the whole 1 as input, together with information such as the state vector h from recognizing the previous sentence, i.e. the information that a dialog has already occurred. Because feature extraction is conditioned on h, extracting features from the corpus of the current sentence in the service (the result obtained by the acoustic model) allows judging whether the domain of the corpus has changed, outputting O_11; the extracted feature, as the first feature, is combined with the input I_22 (the features extracted directly from the corpus of the current sentence) to classify/judge the domain to which the corpus of the current sentence belongs. In this way the whole 1 determines the domain of the current result and transmits the domain information to the whole 2. In the whole 2, the domain information is converted and then feature-encoded to obtain the domain's coding information (e.g. a vector), which is provided to the neural network model with the embedding layer, i.e. the deep-learning-based dynamic language model; combined with I_42 (the features extracted from the corpus of the current sentence), decoding proceeds under the corresponding domain code (vector), i.e. with the corresponding domain's algorithm, and feature decoding (i.e. prediction) outputs the text result O_4 corresponding to the current sentence. Thus one model is dynamically switched across domains to complete prediction, giving the text result of the corresponding domain in combination with the acoustic feature result extracted by the acoustic model.
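Putting the sketches above together, an illustrative online loop for one long-dialog service request could look as follows; acoustic_model is a hypothetical callable returning (1, T, feat_dim) feature tensors per sentence:

```python
def recognize_long_dialog(whole1, whole2, acoustic_model, dialog_audio):
    """Per sentence: the acoustic model yields the corpus units, the whole 1
    judges the domain while carrying the state h across sentences, and the
    whole 2 decodes text under that domain's language algorithm."""
    state, results = whole1.init_state(), []
    for sentence_audio in dialog_audio:
        units = acoustic_model(sentence_audio)        # corpus of the sentence
        _change, domain_logits, state = whole1(units, state)
        domain_id = int(domain_logits.argmax())       # judged domain
        logits = whole2(units, domain_id)             # hot-switched decoding, O_4
        token_ids = logits.squeeze(0).argmax(-1).tolist()   # greedy decode
        results.append(token_ids)                     # map ids to text downstream
    return results
```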
Further, in a single-sentence recognition scenario, besides judging the domain via the whole 1 and then predicting the text result, the result obtained by the acoustic model and the domain information may be input directly to the whole 2 at service-request time. For example, in the alternative single-sentence recognition mode shown in fig. 11, the domain information input I_3 completes conversion and feature encoding and is combined with the input I_42 (the acoustic features obtained through the acoustic model), whereby the text result recognized under the corresponding domain is obtained, realizing hot switching of domains.
[ example 4 ]
In particular, an embodiment of an electronic device is also included, comprising a processor and a memory storing computer executable instructions, wherein the computer executable instructions, when executed, cause the processor to perform the embodiment steps of the method of the invention as referred to in the preceding embodiments 1 to 3.
An embodiment of the electronic device of the invention is described below, which may be regarded as an implementation in physical form for the method and device embodiments of the invention described above. The details described in this embodiment of the electronic device of the invention should be considered supplementary to the embodiments of the method or device/system described above; for details not disclosed in embodiments of the electronic device of the invention reference may be made to the above-described method or device/system embodiments.
Fig. 12 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 12 is only an example, and should not bring any limitation to the function and the range of use of the embodiment of the present invention.
As shown in fig. 12, the electronic apparatus 200 of the exemplary embodiment is represented in the form of a general-purpose data processing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit 220 stores a computer-readable program, which may be source code or object code. The program may be executed by the processing unit 210, such that the processing unit 210 performs the steps of the various embodiments of the present invention; for example, the processing unit 210 may perform the steps of the methods of the foregoing embodiments.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203. The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic apparatus 200 may also communicate with one or more external devices 300 (e.g., a keyboard, a display, a network device, a bluetooth device, etc.), enable a user to interact with the electronic apparatus 200 via the external devices 300, and/or enable the electronic apparatus 200 to communicate with one or more other data processing devices (e.g., a router, a modem, etc.). Such communication may occur through input/output (I/O) interfaces 250, and may also occur through network adapter 260 with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet). The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
[ example 5 ]
Specifically, a computer readable storage medium is also included, which stores one or more programs, wherein when the one or more programs are executed by a processor, the embodiment steps related to the method of the present invention in the aforementioned embodiments 1 to 3 are implemented.
FIG. 13 is a schematic diagram of a computer-readable medium embodiment of the present invention. As shown in fig. 13, the computer program may be stored on one or more computer-readable media. The computer-readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely: obtaining, according to a speech recognition service request, the acoustic features of the speech to be recognized and the domain information corresponding to those acoustic features; and switching a deep-learning-based dynamic language model to the domain corresponding to the domain information to recognize the acoustic features, thereby determining the text recognition result corresponding to the speech to be recognized.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a data processing device (which can be a personal computer, a server, or a network device, etc.) to execute the above method according to the present invention.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
In summary, the present invention can be implemented as a method, system, electronic device, or computer readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
The deep-learning-based dynamic language model at the back end, controlled by the judgment logic at the front end, can adapt the recognition service to different domains through dynamic recognition driven by domain information. This simplifies and reduces the number of models, lowers the configuration requirements on servers and service engines, and lets a small number of models, or even a single model, handle the actual concurrency. Further, the combination of the dynamic model with domain information achieves hot switching in which only one model is needed to provide all recognition services, entering directly into the computation of the speech recognition service for the matched domain; that is, the neural network model with the embedding-layer (Embedding) structure performs multi-language-model switching and recognition computation, reducing manual maintenance and lowering labor cost. Further, controlling the dynamic language model through front-end judgment logic combined with domain information copes effectively with long-dialog recognition, in particular long dialogs that continuously cross multiple domains: according to the dialog that has occurred, the domain of the currently recognized speech is judged and corresponding recognition results for the different domains are provided, effectively realizing multi-domain cross recognition.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; all changes and equivalents that come within the spirit and scope of the invention are intended to be embraced.

Claims (16)

1. A speech recognition method, comprising:
according to the voice recognition service request, obtaining the acoustic features of the voice to be recognized and the field information corresponding to the acoustic features, specifically comprising: in a service request, converting the speech to be recognized into a unit corresponding to an acoustic model to construct a corpus of a current sentence as the acoustic feature; judging the domain to which the corpus of the current sentence belongs according to the domain classification of the corpus by a trained corpus domain classifier, and taking the domain as the domain information corresponding to the acoustic feature; wherein, the training of the corpus field classifier specifically comprises:
constructing training data of a corpus field classifier, comprising: forming a plurality of dialogues by the linguistic data in each field, randomly selecting the dialogues in a plurality of different fields to be spliced together to form a training sample, converting all the linguistic data in each training sample into a unit corresponding to an acoustic model to be used as the input of a linguistic data classifier, using the texts of all the linguistic data in each training sample as the output of the linguistic data classifier, and carrying out change labeling on the joint points of each different field in each training sample; training a pre-constructed corpus field classifier by using training data of the training corpus field classifier;
switching a dynamic language model based on deep learning to a domain corresponding to the domain information to recognize the acoustic features so as to determine a text recognition result corresponding to the speech to be recognized, specifically comprising: the dynamic language model based on deep learning comprises a constructed neural network model which adopts an embedding layer (Embedding) to represent each language model structure corresponding to each field; and converting the domain information, then performing feature coding, fusing the feature coding and features extracted from the acoustic features, switching the dynamic language model based on deep learning to a language algorithm corresponding to the domain indicated by the feature coding according to the fused features, and performing feature decoding on the fused features to obtain a predicted text recognition result corresponding to the speech to be recognized.
2. The method of claim 1, wherein the speech recognition service request comprises: a service request identified by a long dialog, and/or a service request identified by a single sentence.
3. The method according to claim 2, wherein the determining a domain to which the corpus of the current sentence belongs by the trained corpus domain classifier according to a domain classification of the corpus as the domain information corresponding to the acoustic feature further comprises:
when the voice recognition service request is a service request recognized by a single sentence, directly judging the field of the corpus of the current sentence according to the field classification of the corpus;
when the voice recognition service request is a service request of long conversation recognition, extracting first features of the corpus of the current sentence according to the recognition information of the previous sentence, fusing the first features and the features directly extracted from the corpus of the current sentence, and then classifying according to the fields of the corpus to judge the field to which the corpus of the current sentence belongs; wherein the previous sentence refers to a sentence before the current sentence.
4. The method of claim 3,
the field classification of the corpus comprises the following steps: setting labels corresponding to the linguistic data in each field, and identifying the field classification of the linguistic data;
and/or the presence of a gas in the gas,
the identification information of the previous sentence includes: the state vector of the last sentence that has already occurred and been identified.
5. The method of claim 4, wherein
training the pre-constructed corpus domain classifier with the training data specifically comprises:
performing first feature extraction on the input according to the state at the current time t, and calculating a loss function for judging whether the corpus domain has changed, wherein the state at the current time t is the state vector returned at time t-1, that is, the state vector of the sentence at time t-1 that has already occurred and been recognized; performing feature extraction on the input, fusing in the first feature, and calculating a loss function for judging the domain to which the corpus in the input belongs;
and determining whether a preset number of training iterations has been reached and/or whether each loss function has reached its convergence target; if so, the training of the corpus domain classifier is finished.
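A hedged sketch of the joint training in claim 5: one loss for the corpus-domain-change judgment and one for the domain classification, with the state vector returned at t-1 feeding the step at time t, and training stopping on a preset epoch count and/or loss convergence. The model interface (returning a change logit, domain logits, and a state per sentence) and the sample layout are assumptions:

```python
import torch
import torch.nn.functional as F

def train_classifier(model, optimizer, samples, epochs=20, loss_target=0.05):
    """samples: list of (sentence_units, change_labels, domain_labels) per
    spliced training sample; change labels are float tensors, domain labels
    are long tensors. All names are illustrative."""
    for epoch in range(epochs):
        epoch_loss = 0.0
        for sentence_units, change_y, domain_y in samples:
            optimizer.zero_grad()
            state, loss = None, 0.0
            for units, c, d in zip(sentence_units, change_y, domain_y):
                change_logit, domain_logits, state = model(units, state)
                # Loss 1: has the corpus domain changed at this sentence?
                loss = loss + F.binary_cross_entropy_with_logits(change_logit, c)
                # Loss 2: which domain does this corpus belong to?
                loss = loss + F.cross_entropy(domain_logits, d)
            loss.backward()
            optimizer.step()
            epoch_loss += float(loss)
        # Finish when the preset iteration count runs out and/or the losses
        # reach the convergence target, per claim 5.
        if epoch_loss / max(len(samples), 1) < loss_target:
            break
```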
6. The method according to any one of claims 1 to 5, further comprising constructing training data for training the dynamic language model, specifically comprising:
randomly selecting sentence corpora from each domain, each sentence corpus serving as a training sample;
and converting all corpora in each training sample into units corresponding to the acoustic model as the input of the dynamic language model, using the texts of all corpora in each training sample as the output of the dynamic language model, and labeling the domain to which the corpus of each training sample belongs.
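A small sketch of this data construction, with hypothetical toy corpora and a placeholder unit conversion; none of these names come from the patent:

```python
import random

DOMAIN_CORPORA = {
    "finance": ["please check my balance"],
    "weather": ["is it windy today"],
}

def to_acoustic_units(sentence):
    return list(sentence.replace(" ", ""))  # placeholder unit conversion

def build_lm_samples():
    samples = []
    for domain, sentences in DOMAIN_CORPORA.items():
        for sent in random.sample(sentences, k=len(sentences)):  # random order
            samples.append({
                "input": to_acoustic_units(sent),  # dynamic LM input
                "output": sent,                    # dynamic LM target text
                "domain": domain,                  # labeled domain
            })
    return samples

print(build_lm_samples()[0])
```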
7. The method of claim 6, further comprising training the dynamic language model, specifically comprising:
converting the labeled domain information of each training sample and then performing feature coding;
extracting features from the input used for training the dynamic language model and fusing them with the feature code;
switching the dynamic language model to the language algorithm corresponding to the domain indicated by the feature code in the fused features, and performing feature decoding on the fused features;
calculating a loss function from the output used for training the dynamic language model and the feature-decoded text;
and determining whether a preset number of training iterations has been reached and/or whether the loss function has reached its convergence target; if so, the training of the dynamic language model is finished.
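A hedged training-loop sketch matching claim 7, reusing the earlier DynamicLanguageModel-style interface (model(features, domain_ids) returning per-frame logits); the batch layout, optimizer choice, and convergence threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def train_dynamic_lm(model, batches, epochs=10, loss_target=0.1):
    """batches: iterable of (acoustic_feats, domain_ids, target_token_ids),
    with feats (B, T, D), domain_ids (B,), targets (B, T)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for feats, domain_ids, targets in batches:
            optimizer.zero_grad()
            logits = model(feats, domain_ids)   # fuse domain code + features, decode
            # Loss between the decoded text distribution and the target text.
            loss = F.cross_entropy(logits.transpose(1, 2), targets)
            loss.backward()
            optimizer.step()
            epoch_loss += float(loss)
        if epoch_loss / max(len(batches), 1) < loss_target:
            break                               # convergence target reached
    return model
```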
8. A speech recognition system, comprising:
a judging device arranged at the front end, and a speech recognition service device connected to the judging device and arranged at the back end;
the judging device is configured to determine, in a service request, the domain information corresponding to the acoustic features of the speech to be recognized according to the domain classification of the corpus, specifically comprising: in a service request, converting the speech to be recognized into units corresponding to an acoustic model to construct the corpus of the current sentence as the acoustic features; and judging, by a trained corpus domain classifier, the domain to which the corpus of the current sentence belongs according to the domain classification of the corpus, taking that domain as the domain information corresponding to the acoustic features; the training of the corpus domain classifier specifically comprises:
constructing training data, comprising: forming a plurality of dialogues from the corpus of each domain; randomly selecting dialogues from several different domains and splicing them together to form a training sample; converting all corpora in each training sample into units corresponding to the acoustic model as the input of the corpus classifier; using the texts of all corpora in each training sample as the output of the corpus classifier; and labeling the junction point between different domains in each training sample as a domain change; and training the pre-constructed corpus domain classifier with this training data;
the speech recognition service device is configured to recognize the acoustic features after switching a deep-learning-based dynamic language model to the domain corresponding to the domain information, so as to determine the text recognition result corresponding to the speech to be recognized, specifically comprising: constructing the deep-learning-based dynamic language model as a neural network model in which an embedding layer (Embedding) represents the language model structure corresponding to each domain; and converting the domain information and performing feature coding on it, fusing the feature code with features extracted from the acoustic features, switching the dynamic language model to the language algorithm corresponding to the domain indicated by the feature code in the fused features, and performing feature decoding on the fused features to obtain the predicted text recognition result corresponding to the speech to be recognized.
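A schematic of the claim-8 front-end/back-end split, in plain Python with trivial stand-ins; every class and method name here is illustrative, not from the patent:

```python
class JudgingDevice:
    """Front end: builds the current sentence's corpus from speech and
    judges its domain. `domain_classifier` and `to_units` are injected
    placeholders."""
    def __init__(self, domain_classifier, to_units):
        self.domain_classifier = domain_classifier
        self.to_units = to_units

    def judge(self, speech):
        units = self.to_units(speech)                 # acoustic-model units
        return units, self.domain_classifier(units)   # (features, domain info)

class SpeechRecognitionService:
    """Back end: switches the dynamic language model to the indicated
    domain before decoding."""
    def __init__(self, dynamic_lm):
        self.dynamic_lm = dynamic_lm

    def recognize(self, units, domain_info):
        return self.dynamic_lm(units, domain_info)

def handle_request(speech, front_end, back_end):
    units, domain_info = front_end.judge(speech)    # front-end judgment
    return back_end.recognize(units, domain_info)   # back-end recognition

# Toy wiring with trivial stand-ins:
front = JudgingDevice(domain_classifier=lambda u: "finance",
                      to_units=lambda s: list(s.replace(" ", "")))
back = SpeechRecognitionService(dynamic_lm=lambda u, d: f"[{d}] {''.join(u)}")
print(handle_request("check my balance", front, back))
```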
9. The system of claim 8, wherein
the speech recognition service request comprises: a service request for long-dialog recognition, and/or a service request for single-sentence recognition.
10. The system of claim 9, wherein the judging device is further configured such that:
when the speech recognition service request is a service request for single-sentence recognition, the domain to which the corpus of the current sentence belongs is judged directly according to the set domain classification of the corpus and output as the domain information to the speech recognition service device;
when the speech recognition service request is a service request for long-dialog recognition: first, first feature extraction is performed on the corpus of the current sentence according to the recognition information of the previous sentence, and the first feature is output together with a judgment of whether the corpus domain has changed, wherein the previous sentence refers to the sentence preceding the current sentence; and
the received first feature is fused with the features extracted from the corpus of the current sentence, after which the domain to which the corpus of the current sentence belongs is judged according to the set domain classification of the corpus and output as the domain information to the speech recognition service device.
11. The system of claim 10, wherein
the domain classification of the corpus comprises: setting a label corresponding to the corpus of each domain to identify the domain classification of that corpus;
and/or
the recognition information of the previous sentence comprises: the state vector of the previous sentence that has already occurred and been recognized.
12. The system of claim 11, wherein
the constructed corpus domain classifier comprises a corpus domain change judgment module and a corpus domain classification module, the two modules being connected;
and training the corpus domain classifier with its training data comprises:
in the corpus domain change judgment module, performing first feature extraction on the input according to the state at the current time t, and calculating a loss function for judging whether the corpus domain has changed, wherein the state at the current time t is the state vector returned at time t-1, that is, the state vector of the sentence at time t-1 that has already occurred and been recognized;
and, in the corpus domain classification module, performing feature extraction on the input, fusing in the first feature, and calculating a loss function for judging the domain to which the corpus in the input belongs.
13. The system according to any one of claims 8 to 12, further comprising constructing training data for training the dynamic language model, specifically comprising:
randomly selecting sentence corpora from each domain, each sentence corpus serving as a training sample;
and converting all corpora in each training sample into units corresponding to the acoustic model as the input of the dynamic language model, using the texts of all corpora in each training sample as the output of the dynamic language model, and labeling the domain to which the corpus of each training sample belongs.
14. The system of claim 13, further comprising training the dynamic language model, specifically comprising:
in a domain information decoding module, converting the labeled domain information of each training sample and then performing feature coding;
in a language model module, extracting features from the input used for training the dynamic language model and fusing them with the feature code, switching the dynamic language model to the language algorithm corresponding to the domain indicated by the feature code in the fused features, performing feature decoding on the fused features, and calculating a loss function from the output used for training the dynamic language model and the feature-decoded text;
and determining whether a preset number of training iterations has been reached and/or whether the loss function has reached its convergence target; if so, the training of the dynamic language model is finished.
15. An electronic device comprising a processor and a memory storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause the processor to perform the method of any one of claims 1 to 7.
16. A computer-readable storage medium, characterized in that it stores one or more programs which, when executed by a processor, implement the method of any one of claims 1 to 7.
CN202110470132.0A 2021-04-28 2021-04-28 Speech recognition method, system, apparatus and medium Active CN113299283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110470132.0A CN113299283B (en) 2021-04-28 2021-04-28 Speech recognition method, system, apparatus and medium


Publications (2)

Publication Number Publication Date
CN113299283A (en) 2021-08-24
CN113299283B (en) 2023-03-10

Family

ID=77320444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110470132.0A Active CN113299283B (en) 2021-04-28 2021-04-28 Speech recognition method, system, apparatus and medium

Country Status (1)

Country Link
CN (1) CN113299283B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN105869635A (en) * 2016-03-14 2016-08-17 江苏时间环三维科技有限公司 Speech recognition method and system
CN106653007A (en) * 2016-12-05 2017-05-10 苏州奇梦者网络科技有限公司 Speech recognition system
US20190156817A1 (en) * 2017-11-22 2019-05-23 Baidu Usa Llc Slim embedding layers for recurrent neural language models
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant