CN110428819B - Decoding network generation method, voice recognition method, device, equipment and medium


Info

Publication number
CN110428819B
CN110428819B
Authority
CN
China
Prior art keywords
language model
decoding network
network
slot position
decoding
Prior art date
Legal status
Active
Application number
CN201910745811.7A
Other languages
Chinese (zh)
Other versions
CN110428819A (en)
Inventor
黄羿衡
贺利强
苏丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910745811.7A
Publication of CN110428819A
Application granted
Publication of CN110428819B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a decoding network generation method and an intelligent voice interaction system for voice recognition application scenes based on artificial intelligence technology, and proposes a decoding network architecture based on a differential structure. When a newly added entry requires the decoding network to be updated, only the differential language model corresponding to the target slot position needs to be updated. Because the basic decoding network only needs to be built once, the construction of a decoding network that supports oversized slot positions is realized: once a new instance needs to be added to the decoding network, only the differential language model corresponding to the slot position needs to be updated, without reconstructing a large decoding network, so that the iterative update speed of the decoding network can be accelerated to support service scenes with higher real-time requirements.

Description

Decoding network generation method, voice recognition method, device, equipment and medium
This application is a divisional application of the Chinese patent application with application number 201910424817.4, filed on May 21, 2019, entitled "Decoding network generation method, voice recognition method, device, equipment and medium".
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a decoding network generation method, a speech recognition method, an apparatus, a device, and a computer-readable storage medium.
Background
With the rapid development of voice recognition technology, various intelligent products supporting voice recognition functions, such as intelligent robots, intelligent vehicle-mounted devices, and the like, have gradually penetrated into various corners of the work and life of users, and such intelligent products provide more intelligent services for the users through the voice recognition functions.
In practical application, such intelligent products need to continuously update newly added entries in an application scene into a decoding network, so as to ensure that the intelligent products can adapt to the continuously changing scene in time; the decoding network is actually a state network, and the speech recognition process is actually a process of searching a path in the state network that is most matched with the speech, and the process is also called a decoding process.
In the related art, decoding is performed based on a static decoding network. A static decoding network means that all knowledge sources are compiled into one state network in a unified manner, and probability information is obtained from the transition weights between nodes during decoding. As a result, whenever newly added entries are generated, the whole decoding network needs to be reconstructed, and reconstructing the decoding network often takes a week, so the iterative update speed of the decoding network cannot keep up with the entry update speed of the application scene, which limits the update iteration of intelligent products.
Disclosure of Invention
The embodiment of the application provides a decoding network generation method, a decoding network generation device, decoding network generation equipment and a storage medium, which enable the decoding network to be updated without reconstructing the whole decoding network when a newly added entry is generated, thereby accelerating the iterative update speed of the decoding network.
In view of the above, a first aspect of the present application provides a decoding network generating method, including:
training according to a first training sample set to obtain a language model corresponding to a target slot position, and differencing the language model against a basic compressed language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position;
constructing a target decoding network according to the differential language model and the basic decoding network; the basic decoding network is generated by constructing a class language model and the basic compressed language model corresponding to a slot position in the class language model; the basic compressed language model is generated by cutting the language model corresponding to the slot position obtained by training with a second training sample set, wherein the second training sample set is a subset of the first training sample set.
A second aspect of the present application provides a speech recognition method, including:
acquiring a voice to be recognized;
decoding the voice through a basic decoding network in a decoding network, decoding through a differential language model corresponding to a slot position in the decoding network when the decoding reaches the slot position in the basic decoding network in the decoding process, and storing the current network state of the decoding network, the identification and the current network state of the differential language model corresponding to the slot position and the historical searching state of the decoding network; and when the slot decoding is completed, skipping to the basic decoding network to continue decoding until the last frame of the voice is decoded.
A third aspect of the present application provides a voice interaction system, comprising:
the voice acquisition equipment is used for acquiring voice input by a user through a microphone;
the voice recognition device is used for decoding the voice collected by the voice acquisition device through a basic decoding network in the decoding network, decoding through a differential language model corresponding to a slot position in the decoding network when the decoding reaches the slot position in the basic decoding network in the decoding process, and storing the current network state of the decoding network, the identification and the current network state of the differential language model corresponding to the slot position, and the historical search state of the decoding network; when the slot decoding is finished, skipping back to the basic decoding network to continue decoding until the last frame of the voice is decoded, so as to obtain a voice recognition result;
and the control device is used for executing the operation corresponding to the voice recognition result according to the voice recognition result obtained by the voice recognition device.
A fourth aspect of the present application provides a decoding network generating apparatus, including:
the differential language model generating module is used for obtaining a language model corresponding to a target slot position through training according to a first training sample set, and differencing the language model against a basic compressed language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position;
the decoding network construction module is used for constructing a target decoding network according to the slot position differential language model and the basic decoding network; the basic decoding network is generated by constructing a class language model and a slot position basic compressed language model corresponding to a slot position in the class language model; the slot position basic compressed language model is generated by cutting the language model corresponding to the slot position obtained by training with a second training sample set, the second training sample set is a subset of the first training sample set, and the target slot position is a slot position in the class language model.
A fifth aspect of the present application provides a speech recognition device, including:
the acquisition module is used for acquiring the voice to be recognized;
the identification module is used for decoding the voice through a basic decoding network in a decoding network, decoding the voice through a differential language model corresponding to a slot position in the decoding network when the decoding reaches the slot position in the basic decoding network in the decoding process, and storing the current network state of the decoding network, the identification and the current network state of the differential language model corresponding to the slot position and the historical search state of the decoding network; and when the slot decoding is completed, skipping to the basic decoding network to continue decoding until the last frame of the voice is decoded.
A sixth aspect of the present application provides an apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the steps of the decoding network generating method according to the first aspect or the steps of the speech recognition method according to the second aspect, according to instructions in the program code.
A seventh aspect of the present application provides a computer-readable storage medium for storing program code for performing the steps of the decoding network generating method of the first aspect or the speech recognition method of the second aspect.
An eighth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the decoding network generating method of the first aspect or the steps of the speech recognition method of the second aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a decoding network generation method, which provides a new decoding network architecture mode, constructs a new decoding network through a basic decoding network and a differential language model corresponding to a slot position in the basic decoding network, only needs to initially construct the basic decoding network once in an application scene, and only needs to reuse the basic decoding network once the basic decoding network is constructed, and combines a differential language model corresponding to a newly trained target slot position on the basis of the basic decoding network, namely, when a newly added entry in the application scene needs to update the decoding network, collects the newly added entry and an original entry as a first training sample set to train the language model corresponding to the target slot position, and then differentiates the language model and a basic compressed language model corresponding to the target slot position in the basic decoding network to obtain the differential language model corresponding to the target slot position, then, the decoding network can be updated by combining the newly trained differential language model on the basis of the initially constructed basic decoding network. The basic decoding network is generated by constructing a similar language model and a basic compression language model corresponding to a slot position in the similar language model; the basic compressed language model is generated by cutting a language model corresponding to a slot position obtained by training through a second training sample set, the second training sample set is a subset of the first training sample set, and the target slot position is a slot position in the class language model. By utilizing the decoding network generation method provided by the application, only a basic decoding network needs to be constructed once, and only the differential language model corresponding to the slot position needs to be updated when the decoding network is updated every time, the updating speed of the differential language model is very high and can reach the level of minutes, so that the iterative updating speed of the decoding network can be greatly improved, and the decoding network can be adapted to the continuously changing application scene in time.
Drawings
Fig. 1 is a schematic architecture diagram of a decoding network generation method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a decoding network generating method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a basic decoding network construction method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a constructed basic decoding network according to an embodiment of the present application;
fig. 5a is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5b is a schematic structural diagram of a voice interaction system according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a first decoding network generating device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a second decoding network generating apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a first speech recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a second speech recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a third speech recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a fourth speech recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the related technology, the components of the static decoding network structure are highly interdependent; when the static decoding network is dynamically updated based on newly added entries, each component of the static decoding network structure is dynamically updated along with the newly added entries, and the dynamic update of each component causes the memory occupied by the static decoding network to grow rapidly, in an exponential trend. Thus, when a large number of newly added entries need to be dynamically added to the static decoding network, the memory it occupies expands rapidly; for example, when 20 million newly added songs need to be added to the static decoding network, if these 20 million songs are directly and dynamically substituted into the original static decoding network, the static decoding network expands directly from 3M to 40G, and such a huge network structure cannot be generated normally as a decoding network.
Considering that dynamic addition of a newly added entry can cause rapid expansion of a memory occupied by a static decoding network, and the static decoding network cannot be directly updated on the basis of an original network structure, in the related art, when the static decoding network needs to be updated based on the newly added entry, the static decoding network generally needs to be reconstructed based on the original entry and the newly added entry, but reconstruction of the decoding network is very time-consuming, and it is difficult to ensure that the decoding network can adapt to a constantly changing application environment in time.
For example, assuming that the number n of slot-bearing edges (i.e., entries) in the fst network generated from the class language model is 10k, and the size M of the fst file corresponding to the language model of the slot is 20M, then if the decoding network is constructed in the conventional way, the model occupies at least 200G of memory; this memory explosion makes it impossible to generate the decoding network normally.
In order to solve the problems in the related art, the embodiment of the present application provides a decoding network generation method that proposes a new decoding network architecture, in which the decoding network is formed by a basic decoding network and the differential language models corresponding to the slot positions, and the basic decoding network is constructed from a class language model and the basic compressed language models corresponding to the slot positions in the class language model. When a newly added entry requires the decoding network to be updated, only the differential language model corresponding to the target slot position needs to be updated. The method realizes the update of the decoding network by differencing the language model corresponding to the target slot position against the basic compressed language model corresponding to the target slot position, without repeatedly constructing the basic decoding network, thereby greatly improving the update speed of the decoding network and enabling the decoding network to adapt in time to the continuously changing application environment; in addition, the network size of the decoding network constructed by the method is greatly reduced, which makes it convenient to deploy and implement.
Based on the above example, with the decoding network generation method of the present application, the language model corresponding to the slot is compressed in the basic decoding network, that is, the size M of the fst file corresponding to the slot language model is compressed down to about 10k, which greatly reduces the size n × M of the basic decoding network (on the order of 10k × 10k in the example above). When generating the decoding network, only the differential language model corresponding to the slot needs to be added on top of the basic decoding network, and the differential language model is only a few megabytes to a few tens of megabytes in size. Thus, the decoding network generation method of the present application can support the construction of a decoding network with oversized slots: because the basic decoding network only needs to be built once, whenever a new instance needs to be added to the decoding network, only the differential language model corresponding to the slot needs to be updated, without reconstructing a large decoding network. Since the update of the differential language model is very fast, the method can realize iterative updates of the decoding network very quickly, basically reaching a minute-level update speed, and can therefore support service scenes with higher real-time requirements.
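As a rough illustration of this arithmetic (restating only the figures quoted in the two examples above; these are illustrative numbers, not measurements):

```python
# Back-of-the-envelope size estimate using the figures quoted in the text.
slot_edges = 10_000              # n: slot-bearing edges in the class-LM fst network
full_slot_lm = 20 * 1024**2      # M: uncompressed slot language model fst (~20M)
compressed_slot_lm = 10 * 1024   # compressed slot language model (~10k)

conventional = slot_edges * full_slot_lm      # slot LM expanded at every slot edge
proposed = slot_edges * compressed_slot_lm    # only the compressed LM is embedded

print(f"conventional: ~{conventional / 1024**3:.0f} GB")        # roughly 200 GB
print(f"proposed basic network: ~{proposed / 1024**2:.0f} MB")  # roughly 100 MB
```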
Based on the new decoding network structure provided by the application, the application also provides a voice recognition method suitable for the decoding network structure. Specifically, after the voice to be recognized is obtained, the voice is decoded through a basic decoding network in the decoding network, in the decoding process, when the decoding reaches a slot position in the decoding network, the decoding is performed through a differential language model corresponding to the slot position in the decoding network, and the current network state of the decoding network, the identification of the slot position language model corresponding to the slot position node, the current network state and the historical search state are stored; after the slot position node finishes decoding, jumping back to a basic decoding network in the decoding network to continue decoding until the last frame of the voice to be recognized is decoded. The voice recognition method realizes the decoding of the differential language model corresponding to the entering slot position in the decoding process and the return of the differential language model corresponding to the leaving slot position to the basic decoding network decoding by recording the quadruple including the current network state of the decoding network, the identification of the slot position language model corresponding to the slot position, the current network state and the historical searching state, and ensures that the voice recognition is successfully completed based on the decoding network structure formed by the basic decoding network and the differential language model corresponding to the slot position.
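A minimal sketch of this bookkeeping is given below. The `base_network` and `slot_models` objects and their methods are hypothetical placeholders rather than an actual decoder API; the sketch only illustrates how the saved quadruple (decoding-network state, slot-model identification, slot-model state, historical search state) lets decoding enter a slot's differential language model and later jump back to the basic decoding network.

```python
# Minimal sketch of the slot-aware decoding loop described above.
# base_network, slot_models and their methods are hypothetical placeholders,
# not an actual decoder API.

def decode(frames, base_network, slot_models):
    current = base_network                # network currently being searched
    state = base_network.start()          # its current network state
    history = []                          # historical search state (partial result)
    saved = []                            # quadruples saved on slot entry

    for frame in frames:
        state, word = current.step(state, frame)   # advance decoding by one frame
        if word is not None:
            history.append(word)

        if current is base_network:
            slot_id = base_network.slot_at(state)
            if slot_id is not None:
                # Entering a slot: save (decoding-network state, slot-model id,
                # slot-model state, history) so decoding can jump back later.
                slot = slot_models[slot_id]
                saved.append((state, slot_id, slot.start(), list(history)))
                current, state = slot, slot.start()
        elif current.is_final(state):
            # Slot decoding finished: jump back to the basic decoding network.
            base_state, _slot_id, _slot_state, _hist = saved.pop()
            current, state = base_network, base_network.resume(base_state)

    return history
```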
It should be understood that the decoding network generation method provided by the embodiment of the present application may be applied to a device with data analysis processing capability, where the device may specifically be a terminal device or a server; the terminal device may be a computer, a Personal Digital Assistant (PDA), a tablet computer, a smart phone, or the like; the server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server or a cluster server.
It should be understood that the speech recognition method provided by the embodiment of the present application may be applied to a device capable of supporting operation of a decoding network, where the device may specifically be a terminal device or a server; the terminal device may specifically be a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a smart sound box, a smart robot, or other devices that can be controlled by voice; the server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server or a cluster server.
The scheme provided by the embodiment of the application relates to artificial intelligence automatic speech recognition technology; speech recognition needs to be completed based on a pre-constructed decoding network when the technology is implemented, and the efficiency and quality of constructing and updating the decoding network are the key difficulties in applying the technology.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, an implementation architecture of the decoding network generation method provided in the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation architecture of a decoding network generation method provided in the embodiment of the present application. As shown in fig. 1, the decoding network generation method provided in the embodiment of the present application is implemented based on a decoding network 1100, where the decoding network 1100 is constructed based on a basic decoding network 1110 and a differential language model 1120 corresponding to a slot; the basic decoding network 1110 is constructed and generated by the class language model 1111 and the basic compressed language model 1113 corresponding to the slot in the class language model. As shown in fig. 1, the slot-containing class language model 1111 includes a slot a, a slot B, and a slot C, where the slot a corresponds to the basic compression language model 1113a, the slot B corresponds to the basic compression language model 1113B, and the slot C corresponds to the basic compression language model 1113C. The differential language model 1120 correspondingly includes a differential language model corresponding to each slot in the class language model, and as shown in fig. 1, the differential language model 1120 includes a differential language model 1121 corresponding to slot a, a differential language model 1122 corresponding to slot B, and a differential language model 1123 corresponding to slot C.
Optionally, in order to improve the recognition accuracy of the decoding network, a universal language model 1112 may be further fused in the basic decoding network 1110, and the universal language model can better recognize the daily expressions, and the recognition performance of the basic decoding network is enhanced by performing interpolation processing on the class language model 1111 and the universal language model 1112.
It should be noted that, when training the basic compressed language model 1113 corresponding to each slot, it is usually necessary to first train the language model corresponding to the slot based on the second training sample set corresponding to the slot, and then cut the language model corresponding to the slot to generate the corresponding basic compressed language model 1113. For example, assuming that slot A is a song-name slot, when training the basic compressed language model 1113A corresponding to slot A, the language model corresponding to slot A first needs to be obtained by training with a second training sample set composed of the currently existing song names, and this language model is then cut to obtain the basic compressed language model 1113A corresponding to slot A; similarly, the basic compressed language model 1113B corresponding to slot B and the basic compressed language model 1113C corresponding to slot C may be obtained by training with the second training sample set corresponding to slot B and the second training sample set corresponding to slot C, respectively, in the manner described above.
It should be noted that, when a decoding network is initially constructed, the differential language model corresponding to each slot in the differential language model 1120 may be obtained based on the language model corresponding to each slot and the basic compressed language model, for example, the differential language model 1121 corresponding to slot a may be obtained by differentiating the language model 1113A corresponding to slot a and the basic compressed language model 1113A corresponding to slot a, the differential language model 1122 corresponding to slot B may be obtained by differentiating the language model 1113B corresponding to slot B and the basic compressed language model 1113B corresponding to slot B, and the differential language model 1123 corresponding to slot C may be obtained by differentiating the language model 1113C corresponding to slot C and the basic compressed language model 1113C corresponding to slot C.
When a newly added entry exists and the decoding network 1100 needs to be updated, the target slot corresponding to the newly added entry can first be determined, and a first training sample set is formed from the newly added entry and the second training sample set corresponding to the target slot; then, the language model corresponding to the target slot is obtained by training with the first training sample set, this language model is differenced against the basic compressed language model corresponding to the target slot in the basic decoding network 1110 to obtain the differential language model corresponding to the target slot, and finally the target decoding network 1200 is constructed according to the differential language model corresponding to the target slot and the basic decoding network 1110.
Still taking the slot position a as the song name slot position as an example, when the decoding network 1100 needs to be updated based on the new song name, a first training sample set can be formed by using the new song name and the song name in the second training sample set, and then the language model 1213A corresponding to the slot position a is obtained by using the first training sample set for training; further, the difference between the language model 1213A and the basic compressed language model 1113A corresponding to the slot a is used to obtain a difference language model 1221 corresponding to the slot a, and the target decoding network 1200 is obtained according to the difference language model corresponding to the slot a and the basic decoding network 1110.
It should be understood that, in practical applications, the language models corresponding to the plurality of slot positions may be obtained by training based on the first training sample set corresponding to the plurality of slot positions, and then the language models corresponding to the plurality of slot positions are used to differentiate the corresponding basic compressed language models, so as to update the decoding network 1100, and obtain the target decoding network 1200.
The following describes a decoding network generation method provided by the present application by an embodiment.
Referring to fig. 2, fig. 2 is a schematic flowchart of a decoding network generation method provided in the embodiment of the present application. For convenience of description, the following embodiments are described with a server as an execution subject, and it should be understood that the execution subject of the decoding network generation method is not limited to the server, and may be other devices with data analysis processing capability, such as a terminal device. As shown in fig. 2, the decoding network generating method includes the steps of:
step 201: and training according to a first training sample set to obtain a language model corresponding to the target slot position, and performing difference on the language model and a basic compression language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position.
When a newly added entry exists and the decoding network needs to be updated based on the newly added entry, the server may first determine which slot position in the class language model the newly added entry specifically corresponds to, and correspondingly take the slot position as a target slot position, and obtain a second training sample set corresponding to the target slot position, where the second training sample set includes training samples used for training a compressed language model corresponding to the target slot position in the basic decoding network. And then combining the newly added entries with the training samples in the second training sample set to form a first training sample set, and training by using the first training sample set to obtain the language model corresponding to the target slot position. And carrying out difference by using the language model corresponding to the target slot position and the basic compressed language model corresponding to the target slot position in the basic decoding network to obtain a difference language model corresponding to the target slot position.
It should be noted that a slot may specifically be understood as a variable in a command-type dialog. Taking a command to a smart speaker to play a song as an example, in the sentence "I want to listen to Zhang Xueyou's Kiss", the singer name "Zhang Xueyou" and the song name "Kiss" are variables in the play-music command; accordingly, the position corresponding to "Zhang Xueyou" in the sentence can be understood as the slot for the singer name, and the position corresponding to "Kiss" can be understood as the slot for the song name. It should be understood that, in practical applications, in addition to understanding the singer name and the song name in a song-playing command as slots, the video name in a video-playing command may also be understood as a slot, and the destination name in a navigation command may also be understood as a slot; no limitation is made on the specific name type corresponding to a slot.
Specifically, when the language model corresponding to the target slot is trained based on the first training sample set, some open source tools, such as srilm, can be directly used to obtain the language model corresponding to the target slot based on the first training sample set.
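For illustration, the sketch below merges the newly added entries with the second training sample set and calls SRILM's ngram-count from Python to train the slot language model; the file names are made up, and the options shown reflect common SRILM usage rather than a configuration disclosed in this application.

```python
# Sketch: train the language model of the target slot with SRILM's ngram-count.
# File names are hypothetical; the options shown are common SRILM usage, not the
# configuration described in the text.
import subprocess

def train_slot_lm(new_entries_path, second_sample_set_path, out_arpa):
    # First training sample set = newly added entries + original second sample set.
    merged = "first_training_set.txt"
    with open(merged, "w", encoding="utf-8") as out:
        for path in (second_sample_set_path, new_entries_path):
            with open(path, encoding="utf-8") as f:
                out.write(f.read())

    # Train an n-gram model on the merged samples (order 3 is an assumption).
    subprocess.run(
        ["ngram-count", "-order", "3", "-text", merged, "-lm", out_arpa],
        check=True,
    )

train_slot_lm("new_song_names.txt", "existing_song_names.txt", "song_slot.arpa")
```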
In order to facilitate understanding of the training process of the language model corresponding to the target slot position, the following takes the language model corresponding to the song name slot position as an example, and introduces the training process of the language model corresponding to the target slot position.
When a language model corresponding to the song name slot position needs to be trained, the server may first obtain a second training sample set corresponding to the song name slot position, where the second training sample set includes all song names used when a basic compression model corresponding to the song slot position in the basic decoding network is trained, then, form a first training sample set by using all newly added song names and all song names in the second training sample set, and train by using the first training sample set to obtain the language model corresponding to the song name slot position.
In a possible situation, the newly added entry currently exists corresponding to the same slot position in the class language model, namely only one target slot position exists; at this time, the first training sample is directly used for training to obtain the language model corresponding to the target slot position.
In another possible case, the newly added vocabulary entry currently existing respectively corresponds to a plurality of different slot positions in the class language model, namely a plurality of target slot positions exist simultaneously; at this time, first training sample sets corresponding to the target slot positions are respectively obtained, namely second training sample sets corresponding to the target slot positions are respectively obtained, and aiming at each target slot position, a newly added entry corresponding to each target slot position is combined with the second training sample sets to form the first training sample set corresponding to the target slot position; and then, respectively training to obtain the language model corresponding to each target slot position by utilizing the first training sample set corresponding to each target slot position.
After the language model corresponding to the target slot position is obtained through training based on the first training sample set, the server can difference it against the basic compressed language model corresponding to the target slot position in the basic decoding network, so as to obtain the differential language model corresponding to the target slot position. It should be noted that if language model A is the differential language model of language model B and language model C, then language models A, B and C satisfy formulas (1) and (2):

log P_A(w|H) = log P_B(w|H) - log P_C(w|H)    (1)

α_A(H) = α_B(H) - α_C(H)    (2)

where P_A(w|H) denotes the probability that language model A assigns to the occurrence of word w given the historical word sequence H, P_B(w|H) denotes the probability that language model B assigns to the occurrence of word w given H, and P_C(w|H) denotes the probability that language model C assigns to the occurrence of word w given H; α_A(H), α_B(H) and α_C(H) denote the back-off coefficients corresponding to the historical word sequence H in language models A, B and C, respectively. The entries that can be recognized by language model C are a subset of the entries that can be recognized by language model B, and the entries that can be recognized by language model A are the same as the entries that can be recognized by language model B.
That is, the differential language model obtained by performing differential processing on the basic compressed language model corresponding to the target slot position in the basic decoding network by using the language model corresponding to the target slot position is combined with the basic decoding network, and the recognized vocabulary entry is the same as the vocabulary entry recognized by the language model corresponding to the target slot position obtained by training based on the first training sample set.
When the decoding network is generated for the first time, the differential language model in the decoding network is obtained by performing differential processing on the language model obtained by training the second training sample set and the basic compressed language model obtained by cutting the language model. When the decoding network needs to be updated in the subsequent generation of the new entry, the server may update the difference language model in the decoding network by performing step 201.
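To make formulas (1) and (2) concrete, the following sketch builds a differential language model from two n-gram tables held as Python dictionaries; the dictionary representation is a simplification for illustration and is not the storage format used in this application.

```python
# Sketch of formulas (1) and (2): differential model A = model B "minus" model C.
# Each model maps an n-gram tuple to (log probability, back-off coefficient),
# a simplified stand-in for an ARPA-format language model.

def build_differential_lm(model_b, model_c):
    model_a = {}
    for ngram, (logp_b, alpha_b) in model_b.items():
        # Simplification: n-grams absent from C contribute 0 here; a full
        # implementation would evaluate C via its back-off rules instead.
        logp_c, alpha_c = model_c.get(ngram, (0.0, 0.0))
        # log P_A(w|H) = log P_B(w|H) - log P_C(w|H)   (1)
        # alpha_A(H)   = alpha_B(H)   - alpha_C(H)     (2)
        model_a[ngram] = (logp_b - logp_c, alpha_b - alpha_c)
    return model_a

# Tiny usage example with made-up values.
B = {("<s>",): (-0.30, -0.10), ("<s>", "Kiss"): (-1.20, 0.0)}
C = {("<s>",): (-0.35, -0.12), ("<s>", "Kiss"): (-1.50, 0.0)}
A = build_differential_lm(B, C)
print(A[("<s>", "Kiss")])   # approximately (0.30, 0.0)
```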
Step 202: constructing a target decoding network according to the differential language model and the basic decoding network; the basic decoding network is generated by constructing a similar language model and a basic compressed language model corresponding to a slot position in the similar language model; the basic compressed language model is generated by cutting the language model corresponding to the slot position obtained by training through a second training sample set, wherein the second training sample set is a subset of the first training sample set.
The server performs differential processing on the language model corresponding to the target slot position obtained by training based on the first training sample set and the basic compressed language model corresponding to the target slot position to obtain the differential language model corresponding to the target slot position, and then directly replaces the original differential language model corresponding to the target slot position in the original decoding network with the newly obtained differential language model.
In a specific implementation, the server may convert the differential language model into a tree structure network or a Weighted finite state machine (WFST) network, and as the first network, merge the first network and the basic decoding network to generate the target decoding network.
When converting the differential language model into a WFST network, the server may call the language model tool Kaldi to convert the differential language model into its corresponding WFST network. In practical application, of course, the server may invoke the language model tool Kaldi to convert the differential language model into the WFST network, and may invoke other language model tools to convert into the WFST network, where no limitation is imposed on the language model tool used for converting the WFST network.
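As a hedged illustration of this conversion step, the call below invokes Kaldi's arpa2fst command from Python; the paths, symbol table and options are assumptions made for the example, not details given in the text.

```python
# Sketch: convert an ARPA-format differential language model into a WFST using
# Kaldi's arpa2fst. Paths, symbol table and options are illustrative assumptions.
import subprocess

subprocess.run(
    [
        "arpa2fst",
        "--disambig-symbol=#0",
        "--read-symbol-table=words.txt",
        "diff_song_slot.arpa",
        "G_diff.fst",
    ],
    check=True,
)
```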
When the differential language model is converted into a tree structure, each n-gram in the tree structure can be represented by a 64-bit state, in which the depth of the n-gram in the tree structure and the historical word information it has traversed are recorded; based on the state corresponding to each entry of the tree structure, the n-gram score corresponding to a word to be queried can easily be looked up in the tree structure.
It should be noted that the tree structure is an efficient data structure for storing the language model and occupies less storage space than the WFST network. In addition, all the entries stored in the tree structure are sorted in advance according to their corresponding numerical values; when looking up a language model probability, binary search can be used directly, so the lookup efficiency is high.
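A minimal sketch of this lookup idea follows: entries are kept pre-sorted so a binary search finds the score of a queried word, and the 64-bit state packing depth and history information is shown only schematically (the field widths are assumptions).

```python
# Sketch: pre-sorted n-gram entries queried by binary search, plus a schematic
# 64-bit state packing depth and history information (field widths are assumed).
import bisect

def pack_state(depth, history_id):
    # Assumed layout: low 8 bits hold the depth of the n-gram in the tree,
    # the remaining bits hold an id for the traversed history words.
    return (history_id << 8) | (depth & 0xFF)

# Entries under one tree node, sorted in advance by word id.
words  = [3, 17, 42, 99]            # sorted word ids
scores = [-1.2, -0.7, -2.3, -0.4]   # corresponding n-gram log scores

def ngram_score(word_id):
    i = bisect.bisect_left(words, word_id)   # binary search over the sorted ids
    if i < len(words) and words[i] == word_id:
        return scores[i]
    return None                              # not found: fall back to back-off

state = pack_state(depth=2, history_id=7)
print(hex(state), ngram_score(42))           # 0x702 -2.3
```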
It should be noted that the basic decoding network is constructed based on the class language model and the basic compressed language model corresponding to each slot. When the class language model is trained, phrase blocking processing needs to be performed on a text sample to obtain a text sample of a phrase layer, then a text block belonging to a slot type in the text sample of the phrase layer is replaced by a slot name corresponding to the type to which the text block belongs, and then the class language model is trained by using the text sample with the slot name. And the basic compressed language model corresponding to the slot position is generated by cutting the language model corresponding to the slot position, and the language model corresponding to the slot position is obtained by utilizing the second training sample set corresponding to the slot position for training.
Optionally, in order to improve the accuracy of the speech recognition of the decoding network, the basic decoding network may further include a general language model, which is usually obtained based on natural language training used in daily life.
It should be understood that different slot positions correspond to different second training sample sets, and accordingly, the language models obtained by training based on the different second training sample sets are different, and further, the basic compressed language models generated by clipping the language models are also different, that is, different slot positions correspond to different basic compressed language models.
In practical applications, the class language model may include one slot or a plurality of slots. For example, in an application scenario in which destination navigation is performed by voice, since the voice command generally used by a user is "navigate to XXX (destination name)", the basic decoding network may be constructed based on a class language model that includes only a geographical-location-name slot. For another example, in an application scenario of playing music by voice, since the voice command generally used by a user may be "play XXX (audio name) by XXX (creator name)", the basic decoding network may be constructed based on a class language model including a creator-name slot and an audio-name slot; it should be understood that the creator name may specifically be a performer name, a lyricist name, an artist name, and the like, and the audio name may specifically be a song name.
The decoding network generation method develops a new decoding network architecture, the decoding network is composed of a basic decoding network and a differential language model corresponding to the slot position, wherein the basic decoding network is constructed and generated through a similar language model and a basic compression language model corresponding to the slot position in the similar language model. When a newly added entry needs to update the decoding network, only the differential language model corresponding to the target slot position needs to be updated. That is, the method performs differential processing on the language model corresponding to the target slot and the basic compressed language model corresponding to the target slot, so that the updating of the decoding network can be realized without reconstructing the decoding network, the iterative updating speed of the decoding network is greatly improved, and the decoding network can be adapted to the continuously changing application environment in time.
As can be seen from the description of the embodiment shown in fig. 2, the reason why the decoding network generation method provided in the embodiment of the present application can realize fast update of the decoding network is that a new decoding network structure is provided, and in order to further understand the decoding network generation method provided in the embodiment of the present application, a method for constructing a basic decoding network structure in the decoding network is introduced below.
Referring to fig. 3, fig. 3 is a schematic flowchart of a basic decoding network construction method provided in an embodiment of the present application. For convenience of description, the following embodiments are described with a server as an execution subject, and it should be understood that the execution subject of the basic decoding network construction method is not limited to the server, and may be other devices with data analysis processing capability, such as a terminal device. As shown in fig. 3, the basic decoding network construction method includes the following steps:
step 301: and determining a third training sample set, wherein the third training sample set comprises text samples with slot position names, and training according to the third training sample set to obtain a language-like model.
When the server constructs a basic decoding network structure, a third training sample set comprising a large number of text samples with slot position names needs to be obtained first, and then a class language model with slots is obtained through training according to the third training sample set.
In the related art, the server generally performs chunking on the text sample in units of single words, but chunking in units of single words makes the finally divided text sample too fragmented. For example, for the voice instruction "I want to listen to I Have Waited Until the Flowers Have Withered", the related art divides the instruction word by word, which produces a large number of separate words. The present application proposes the concept of a class language model with slots, in which the content in each slot belongs to the same category; accordingly, when training the class language model with slots, it needs to be trained based on phrase-level text samples. Still taking the voice instruction "I want to listen to I Have Waited Until the Flowers Have Withered" as an example, at the phrase level it is divided into "I want to listen to | I Have Waited Until the Flowers Have Withered", that is, several words belonging to the same category (here, a song name) are grouped into one phrase.
Based on this, when the server trains the class language model, it first obtains the text samples and then performs phrase-level chunking on the obtained text samples to obtain phrase-level text samples. Specifically, for a sentence w_1 w_2 w_3 ... w_n (where w_i denotes a word in the sentence, all of which are words included in the dictionary), the server divides the sentence into π(n) different blocks according to the categories included in the dictionary, where the phrases or words included in each block belong to the same category, and records the phrase-level text sample obtained after chunking as W_1 W_2 W_3 ... W_π(n).
Then, each text block belonging to a slot category in the phrase-level text samples is replaced with the slot name corresponding to the category to which the text block belongs. For example, after the text "I want to listen to Kiss" is divided into the phrase-level text sample "I want to listen to | Kiss", the server may further replace "Kiss" with the slot name corresponding to its category, that is, replace "Kiss" with the slot name SONG-SLOT corresponding to the song-name category, so as to obtain the text sample with slot name "I want to listen to SONG-SLOT". In this manner, a large number of text samples with slot names are obtained, and the third training sample set is formed from these text samples with slot names.
It should be understood that, in practical applications, according to application scenarios applicable to various voice commands, a text block belonging to a SLOT category in a text sample at a phrase level may be replaced with a SLOT name corresponding to the category to which the text block belongs, and the replacement is not limited to replacing a SONG name with a SONG-SLOT, and no limitation is made to the category to which the text block in the text sample at the phrase level belongs, nor is any limitation made to the SLOT name corresponding to each category.
The text samples with slot names in the third training sample set are then used for training to obtain the class language model, which may specifically be an n-gram language model.
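The slot-name replacement step can be sketched as follows, assuming a simple dictionary that maps known entries of each category to its slot name; the dictionary contents and the exact-match strategy are simplifications for illustration (the slot name SONG-SLOT follows the example above).

```python
# Sketch: replace category phrases in phrase-level text samples with slot names.
# The category dictionary and exact-match strategy are simplifications.

slot_dict = {
    "SONG-SLOT":   ["Kiss", "I Have Waited Until the Flowers Have Withered"],
    "SINGER-SLOT": ["Zhang Xueyou"],
}

def replace_slots(phrase_blocks):
    out = []
    for block in phrase_blocks:
        for slot_name, entries in slot_dict.items():
            if block in entries:
                block = slot_name
                break
        out.append(block)
    return out

# "I want to listen to | Kiss"  ->  "I want to listen to SONG-SLOT"
sample = ["I want to listen to", "Kiss"]
print(" ".join(replace_slots(sample)))   # I want to listen to SONG-SLOT
```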
It should be noted that, for a sentence W_1 W_2 W_3 ... W_π(n) composed of multiple phrases (where W_i denotes the block corresponding to each phrase), the process of determining the text to which it corresponds is actually a process of calculating the probability P(W_1 W_2 W_3 ... W_π(n)), and this probability can be decomposed as in formula (3):

P(W_1 W_2 ... W_π(n)) = P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) ... P(W_π(n)|W_1 W_2 ... W_π(n)-1)    (3)

where P(W_1) denotes the probability that the phrase W_1 occurs, P(W_2|W_1) denotes the probability that the phrase W_2 occurs given that the phrase W_1 has occurred, and P(W_π(n)|W_1 W_2 ... W_π(n)-1) denotes the probability that W_π(n) occurs given that the phrases W_1, W_2, ..., W_π(n)-1 have all occurred.

In the decomposition of the probability P(W_1 W_2 W_3 ... W_π(n)), each factor P(W_k|W_1 W_2 ... W_k-1) specifically satisfies relation (4):

P(W_k|W_1 W_2 ... W_k-1) = P(W_k|C_k) P(C_k|C_1 C_2 ... C_k-1)    (4)

where P(W_k|C_k) denotes the probability of W_k given that it belongs to class C_k, and P(C_k|C_1 C_2 ... C_k-1) denotes the probability that class C_k occurs given that the classes C_1, C_2, ..., C_k-1 have occurred.

Based on expressions (3) and (4), the probability P(W_1 W_2 W_3 ... W_π(n)) can be calculated; in order to reduce the amount of calculation, an n-gram language model can be used as the class language model with slots.

Specifically, the 1-gram language model assumes P(W_π(n)|W_1 W_2 ... W_π(n)-1) = P(W_π(n)|W_π(n)-1); accordingly, the probability P(W_1 W_2 W_3 ... W_π(n)) can be decomposed as in formula (5):

P(W_1 W_2 ... W_π(n)) = P(W_1) P(W_2|W_1) ... P(W_π(n)|W_π(n)-1)    (5)

The 2-gram language model assumes P(W_π(n)|W_1 W_2 ... W_π(n)-1) = P(W_π(n)|W_π(n)-1 W_π(n)-2); accordingly, the probability P(W_1 W_2 W_3 ... W_π(n)) can be decomposed as in formula (6):

P(W_1 W_2 ... W_π(n)) = P(W_1) P(W_2|W_1) ... P(W_π(n)|W_π(n)-1 W_π(n)-2)    (6)

The 3-gram language model assumes P(W_π(n)|W_1 W_2 ... W_π(n)-1) = P(W_π(n)|W_π(n)-1 W_π(n)-2 W_π(n)-3); accordingly, the probability P(W_1 W_2 W_3 ... W_π(n)) can be decomposed as in formula (7):

P(W_1 W_2 ... W_π(n)) = P(W_1) P(W_2|W_1) ... P(W_π(n)|W_π(n)-1 W_π(n)-2 W_π(n)-3)    (7)
experimental research shows that the larger the value of n, the better the model performance, but the larger the calculated amount, and the better the effect can be achieved by setting the value of n to 2 or 3 under normal conditions.
Step 302: and training according to a second training sample set corresponding to the slot position in the similar language model to obtain a language model corresponding to the slot position, and cutting the language model corresponding to the slot position to obtain a basic compression language model corresponding to the slot position.
And acquiring a second training sample set corresponding to the slot position in the similar language model, then training by using the second training sample set to obtain a language model corresponding to the slot position, and further cutting the language model corresponding to the slot position to obtain a basic compression language model corresponding to the slot position.
It should be understood that in practical applications, the class language model may include one slot or a plurality of slots. When the similar language model only comprises one slot position, the corresponding language model is obtained by directly training based on the second training sample set corresponding to the slot position, and then the basic compression language model corresponding to the slot position can be obtained by cutting the language model. When the similar language model comprises a plurality of slot positions, the similar language model needs to be trained respectively to obtain a language model corresponding to each slot position based on a second training sample set corresponding to each slot position, and then the language model corresponding to each slot position is cut to obtain a basic compression language model corresponding to each slot position.
It should be noted that the slot language model corresponding to each slot may actually be an n-gram language model, and accordingly, when the language model corresponding to the slot is trained by using the second training sample set, word segmentation processing needs to be performed on each training sample in the second training sample set, and then the n-gram language model is trained based on the training samples obtained after the word segmentation processing.
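The paragraph above can be illustrated with a minimal sketch: each training sample in the second training sample set is first segmented into words, and 2-gram counts are then accumulated from the segmented samples. The segmentation function here is a naive whitespace split and stands in for a real word segmenter; it is an assumption made only for illustration.

from collections import Counter

def segment(sample):
    # Stand-in for a real word segmenter; assumes samples are already space-separated.
    return sample.split()

def train_bigram_counts(second_training_samples):
    unigrams, bigrams = Counter(), Counter()
    for sample in second_training_samples:
        words = ["<s>"] + segment(sample) + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    # Maximum-likelihood 2-gram probability; smoothing (e.g., Kneser-Ney) would be added in practice.
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0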
It should be understood that when the language model corresponding to the slot is cut, the slot language model may be cut by using the language model tool SRILM, or by using language model tools such as KenLM, IRSTLM and MITLM; the language model tool used for cutting the slot language model is not limited in any way.
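For reference, cutting with SRILM can be driven from a small script such as the one below; the file names and the pruning threshold are placeholders, and entropy-based pruning via the ngram -prune option is only one possible way to cut the slot language model mentioned above.

import subprocess

def prune_arpa(in_arpa="slot_lm.arpa", out_arpa="slot_lm.pruned.arpa", threshold="1e-7"):
    # Entropy pruning with SRILM's ngram tool: n-grams whose removal changes the
    # model by less than the threshold are dropped, yielding a compressed model.
    subprocess.run(
        ["ngram", "-order", "3", "-lm", in_arpa, "-prune", threshold, "-write-lm", out_arpa],
        check=True,
    )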
It should be noted that, in practical applications, step 301 may be executed before step 302, step 302 may be executed before step 301, or step 301 and step 302 may be executed simultaneously; the execution order of step 301 and step 302 is not limited in any way.
Step 303: and generating a basic decoding network according to the similar language model and the basic compression language model corresponding to the slot position.
After the server obtains the class language model through the training in step 301 and obtains the basic compressed language model corresponding to each slot position through the training in step 302, the server can directly generate the basic decoding network according to the class language model and the basic compressed language model corresponding to each slot position. Specifically, the server can obtain the basic decoding network by correspondingly embedding the basic compressed language model of each slot position into that slot position in the class language model.
In order to improve the speech recognition accuracy of the decoding network, the server interpolates the generic language model into the basic decoding network in the process of generating the basic decoding network. Specifically, the server may perform interpolation processing on the similar language model and the general language model to obtain an interpolation language model, and cut the interpolation language model to obtain a corresponding compressed interpolation language model. And then, generating a basic decoding network according to the compressed interpolation language model and the basic compressed language model corresponding to the slot position obtained in the step 302.
In specific implementation, after the server trains to obtain the similar language model, the similar language model and the general language model can be interpolated, that is, the similar language model and the general language model are subjected to linear weighting processing to obtain the interpolated language model, and then the interpolated language model is cut by adopting a specific language model tool to obtain the compressed interpolated language model corresponding to the interpolated language model. After obtaining the compressed interpolation language model, the server may correspondingly embed the basic compressed language model corresponding to the slot position obtained in step 302 into each slot position of the similar language model in the compressed interpolation language model, so as to obtain the basic decoding network.
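The linear weighting mentioned above can be sketched as follows for two models represented as n-gram probability tables; the interpolation weight of 0.5 and the table representation are assumptions for illustration, and in practice the interpolation and the subsequent cutting would be carried out with a language model tool as noted in the surrounding text.

def interpolate(class_lm, generic_lm, lam=0.5):
    """Linear interpolation: P_interp(w|h) = lam * P_class(w|h) + (1 - lam) * P_generic(w|h)."""
    merged = {}
    for ngram in set(class_lm) | set(generic_lm):
        merged[ngram] = lam * class_lm.get(ngram, 0.0) + (1.0 - lam) * generic_lm.get(ngram, 0.0)
    return merged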
It should be noted that the above-mentioned general language model is usually obtained by training on natural language used in daily life. In general, the general language model, which may specifically be an n-gram language model, can be obtained by training with an open-source tool such as SRILM on various types of text downloaded from the network.
When the interpolation language model is cut, it may be cut by using the language model tool SRILM, or by using language model tools such as KenLM, IRSTLM and MITLM; the language model tool used for cutting the interpolation language model is not limited in any way.
When a basic decoding network is specifically constructed, the server can convert the similar language model into a WFST network to serve as a basic main network, and convert the basic compression language models corresponding to the slot positions in the similar language model into corresponding WFST networks to serve as sub-networks corresponding to the slot positions; and then, respectively embedding the sub-networks corresponding to the slots in the class language model into the basic main network to obtain a basic decoding network.
Specifically, the server may invoke the speech recognition toolkit Kaldi to convert the class language model and the basic compressed language model corresponding to each slot into the corresponding WFST networks. During the conversion, the server may first convert the class language model to obtain the basic main network and then convert the basic compressed language model corresponding to each slot to obtain each sub-network, or first convert the basic compressed language models corresponding to the slots to obtain the sub-networks and then convert the class language model to obtain the basic main network, or convert them simultaneously to obtain the basic main network and the sub-networks respectively.
It should be understood that, in practical application, in addition to calling the language model tool Kaldi to convert the class language model and the basic compression language model corresponding to each slot into the WFST network, other language model tools may also be called to convert into the WFST network, and the language model tool used for converting the WFST network is not limited at all.
Furthermore, the server may call the OpenFst tool fstreplace to replace each slot in the WFST network corresponding to the class language model with the WFST network corresponding to that slot, that is, each sub-network is correspondingly embedded into the basic main network to obtain the basic decoding network. The specific operation is as follows:
fstreplace --call_arc_labeling=both --return_label=id_rlabel --return_arc_labeling=both root.fst upper_bound_id class.fst class_slot_id root+class.fst &
wherein id_rlabel is the id of the disambiguation symbol corresponding to the original slot after replacement, class_slot_id is the id of the original slot, and upper_bound_id is an upper bound on all symbol ids in the WFST (one larger than any existing id, so it cannot be reached). After the replacement is completed, considering that the slot symbol itself has no pronunciation, the server needs to replace the symbol corresponding to each slot with the corresponding disambiguation symbol, so that the disambiguation symbol corresponding to each slot can be handled in the subsequent process of optimizing the WFST network. For example, assuming that the slot the server needs to replace is SONG-SLOT, the id of SONG-SLOT in the WFST is 2000, and the id of the corresponding disambiguation symbol #SONG-SLOT is 3300.
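Conceptually, the replacement performed by fstreplace corresponds to splicing a sub-network in place of the arc that carries the slot symbol and relabeling the return arc with the slot's disambiguation symbol. The toy sketch below illustrates this on a minimal arc-list representation of a WFST; the data structure and the ids (2000 for SONG-SLOT, 3300 for #SONG-SLOT) follow the example above, but the code itself is only an illustration and not the OpenFst implementation.

# Arcs are (src_state, in_label, out_label, weight, dst_state); states are ints.
def replace_slot(main_arcs, sub_arcs, slot_id=2000, disambig_id=3300, state_offset=100):
    # Assumes a single slot arc and that the sub-network's final state is 1, purely for illustration.
    result = []
    for src, ilab, olab, w, dst in main_arcs:
        if ilab != slot_id:
            result.append((src, ilab, olab, w, dst))
            continue
        # Enter the sub-network instead of consuming the slot symbol directly (epsilon arc).
        result.append((src, 0, 0, w, 0 + state_offset))
        for s, i, o, sw, d in sub_arcs:
            result.append((s + state_offset, i, o, sw, d + state_offset))
        # Return arc labeled with the disambiguation symbol, since the slot itself has no pronunciation.
        result.append((1 + state_offset, disambig_id, disambig_id, 0.0, dst))
    return result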
It should be noted that, if an interpolation language model is obtained by performing interpolation processing on the class language model and the general language model in the process of building the basic decoding network, and the corresponding compressed interpolation language model is obtained by cutting the interpolation language model, accordingly, when the basic decoding network is built based on the compressed interpolation language model, the compressed interpolation language model can be converted into a WFST network by using a related language model tool, and the basic decoding network is obtained by replacing the sub-network corresponding to each slot with the WFST network corresponding to the compressed interpolation language model.
For the convenience of understanding the basic decoding network construction method shown in fig. 3, the basic decoding network construction method will be described below by taking an example of training the basic decoding network by using a "play white bird" sample.
Referring to fig. 4, fig. 4 is a schematic diagram corresponding to the basic decoding network construction method provided in the present application. In fig. 4, 410 shows the WFST network (i.e., the basic main network) corresponding to the class language model, which includes a slot corresponding to the song name; 420 shows the WFST network (i.e., the sub-network) corresponding to the basic compressed language model, which is obtained by cutting the language model trained on "white bird" in the second training sample set; 430 shows the basic decoding network obtained by embedding the sub-network shown at 420 into the basic main network shown at 410 in place of the corresponding slot.
The numbers in the circles in fig. 4 are used to represent the state numbers in the WFST network, and the content on the arrows represents the symbols of the input and output words, which also include the corresponding weights, which represent the log probability value and the back-off probability value of the corresponding language model.
According to the embodiment of the application, the basic decoding network is constructed based on the class language model and the basic compressed language model corresponding to each slot position; when a newly added entry requires the decoding network to be updated, the basic decoding network in the decoding network can continue to be used directly without any modification, that is, the whole decoding network does not need to be reconstructed, which greatly improves the iterative updating speed of the decoding network.
Based on the decoding network structure provided by the embodiment of the application, the application correspondingly provides a speech recognition method applying the decoding network structure in practical application.
Referring to fig. 5a, fig. 5a is a schematic flowchart of a speech recognition method according to an embodiment of the present application. For convenience of description, the following embodiments are described with a server as an execution subject, and it should be understood that the execution subject of the voice recognition method is not limited to the server, and may be other devices with a voice recognition function, such as a terminal device.
In order to facilitate understanding of the speech recognition method provided in the embodiment of the present application, a basic flow of speech recognition is described first.
In practice, speech recognition converts a speech signal into the text information corresponding to the speech signal. A current speech recognition system mainly includes four major parts: feature extraction, the acoustic model, the language model, and the dictionary together with decoding. Feature extraction is mainly used to convert the sound signal from the time domain to the frequency domain and to provide suitable feature vectors for the acoustic model; the acoustic model then calculates, based on acoustic characteristics, the score of each feature vector on the acoustic features. The language model calculates, according to linguistic theory, the probability of the phrase sequences that may correspond to the sound signal. Finally, decoding is performed on the phrase sequences according to the existing dictionary to obtain the text information that most probably corresponds to the sound signal.
When speech recognition is carried out specifically, the silence at the head and tail ends in the sound signal needs to be cut off through preprocessing operation to reduce the interference of the silence on the subsequent steps, then sound framing is carried out, namely, the sound is divided into a plurality of small sections, each small section of sound signal can be called as a frame, the sound framing is usually realized by using a moving window function, and an overlapping area exists between the frames. Then, each frame of sound waveform is converted into a multi-dimensional vector (hereinafter referred to as a feature vector) containing sound information through feature extraction, and the feature vector is further processed through an acoustic model to obtain corresponding phoneme information. The dictionary stores the corresponding relation between the words or the words and the phonemes, the language model can determine the probability of mutual correlation between the single words or the words, finally, the decoding network is utilized to process the audio data obtained after the characteristic extraction based on the acoustic model, the dictionary and the language model, and the words corresponding to the input sound signals are output.
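A minimal sketch of the framing step described above is given below; the 25 ms frame length and 10 ms frame shift are typical values assumed only for illustration, and feature extraction (e.g., MFCC or filter-bank features) would then be applied to each frame.

import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames using a moving (Hamming) window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len] * window)
    return np.array(frames)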
It should be noted that, the present application is mainly directed to an improvement of a decoding network in a speech recognition system, that is, a new decoding network structure is constructed based on the methods shown in fig. 2 and fig. 3, and the audio data after feature extraction is processed by using the decoding network structure and combining an acoustic model and a dictionary to obtain characters corresponding to the audio data.
Next, a speech recognition method provided by an embodiment of the present application is described with reference to fig. 5a, as shown in fig. 5a, the method includes the following steps:
step 501: and acquiring the voice to be recognized.
In practical application, a user can instruct the intelligent device to execute an operation corresponding to a voice signal by inputting the voice signal into the intelligent device, and after receiving the voice signal input by the user, the intelligent device can transmit the voice signal to a server through a network so as to recognize text information corresponding to the voice signal by using the server.
For example, in practical applications, a user may input a voice signal "play white bird" to the smart speaker to instruct the smart speaker to search and play a song "white bird", and accordingly, after the smart speaker receives the voice signal "play white bird", the voice signal "play white bird" is transmitted to the server through the network, so that the server recognizes text information corresponding to the voice signal "play white bird".
It should be noted that, in the case that the voice recognition system is operated in the terminal device, after the terminal device receives the voice signal input by the user, the terminal device can independently complete the voice recognition process, and the voice signal to be recognized does not need to be transmitted to the server.
Step 502: decoding the voice through a basic decoding network in a decoding network, decoding through a differential language model corresponding to a slot position in the decoding network when the decoding reaches the slot position in the basic decoding network in the decoding process, and storing the current network state of the decoding network, the identification and the current network state of the differential language model corresponding to the slot position and the historical searching state of the decoding network; and when the slot node completes decoding, skipping to the basic decoding network to continue decoding until the last frame of the voice is decoded.
After receiving the voice signal transmitted by the terminal equipment, the server correspondingly utilizes the basic decoding network in the decoding network running in the server to decode the voice.
It should be noted that the decoding network used here is a decoding network generated based on the method shown in fig. 2 and 3, and the decoding network includes a basic decoding network and a differential language model corresponding to each slot position, where the basic decoding network is constructed and generated by a similar language model and a basic compressed language model corresponding to each slot position in the similar language model, and the basic decoding network may further include a general language model; the differential language model is obtained by differentiating the basic compressed language model of the target slot position by utilizing the language model obtained by training based on the first training sample set corresponding to the target slot position.
The basic decoding network may be a WFST network, and the differential language model may be a WFST network or a tree structure.
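When the differential language model is held as a tree structure rather than a WFST, it can be thought of as a prefix tree over n-gram histories that is looked up during decoding. The sketch below shows one such hypothetical layout; the exact contents of a differential language model are not spelled out here, so the stored value is simply assumed to be a (differential) log probability of a word given its history.

class TrieNode:
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.logprob = None  # differential log probability stored at this node, if any

class DiffLMTree:
    def __init__(self):
        self.root = TrieNode()

    def add(self, history_plus_word, logprob):
        node = self.root
        for w in history_plus_word:          # e.g., ("<s>", "white bird")
            node = node.children.setdefault(w, TrieNode())
        node.logprob = logprob

    def lookup(self, history_plus_word):
        node = self.root
        for w in history_plus_word:
            node = node.children.get(w)
            if node is None:
                return None                  # in practice, fall back to a shorter history
        return node.logprob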
Based on the above introduced working principle of the speech recognition system, in practical application, after receiving a speech signal transmitted from a terminal device, a server usually performs preprocessing operations such as mute cutting of the head and the tail of the speech signal and framing of sound, and then converts the speech signal into a feature vector through a feature extraction operation, and further determines text information corresponding to the speech signal based on an acoustic model and a dictionary by using a decoding network.
In the decoding process, when the decoding reaches the slot position in the decoding network, namely when the current recognized speech frame is determined to belong to the category corresponding to a certain slot position node, the server correspondingly utilizes the differential language model corresponding to the slot position to decode, and in the process, the server correspondingly stores the quadruple information consisting of the current network state of the decoding network, the identification of the differential language model corresponding to the slot position, the current network state and the historical search state. After the current voice frame is decoded by the slot position, the server can correspondingly jump back to a basic decoding network in the decoding network according to the quadruple information to continuously decode other voice frames behind the voice frame until the last voice frame is decoded.
It should be noted that, in practical applications, if the differential language model corresponding to the slot position is currently needed to be used for decoding, a historical search state backup is needed to backup a state already searched in the basic decoding network, so that after the decoding is completed by using the differential language model corresponding to the slot position, the state can be returned to the basic decoding network to continue the search.
It should be noted that, in the decoding process, switching between the basic decoding network and the differential language model is usually encountered, and when a sub-network corresponding to a SLOT (which may also be understood as a differential language model corresponding to the SLOT) is entered, if a category symbol corresponding to the SLOT is encountered in the decoding process, such as a SONG-SLOT, it indicates that a related search needs to be performed on the differential language model corresponding to the SLOT at this time, and accordingly, the current network state of the decoding network in the quadruplet needs to be updated to id corresponding to the differential language model, and the current network state is set to the initialization state corresponding to the differential language model, and the state corresponding to the decoded basic decoding network is added to the historical search state. When the differential language model corresponding to the SLOT is left, if the decoding meets # SONG-SLOT, the differential language model corresponding to the SLOT is decoded, at this time, a score corresponding to a sentence ending symbol needs to be added, meanwhile, the current network state of the decoding network is switched back to the id corresponding to the basic decoding network, and meanwhile, based on the basic decoding network state backed up in the historical search state, the decoding network is switched back to the basic decoding network for continuous decoding.
To facilitate understanding of the decoding process described in step 502, the following description will be made by taking decoding of the speech "play white bird" as an example.
The current network state of the decoding network is initially 0, the identifier of the differential language model corresponding to the slot is 0, the current network state is the state corresponding to the sentence start symbol <s>, and the historical search state is null, which means that no backup is currently needed; that is, the quadruple at this time is {0, lm_id = 0, state corresponding to "<s>", historical search state null}. When the speech frame corresponding to "play" is decoded, since a valid speech signal is recognized, the current network state of the decoding network is updated to 1; since decoding has not yet entered the differential language model corresponding to any slot, the identifier of the differential language model is still 0, the current network state is the state corresponding to "<s> play", and the historical search state is still null, i.e., the quadruple is {1, lm_id = 0, state corresponding to "<s> play", historical search state null}. Decoding continues; when the output SONG-SLOT is encountered, it indicates that the differential language model corresponding to the slot SONG-SLOT is about to be entered, and the quadruple is updated to {3, lm_id = 1 (the identifier corresponding to the slot SONG-SLOT), state corresponding to "<s>" in the differential language model, historical search state "<s> play SONG-SLOT" in the class language model}. Decoding continues; when the speech frame corresponding to "white bird" is decoded, the quadruple is updated to {4, lm_id = 1 (the identifier corresponding to the slot SONG-SLOT), state corresponding to "<s> white bird" in the differential language model, historical search state "<s> play SONG-SLOT" in the class language model}. Decoding continues; when #SONG-SLOT is encountered, it indicates that decoding is about to jump out of the differential language model corresponding to the slot SONG-SLOT, and the quadruple is updated to {2, lm_id = 0, state corresponding to "<s> play SONG-SLOT </s>" in the class language model (where </s> represents the sentence end symbol), historical search state null}, and decoding ends.
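The quadruple bookkeeping traced above can be summarized in a short sketch; the state values and identifiers mirror the "play white bird" example, while the class and function names themselves are illustrative only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Quadruple:
    network_state: int          # current state id of the decoding network
    lm_id: int                  # 0 = basic decoding network, >0 = differential LM of a slot
    lm_state: str               # current state inside the active network
    history: List[str] = field(default_factory=list)  # backed-up search states of the basic network

def enter_slot(q, slot_lm_id, new_network_state):
    q.history.append(q.lm_state)             # back up the state already searched in the basic network
    return Quadruple(new_network_state, slot_lm_id, "<s>", q.history)

def leave_slot(q, new_network_state):
    resumed = q.history.pop()                 # jump back to the backed-up basic-network state
    return Quadruple(new_network_state, 0, resumed, q.history)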
It should be noted that, in practical applications, many application scenarios involve performing voice recognition on a voice command input by a user, so as to trigger an operation related to the voice command according to a result obtained by the voice recognition. The embodiment of the present application provides three exemplary speech recognition application scenarios, in which the speech recognition method described above can be used to recognize the speech instruction input by the user.
In a first possible implementation manner, the speech recognition method provided by the embodiment of the present application may be applied to an application scenario in which a control terminal (e.g., a smart speaker) plays music. In order to be suitable for such an application scenario, when a decoding network is generated, a basic decoding network needs to be constructed based on a class language model including an author slot and an audio name slot, and it should be understood that a differential language model corresponding to the author slot is mainly used for identifying names of audio authors, such as singer names, word author names, song author names, and the like, and a differential language model corresponding to the audio name slot is mainly used for identifying audio names, such as song names, and the like.
Through the decoding network constructed based on the similar language model including the creator slot and the audio name slot, the server adopts the voice recognition method shown in fig. 5a, recognizes the voice to be recognized sent by the intelligent device to obtain the corresponding recognition result, correspondingly generates the control instruction matched with the recognition result according to the recognition result, and then sends the control instruction to the intelligent device to instruct the intelligent device to play the target audio.
For example, suppose that a user inputs a voice command "play white bird" through the smart speaker, the smart speaker will correspondingly transmit the voice command "play white bird" to the server, and the server performs voice recognition on the voice command "play white bird" based on a pre-generated decoding network to obtain a corresponding character recognition result. The server can determine that the target audio is 'white bird' according to the character recognition result, after the target audio is searched in the database for storing audio data, the target audio is added into a control instruction for controlling the intelligent sound box to play the target audio, the control instruction is sent to the intelligent sound box, and after the intelligent sound box receives the control instruction, the target audio 'white bird' carried in the control instruction is correspondingly played.
It should be understood that, in practical applications, in addition to the realization of playing the target audio according to the voice instruction through the form of interaction between the terminal and the server, the terminal may also independently complete the operation of playing the target audio according to the voice instruction, that is, under the condition that a voice recognition system (including a decoding network) is operated in the terminal, the terminal may independently perform voice recognition on the voice instruction input by the user, and further generate a control instruction according to the voice recognition result, and control the playing of the target audio in the voice instruction.
In a second possible implementation manner, the voice recognition method provided by the embodiment of the present application may be applied to an application scenario in which a control terminal (e.g., a smart television) plays a video. In order to be suitable for such application scenarios, when a decoding network is generated, a basic decoding network needs to be constructed based on a class language model including a video name slot, and it should be understood that a differential language model corresponding to the video name slot is mainly used for identifying video names, such as a tv series name, a movie name, and the like.
Through the decoding network constructed based on the similar language model including the video name slot, the server adopts the voice recognition method shown in fig. 5a, after the voice to be recognized sent by the terminal is recognized to obtain the corresponding recognition result, the control instruction matched with the voice to be recognized is correspondingly generated according to the recognition result, and then the control instruction is sent to the terminal so as to instruct the intelligent device to play the target video.
For example, suppose that the user inputs a voice command "play Game of Thrones" through the smart television; the smart television will correspondingly transmit the voice command to the server, and the server performs voice recognition on the voice command based on a pre-generated decoding network to obtain the corresponding text recognition result. The server can determine from the text recognition result that the target video is "Game of Thrones"; after searching out the target video in the database for storing video data, the server adds the target video to a control instruction for controlling the smart television to play the target video and sends the control instruction to the smart television, and after receiving the control instruction, the smart television correspondingly plays the target video "Game of Thrones" carried in the control instruction.
It should be understood that, in practical applications, in addition to the realization of playing the target video according to the voice instruction through the form of interaction between the terminal and the server, the terminal may also independently complete the operation of playing the target video according to the voice instruction, that is, under the condition that a voice recognition system (including a decoding network) is operated in the terminal, the terminal may independently perform voice recognition on the voice instruction input by the user, and further generate a control instruction according to the voice recognition result, and control the playing of the target video in the voice instruction.
In a third possible implementation manner, the voice recognition method provided by the embodiment of the present application may be applied to an application scenario in which a control terminal (e.g., an intelligent navigation device) performs route navigation. In order to be suitable for such application scenarios, when a decoding network is generated, a basic decoding network needs to be constructed based on a class language model including a geographical location name slot, and it should be understood that a differential language model corresponding to the geographical location name slot is mainly used for identifying a geographical location name, such as a specific mall name, a park name, and the like.
Through the decoding network constructed based on the similar language model including the geographical location name slot, the server adopts the voice recognition method shown in fig. 5a, after recognizing the voice to be recognized sent by the intelligent device to obtain the corresponding recognition result, correspondingly generates a control instruction matched with the recognition result according to the recognition result, and then sends the control instruction to the intelligent device so as to instruct the intelligent device to navigate according to the target geographical location.
For example, suppose that the user inputs a voice command "navigate to Chaoyang Joy City" through the intelligent navigation device; the intelligent navigation device will correspondingly transmit the voice command to the server, and the server performs voice recognition on the voice command based on a pre-generated decoding network to obtain the corresponding text recognition result. The server can determine from the text recognition result that the target geographic position is "Chaoyang Joy City", adds the target geographic position to a control instruction and sends it to the intelligent navigation device, and the intelligent navigation device correspondingly determines a navigation route from the current position to the target geographic position "Chaoyang Joy City" according to the control instruction.
It should be understood that, in practical applications, in addition to the implementation of route navigation according to the voice instruction through the form of interaction between the terminal and the server, the terminal may also independently complete the operation of route navigation according to the voice instruction, that is, in the case that a voice recognition system (including a decoding network) is operated in the terminal, the terminal may independently perform voice recognition on the voice instruction input by the user, and further determine the navigation route from the current position to the target geographical position according to the target geographical position corresponding to the voice recognition result.
It should be noted that the speech recognition method provided in the embodiment of the present application is not limited to be applied to the above three application scenarios, and in practical application, the speech recognition method may be applied to various application scenarios requiring speech recognition, and the speech recognition method provided in the embodiment of the present application is not limited in any way.
Aiming at the decoding network construction method shown in the figures 2 and 3, the application adaptively provides a speech recognition method realized based on the decoding network structure, and the speech recognition method realizes the decoding of the differential language model corresponding to the entering slot position in the decoding process and the return of the differential language model corresponding to the leaving slot position to the basic decoding network decoding by recording the quadruple comprising the current network state of the decoding network, the identification of the slot position language model corresponding to the slot position, the current network state and the historical search state, thereby ensuring that the speech recognition is successfully completed based on the decoding network structure formed by the basic decoding network and the differential language model corresponding to the slot position.
It should be noted that the speech recognition method shown in fig. 5a is generally applied to the speech recognition interactive system provided in the embodiment of the present application, and referring to fig. 5b, fig. 5b is a schematic structural diagram of the speech interactive system provided in the embodiment of the present application. As shown in fig. 5b, the voice interaction system includes: the voice collecting device 510, the voice recognition device 520 and the control device 530 may interact with each other through wireless signals or wired signals.
The voice collecting device 510 is used for collecting voice input by a user through a microphone; and the voice collecting device 510 may also transmit the voice signal collected by itself to the voice recognition device 520.
The speech recognition device 520 is configured to execute the speech recognition method shown in fig. 5a, specifically, the speech recognition device 520 is configured to decode the speech acquired by the speech acquisition device 510 through a basic decoding network in the decoding network, and in the decoding process, when the decoding reaches a slot in the basic decoding network, decode the speech through a differential language model corresponding to the slot in the decoding network, and maintain a current network state of the decoding network, an identifier and a current network state of the differential language model corresponding to the slot, and a historical search state of the decoding network; and when the slot decoding is finished, jumping back to the basic decoding network to continue decoding until the last frame of the voice is decoded, and obtaining a voice recognition result. The specific decoding process of the speech recognition device 520 refers to the description corresponding to the speech recognition method shown in fig. 5a, and is not described herein again.
After the speech recognition device 520 decodes the speech recognition result corresponding to the speech, the speech recognition result is transmitted to the control device 530. Accordingly, the control device 530 will perform an operation corresponding to the voice recognition result based on the voice recognition result recognized by the voice recognition device 520. For example, when the voice recognition result is a control instruction for controlling the playing of a song, the control device 530 may search and play the song mentioned in the control instruction according to the information of the name of the singer, the name of the song, and the like in the control instruction; for another example, when the voice recognition result is a control instruction to control route navigation, the control device 530 may search for a navigation route from the current location to the destination according to the destination name in the control instruction, and perform route navigation in real time according to a change in the current location. It should be understood that when the speech recognition result is a control instruction of another type, the control device 530 may also perform other operations according to the control instruction accordingly, and the type of operations that the control device 530 can perform is not limited in any way.
In one possible implementation manner, the voice collecting device 510 and the control device 530 may be integrated on the same terminal device, such as a smart speaker, a smart robot, a smart car device, and the like, and the voice recognition device 520 may be integrated on a server capable of providing a voice recognition service. For example, the voice capturing device 510 may be a microphone, a microphone array or other components with sound receiving function, and the control device 530 may be a processor or a controller. After the voice input by the user is collected by the voice collection device 510 operating on the terminal device, the voice is sent to the voice recognition device 520 operating on the server, and after the voice recognition device 520 decodes the voice to obtain a voice recognition result, the voice recognition result is transmitted to the control device 530 operating on the terminal device, so that the control device 530 executes a corresponding operation.
In another possible implementation manner, the voice collecting device 510, the voice recognition device 520, and the control device 530 may all be integrated on the same terminal device, such as a smart speaker, a smart robot, a smart car device, and the like. Namely, the terminal equipment independently finishes the collection and recognition of voice and executes related operations according to the voice recognition result.
In yet another possible implementation manner, the voice collecting device 510 may be integrated on a terminal device, such as a smart speaker, a smart robot, a smart car device, or the like; the speech recognition device 520 and the control device 530 may each be integrated on a server providing speech recognition services as well as control instruction related services. After the voice input by the user is collected by the voice collection device 510 running on the terminal device, the voice is sent to the voice recognition device 520 running on the server, the voice recognition device 520 decodes the voice to obtain a voice recognition result, and then the voice recognition result is transmitted to the control device 530, the control device 530 can execute corresponding operations according to the voice recognition result, for example, searching for a song in the voice recognition result, searching for geographical location information of a destination in the voice recognition result, and the like, and further, the control device 530 can transmit the search result of itself to the terminal device, so that the search result is displayed to the user through the terminal device.
It should be understood that, in practical applications, the voice capturing device 510, the voice recognition device 520 and the control device 530 may be integrated on other devices in other manners, and no limitation is made to carriers of the voice capturing device 510, the voice recognition device 520 and the control device 530.
In order to further understand the decoding network generating method and the speech recognition method provided in the embodiments of the present application, the method provided in the embodiments of the present application is generally introduced below by taking an application scenario in which the method provided in the embodiments of the present application is applied to control music playing based on a speech instruction as an example.
In practical applications, the server needs to build the basic decoding network first. Specifically, the server may obtain a large amount of instruction text for controlling music playing, for example, "I want to listen to a kiss", "play white bird", and so on. The instruction texts for controlling music playing are divided into blocks at the phrase level to obtain phrase-level text samples corresponding to the instruction texts, and the blocks corresponding to variables in the phrase-level text samples are then replaced with the SLOT names corresponding to their categories; for example, the song name in the phrase-level text sample "I want to listen to a kiss" is replaced with the SLOT name SONG-SLOT corresponding to the song name category, so as to obtain phrase-level text samples carrying SLOT names. These text samples with SLOT names form the third training sample set, which is then used to train the class language model; the class language model can be an n-gram model.
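The slot-name substitution described above can be sketched as follows; the instruction texts, the song-name list, and the SLOT name are examples assumed purely for illustration.

KNOWN_SONG_NAMES = ["a kiss", "white bird"]   # instances belonging to the song-name slot

def to_slot_sample(instruction_text, slot_name="SONG-SLOT"):
    """Replace the song-name block in a phrase-level text sample with the slot name."""
    for song in KNOWN_SONG_NAMES:
        if song in instruction_text:
            return instruction_text.replace(song, slot_name)
    return instruction_text

third_training_samples = [to_slot_sample(t) for t in ["i want to listen to a kiss", "play white bird"]]
# -> ["i want to listen to SONG-SLOT", "play SONG-SLOT"]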
Then, the class language model and the general language model are subjected to interpolation processing to obtain an interpolation language model, and a related language model tool is called to cut the interpolation language model to obtain the compressed interpolation language model corresponding to it.
And then, according to the second training sample set corresponding to each slot position in the class language model, training to generate a language model corresponding to each slot position. For example, when the language model corresponding to the song name slot is trained, the language model corresponding to the song name can be obtained by training by using a second training sample set comprising a large number of song names; for another example, when the language model corresponding to the singer name slot is trained, the language model corresponding to the singer name can be trained by using the second training sample set including a large number of singer names. And after the language models corresponding to the slot positions are obtained through training, calling a relevant language model tool to cut the slot position language models corresponding to the slot positions, so as to obtain the basic compression language models corresponding to the slot positions. It should be understood that in practical applications, the slotted language model may include one slot or a plurality of slots. And then, calling a related language model tool to convert the compressed interpolation language model obtained through the operation into a WFST network, namely a basic main network, converting the basic compressed language model corresponding to each slot position obtained through the operation into a corresponding WFST network, namely a sub-network, and further embedding the sub-network corresponding to each slot position into the basic main network to obtain a basic decoding network.
And differentiating the language model corresponding to each slot position and the basic compressed language model corresponding to each slot position to obtain a differential language model corresponding to each slot position, converting the differential language model corresponding to each slot position into a WFST network or a tree structure, and correspondingly combining the WFST network or the tree structure with the basic decoding network to obtain an initial decoding network.
When a newly added vocabulary entry exists and the decoding network needs to be updated, the server can determine a slot position corresponding to the newly added vocabulary entry as a target slot position, further obtain a second training sample set corresponding to the target slot position, form a first training sample set by using the newly added vocabulary entry and each training sample in the second training sample set, obtain a language model corresponding to the target slot position by using the first training sample set for training, obtain a new differential language model corresponding to the target slot position by using the language model to perform difference on a basic compressed language model corresponding to the target slot position in a basic decoding network, and replace the differential language model corresponding to the target slot position in the original decoding network by using the differential language model, so that the original decoding network can be updated to obtain the target decoding network.
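The update flow described above can be outlined as a short sketch. The helper functions train_ngram_lm, diff_language_models and convert_to_wfst are hypothetical placeholders for the training, differencing and conversion steps described in this application; only the overall control flow is illustrated.

def update_decoding_network(decoding_network, new_entries, second_training_sets,
                            target_slot, base_compressed_lms,
                            train_ngram_lm, diff_language_models, convert_to_wfst):
    # 1. Build the first training sample set: the slot's second training sample set plus the new entries.
    first_training_set = list(second_training_sets[target_slot]) + list(new_entries)
    # 2. Train the language model corresponding to the target slot.
    slot_lm = train_ngram_lm(first_training_set)
    # 3. Difference it against the slot's basic compressed language model.
    diff_lm = diff_language_models(slot_lm, base_compressed_lms[target_slot])
    # 4. Replace only the differential language model of the target slot; the basic
    #    decoding network itself is left untouched.
    decoding_network.differential_lms[target_slot] = convert_to_wfst(diff_lm)
    return decoding_network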
In practical application, a server can obtain voice to be recognized transmitted by intelligent equipment, the voice is decoded through a basic decoding network in a decoding network, in the decoding process, when a decoding token encounters a slot position in the decoding network, decoding is performed through a differential language model corresponding to the slot position, and quadruple information consisting of the current network state of the decoding network, the identification of the differential language model corresponding to the slot position, the current network state and the historical search state is stored. After the decoding is finished through the slot position node, the basic decoding network in the decoding network is jumped back to continue the decoding until the last frame of the voice is decoded.
After the text recognition result corresponding to the voice to be recognized is determined through the voice recognition process, the server can further determine a target song to be recognized and indicated to be played by the voice based on the text recognition result, further obtain the target song from a database for storing songs, add the target song to a control instruction, and return the target song to the intelligent device, so that the intelligent device plays the target song.
For the decoding network generation method and the voice recognition method described above, the present application also provides a corresponding decoding network generation apparatus and a corresponding voice recognition apparatus, so that the decoding network generation method and the voice recognition method described above are applied and implemented in practice.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a decoding network generating apparatus 600 corresponding to the decoding network generating method shown in fig. 2, where the decoding network generating apparatus 600 includes:
a differential language model generation module 601, configured to obtain a language model corresponding to a target slot position by training according to a first training sample set, and perform a difference between the language model and a basic compressed language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position;
a decoding network construction module 602, configured to construct a target decoding network according to the slot position difference language model and a basic decoding network; the basic decoding network is generated by constructing a similar language model and a slot position basic compression language model corresponding to a slot position in the similar language model; the slot position basic compression language model is generated by cutting a language model corresponding to a slot position obtained by training through a second training sample set, the second training sample set is a subset of the first training sample set, and the target slot position is a slot position in the class language model.
Optionally, on the basis of the decoding network generating device shown in fig. 6, referring to fig. 7, fig. 7 is a schematic structural diagram of another decoding network generating device provided in the embodiment of the present application. As shown in fig. 7, the decoding network generating apparatus 700 further includes:
a similar language model generating module 701, configured to determine a third training sample set, where the third training sample set includes text samples with slot names, and the similar language model is obtained through training according to the third training sample set;
a basic compressed language model generation module 702, configured to train to obtain a language model corresponding to the slot position according to the second training sample set corresponding to the slot position in the class language model, and cut the language model corresponding to the slot position to obtain a basic compressed language model corresponding to the slot position;
a basic decoding network generating module 703, configured to generate a basic decoding network according to the class language model and the basic compressed language model corresponding to the slot.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 7, the basic decoding network generating module 703 is specifically configured to:
carrying out interpolation processing on the similar language model and the general language model to obtain an interpolation language model, and cutting the interpolation language model to obtain a compressed interpolation language model;
and generating a basic decoding network according to the compressed interpolation language model and the basic compressed language model corresponding to the slot position.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 7, the class language model generating module 701 is specifically configured to:
performing phrase level block processing on the text sample to obtain a text sample of a phrase layer;
replacing a text block belonging to the slot position category in the text sample of the phrase layer with a slot position name, and generating the third training sample set according to the text sample with the slot position name;
and training according to the text sample with the slot position name in the third training sample set to obtain a similar language model, wherein the similar language model is an n-gram language model.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 7, the basic decoding network generating module 703 is specifically configured to:
converting the class language model into a weighted finite state machine (WFST) network as a basic main network;
converting the basic compressed language model corresponding to the slot position in the similar language model into a weighted finite state machine (WFST) network as a sub-network corresponding to the slot position;
and embedding the sub-networks corresponding to the slots in the similar language model into the basic main network to obtain the basic decoding network.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 6, the decoding network constructing module 602 is specifically configured to:
converting the differential language model into a tree structure network or a weighted finite state machine (WFST) network as a first network;
and fusing the first network and the basic decoding network to generate a target decoding network.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 6, the class language model includes a plurality of slots.
Optionally, on the basis of the decoding network generating apparatus shown in fig. 6, the class language model includes an originator slot and an audio name slot.
The decoding network generating device develops a new decoding network architecture, the decoding network is composed of a basic decoding network and a differential language model corresponding to the slot position, wherein the basic decoding network is constructed and generated through a similar language model and a basic compression language model corresponding to the slot position in the similar language model. When a newly added entry needs to update the decoding network, only the differential language model corresponding to the target slot position needs to be updated. That is to say, the device can realize the updating of the decoding network by performing differential processing on the language model corresponding to the target slot position and the basic compression language model corresponding to the target slot position, does not need to reconstruct the decoding network, greatly improves the iterative updating speed of the decoding network, and ensures that the decoding network can adapt to the continuously changing application environment in time.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a speech recognition apparatus 800 corresponding to the speech recognition method shown in fig. 5, where the speech recognition apparatus 800 includes:
an obtaining module 801, configured to obtain a voice to be recognized;
the recognition module 802 is configured to decode the voice through a basic decoding network in a decoding network, decode the voice through a differential language model corresponding to a slot in the decoding network when the decoding reaches the slot in the basic decoding network in the decoding process, and store a current network state of the decoding network, an identifier and a current network state of the differential language model corresponding to the slot, and a historical search state of the decoding network; and when the slot decoding is completed, skipping to the basic decoding network to continue decoding until the last frame of the voice is decoded.
Optionally, on the basis of the speech recognition apparatus shown in fig. 8, the class language model includes an author slot and an audio name slot, see fig. 9, and fig. 9 is another speech recognition apparatus provided in this embodiment of the present application. As shown in fig. 9, the speech recognition apparatus 900 further includes:
a first instruction generating module 901, configured to generate a control instruction matched with the recognition result according to the recognition result of the voice;
a first indication module 902, configured to send the control instruction to the smart device, and instruct the smart device to play the target audio.
Optionally, on the basis of the speech recognition apparatus shown in fig. 8, the class language model includes a video name slot, see fig. 10, and fig. 10 is another speech recognition apparatus provided in this embodiment of the present application. As shown in fig. 10, the speech recognition apparatus 1000 further includes:
a second instruction generating module 1001, configured to generate, according to the recognition result of the voice, a control instruction matching the recognition result;
the second indicating module 1002 is configured to send the control instruction to the intelligent device, and instruct the intelligent device to play the target video.
Optionally, on the basis of the speech recognition apparatus shown in fig. 8, the class language model includes a geographical location name slot, see fig. 11, and fig. 11 is another speech recognition apparatus provided in this embodiment of the present application. As shown in fig. 11, the speech recognition apparatus 1100 further includes:
a third instruction generating module 1101, configured to generate, according to the recognition result of the voice, a control instruction matched with the recognition result;
a third indication module 1102, configured to send the control instruction to an intelligent device, and instruct the intelligent device to perform navigation according to a target geographic location.
The voice recognition device realizes the decoding of the differential language model corresponding to the entering slot position in the decoding process and the return of the differential language model corresponding to the leaving slot position to the basic decoding network decoding by recording the quadruple comprising the current network state of the decoding network, the identification of the slot position language model corresponding to the slot position, the current network state and the historical searching state, and ensures that the voice recognition is successfully completed based on the decoding network structure consisting of the basic decoding network and the differential language model corresponding to the slot position.
An apparatus is further provided in the embodiment of the present application, as shown in fig. 12, for convenience of description, only a portion related to the embodiment of the present application is shown, and details of the specific technology are not disclosed, please refer to the method portion of the embodiment of the present application. The terminal may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant, a Point of Sales (POS), a vehicle-mounted computer, and the like, taking the terminal as the mobile phone as an example:
fig. 12 is a block diagram illustrating a partial structure of a mobile phone related to a terminal provided in an embodiment of the present application. Referring to fig. 12, the cellular phone includes: radio Frequency (RF) circuit 1210, memory 1220, input unit 1230, display unit 1240, sensor 1250, audio circuit 1260, wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the handset configuration shown in fig. 12 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The memory 1220 may be used to store software programs and modules, and the processor 1280 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1220. The memory 1220 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. In addition, the memory 1220 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 1280 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1220 and calling data stored in the memory 1220, thereby performing overall monitoring of the mobile phone. Optionally, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into the processor 1280.
In this embodiment, the processor 1280 included in the terminal further has the following functions:
training, according to a first training sample set, a language model corresponding to a target slot position, and differencing the language model with a basic compressed language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position;
constructing a target decoding network according to the differential language model and a basic decoding network; the basic decoding network is generated from a class language model and the basic compressed language models corresponding to the slot positions in the class language model; each basic compressed language model is generated by pruning the language model corresponding to the slot position trained on a second training sample set, wherein the second training sample set is a subset of the first training sample set.
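The two functions above can be pictured with a short sketch. It assumes, purely for exposition, that a language model is a mapping from n-grams to log-probabilities and that the differential language model stores per-n-gram score differences; the function names and the dictionary representation are assumptions for illustration, not the patented implementation:

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]
LogProbLM = Dict[NGram, float]  # n-gram -> log probability (a toy stand-in for a real LM)


def build_differential_lm(slot_lm: LogProbLM, base_compressed_lm: LogProbLM) -> LogProbLM:
    """Difference the slot language model against the basic compressed language model:
    keep, for every n-gram, the score correction needed on top of the compressed model."""
    return {ngram: logp - base_compressed_lm.get(ngram, 0.0)
            for ngram, logp in slot_lm.items()}


def build_target_decoding_network(base_decoding_network: object,
                                  differential_lms: Dict[str, LogProbLM]) -> dict:
    """Attach the differential language models to the already-built basic decoding
    network; the base network is reused as-is, so adding new entries later only
    means replacing the differential model of the affected slot."""
    return {"base": base_decoding_network, "slot_diff_lms": differential_lms}
```

Applying the stored difference on top of the score already contributed by the basic compressed language model during search approximately recovers the score of the full slot language model, which is why only the differential model of the affected slot position needs to be rebuilt when new entries are added.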
Optionally, the processor 1280 is further configured to execute the steps of any implementation manner of the decoding network generation method provided in the embodiment of the present application.
In this embodiment, the processor 1280 included in the terminal further has the following functions:
acquiring a voice to be recognized;
decoding the voice through a basic decoding network in a decoding network; when the decoding reaches a slot position in the basic decoding network, decoding through the differential language model corresponding to the slot position in the decoding network, and saving the current network state of the decoding network, the identifier and current network state of the differential language model corresponding to the slot position, and the historical search state of the decoding network; and when decoding of the slot position is completed, jumping back to the basic decoding network to continue decoding until the last frame of the voice is decoded.
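A highly simplified, single-hypothesis sketch of this decode flow is shown below. It reuses the `SlotDecodeRecord` sketch given earlier, assumes hypothetical helper methods (`is_slot_entry`, `slot_at`, `slot_lm`, `step`, `slot_exit`) on a toy network object, and omits beam search, acoustic scoring, and lattice handling entirely:

```python
def decode(frames, network):
    """Decode frame by frame in the basic decoding network; detour into the slot's
    differential language model at a slot entry, and use the saved quadruple to
    jump back to the basic decoding network once the slot is finished."""
    records = []                                  # saved quadruples
    state, history = network.start_state, []
    in_slot, slot_lm, slot_state = False, None, None

    for frame in frames:
        if not in_slot and network.is_slot_entry(state):
            slot_id = network.slot_at(state)
            slot_lm = network.slot_lm(slot_id)
            slot_state = slot_lm.start_state
            # Save the quadruple: base state, slot LM id, slot LM state, search history.
            records.append(SlotDecodeRecord(state, slot_id, slot_state, list(history)))
            in_slot = True

        if in_slot:
            slot_state, word = slot_lm.step(slot_state, frame)
            history.append(word)
            if slot_lm.is_final(slot_state):
                # Slot finished: return to the basic decoding network where we left it.
                state = network.slot_exit(records[-1].base_network_state)
                in_slot = False
        else:
            state, word = network.step(state, frame)
            history.append(word)

    return history
```

In a real decoder each frame would expand many competing hypotheses rather than a single path, but the pattern of saving the quadruple on slot entry and restoring it on slot exit is the same.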
Optionally, the processor 1280 is further configured to execute the steps of any implementation manner of the speech recognition method provided in the embodiment of the present application.
Another device provided in this embodiment of the present application may be a server. Fig. 13 is a schematic structural diagram of a server provided in this embodiment of the present application. The server 1300 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1322 (e.g., one or more processors), a memory 1332, and one or more storage media 1330 (e.g., one or more mass storage devices) storing an application 1342 or data 1344. The memory 1332 and the storage medium 1330 may be transitory or persistent storage. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1322 may be arranged to communicate with the storage medium 1330 and execute, on the server 1300, the series of instruction operations in the storage medium 1330.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input-output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 13.
CPU 1322 is configured to perform the following steps:
training, according to a first training sample set, a language model corresponding to a target slot position, and differencing the language model with a basic compressed language model corresponding to the target slot position to obtain a differential language model corresponding to the target slot position;
constructing a target decoding network according to the differential language model and a basic decoding network; the basic decoding network is generated from a class language model and the basic compressed language models corresponding to the slot positions in the class language model; each basic compressed language model is generated by pruning the language model corresponding to the slot position trained on a second training sample set, wherein the second training sample set is a subset of the first training sample set.
Optionally, CPU 1322 is further configured to execute the steps of any implementation manner of the decoding network generation method provided in the embodiment of the present application.
CPU 1322 may also be configured to perform the following steps:
acquiring a voice to be recognized;
decoding the voice through a basic decoding network in a decoding network; when the decoding reaches a slot position in the basic decoding network, decoding through the differential language model corresponding to the slot position in the decoding network, and saving the current network state of the decoding network, the identifier and current network state of the differential language model corresponding to the slot position, and the historical search state of the decoding network; and when decoding of the slot position is completed, jumping back to the basic decoding network to continue decoding until the last frame of the voice is decoded.
Optionally, CPU 1322 may also be configured to perform the steps of any implementation manner of the speech recognition method provided in the embodiment of the present application.
The embodiments of the present application further provide a computer-readable storage medium for storing a computer program, where the computer program is configured to execute any one implementation of the decoding network generation method or the speech recognition method described in the foregoing embodiments.
The present application further provides a computer program product including instructions which, when run on a computer, cause the computer to execute any one of the implementations of the decoding network generation method or the speech recognition method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A decoding network generating method, comprising:
training to generate a class language model according to a third training sample set, wherein the third training sample set comprises text samples carrying slot position names;
according to a second training sample set corresponding to a slot position in the class language model, training to generate a second language model corresponding to the slot position, and pruning the second language model corresponding to the slot position to obtain a basic compressed language model corresponding to the slot position;
generating a basic decoding network according to the class language model and the basic compressed language model corresponding to the slot position;
differencing the second language model corresponding to the slot position with the basic compressed language model to obtain a second differential language model corresponding to the slot position;
and constructing a target decoding network according to the second differential language model corresponding to the slot position in the class language model and the basic decoding network.
2. The method of claim 1, further comprising:
acquiring a newly added entry corresponding to a target slot position in the class language model and combining the newly added entry with the second training sample set to generate a first training sample set; the target slot position is any slot position in the class language model;
training according to the first training sample set to generate a first language model corresponding to the target slot position, and differencing the first language model corresponding to the target slot position with the basic compressed language model corresponding to the target slot position to obtain a first differential language model corresponding to the target slot position;
and replacing the second differential language model corresponding to the target slot position in the target decoding network with the first differential language model corresponding to the target slot position, so as to update the decoding network.
3. The method of claim 1, wherein generating the basic decoding network according to the class language model and the basic compressed language model corresponding to the slot position comprises:
carrying out interpolation processing on the class language model and a general language model to obtain an interpolated language model, and pruning the interpolated language model to obtain a compressed interpolated language model;
and generating the basic decoding network according to the compressed interpolated language model and the basic compressed language model corresponding to the slot position.
4. The method of claim 1, wherein training to generate the class language model according to the third training sample set comprises:
performing phrase-level chunking on a text sample to obtain a phrase-level text sample;
replacing a text chunk belonging to a slot position category in the phrase-level text sample with the slot position name, and generating the third training sample set according to the text sample carrying the slot position name;
and training according to the text samples carrying slot position names in the third training sample set to obtain the class language model, wherein the class language model is an n-gram language model.
5. The method of claim 1, wherein generating the basic decoding network according to the class language model and the basic compressed language model corresponding to the slot position comprises:
converting the class language model into a weighted finite-state transducer (WFST) network as a basic main network;
converting the basic compressed language model corresponding to the slot position in the class language model into a weighted finite-state transducer (WFST) network as a sub-network corresponding to the slot position;
and embedding the sub-networks corresponding to the slot positions in the class language model into the basic main network to obtain the basic decoding network.
6. The method of any one of claims 1 to 5, wherein constructing the target decoding network according to the basic decoding network and the second differential language model corresponding to the slot position in the class language model comprises:
converting the second differential language model corresponding to the slot position in the class language model into a tree-structured network or a weighted finite-state transducer (WFST) network as a first network;
and fusing the first network and the basic decoding network to generate a target decoding network.
7. An intelligent voice interaction system, comprising:
the voice acquisition equipment is used for acquiring voice input by a user through a microphone;
the voice recognition device is used for decoding the voice through a basic decoding network in a target decoding network; when the decoding reaches a slot position in the basic decoding network, decoding through a second differential language model corresponding to the slot position in the target decoding network, and storing the current network state of the target decoding network, the identifier and current network state of the differential language model corresponding to the slot position, and the historical search state of the decoding network; and when decoding of the slot position is completed, jumping back to the basic decoding network to continue decoding until the last frame of the voice is decoded, so as to obtain a voice recognition result; wherein the target decoding network is generated according to the method of any one of claims 1 to 6;
and the control equipment is used for executing the operation corresponding to the voice recognition result according to the voice recognition result.
8. The system according to claim 7, wherein the control device is specifically configured to, when the voice recognition result is a control instruction for controlling playing of a target resource, search and play the target resource according to resource attribute information of the target resource carried in the control instruction.
9. The system according to claim 7, wherein the control device is specifically configured to, when the voice recognition result is a control instruction for controlling route navigation, search for a navigation route according to a destination name carried in the control instruction and perform route navigation based on the navigation route.
10. The system according to any one of claims 7 to 9, wherein the voice collecting device and the control device are integrated in the same terminal device; the voice recognition device is integrated on a server; the terminal device communicates with the server via a network.
11. The system of claim 10, wherein the terminal device is a smart speaker, a smart robot, or a smart car device.
12. The system according to any one of claims 7 to 9, wherein the voice collecting device is integrated in a terminal device, the voice recognition device and the control device are integrated in the same server, and the terminal device and the server communicate with each other through a network.
CN201910745811.7A 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium Active CN110428819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910745811.7A CN110428819B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910745811.7A CN110428819B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium
CN201910424817.4A CN110148403B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910424817.4A Division CN110148403B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110428819A CN110428819A (en) 2019-11-08
CN110428819B true CN110428819B (en) 2020-11-24

Family

ID=67592352

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910745811.7A Active CN110428819B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium
CN201910424817.4A Active CN110148403B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910424817.4A Active CN110148403B (en) 2019-05-21 2019-05-21 Decoding network generation method, voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (2) CN110428819B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610700B * 2019-10-16 2022-01-14 iFLYTEK Co., Ltd. Decoding network construction method, voice recognition method, device, equipment and storage medium
CN111063347B * 2019-12-12 2022-06-07 Anhui Tingjian Technology Co., Ltd. Real-time voice recognition method, server and client
CN111261144B * 2019-12-31 2023-03-03 Huawei Technologies Co., Ltd. Voice recognition method, device, terminal and storage medium
CN111415655B * 2020-02-12 2024-04-12 Beijing SoundAI Technology Co., Ltd. Language model construction method, device and storage medium
CN114078470A * 2020-08-17 2022-02-22 Alibaba Group Holding Limited Model processing method and device, and voice recognition method and device
CN113468303B * 2021-06-25 2022-05-17 Beike Zhaofang (Beijing) Technology Co., Ltd. Dialogue interaction processing method and computer-readable storage medium
CN114242046B * 2021-12-01 2022-08-16 Guangzhou Xiaopeng Motors Technology Co., Ltd. Voice interaction method and device, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1223739A * 1996-06-28 1999-07-21 Microsoft Corporation Method and system for dynamically adjusted training for speech recognition
CN1588536A * 2004-09-29 2005-03-02 Shanghai Jiao Tong University State structure regulating method in sound identification
CN107766559A * 2017-11-06 2018-03-06 4Paradigm (Beijing) Technology Co., Ltd. Training method, trainer, dialogue method and the conversational system of dialog model
CN108922543A (en) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 Model library method for building up, audio recognition method, device, equipment and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971685B * 2013-01-30 2015-06-10 Tencent Technology (Shenzhen) Co., Ltd. Method and system for recognizing voice commands
US9484023B2 (en) * 2013-02-22 2016-11-01 International Business Machines Corporation Conversion of non-back-off language models for efficient speech decoding
CN105185372B * 2015-10-20 2017-03-22 Baidu Online Network Technology (Beijing) Co., Ltd. Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105529027B * 2015-12-14 2019-05-31 Baidu Online Network Technology (Beijing) Co., Ltd. Audio recognition method and device
CN105869624B * 2016-03-29 2019-05-10 Tencent Technology (Shenzhen) Co., Ltd. The construction method and device of tone decoding network in spoken digit recognition
CN110364171B * 2018-01-09 2023-01-06 Shenzhen Tencent Computer Systems Co., Ltd. Voice recognition method, voice recognition system and storage medium

Also Published As

Publication number Publication date
CN110148403A (en) 2019-08-20
CN110148403B (en) 2021-04-13
CN110428819A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
CN110428819B (en) Decoding network generation method, voice recognition method, device, equipment and medium
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN106683677B (en) Voice recognition method and device
EP0954856B1 (en) Context dependent phoneme networks for encoding speech information
JP3696231B2 (en) Language model generation and storage device, speech recognition device, language model generation method and speech recognition method
CN105869629B (en) Audio recognition method and device
CN102176310B (en) Speech recognition system with huge vocabulary
CN107507615A (en) Interface intelligent interaction control method, device, system and storage medium
CN109243468B (en) Voice recognition method and device, electronic equipment and storage medium
KR20080069990A (en) Speech index pruning
CN106847265A (en) For the method and system that the speech recognition using search inquiry information is processed
CN1551103B (en) System with composite statistical and rules-based grammar model for speech recognition and natural language understanding
CN112420026A (en) Optimized keyword retrieval system
CN113113024B (en) Speech recognition method, device, electronic equipment and storage medium
CN113724718B (en) Target audio output method, device and system
CN115148212A (en) Voice interaction method, intelligent device and system
KR101905827B1 (en) Apparatus and method for recognizing continuous speech
CN109559752B (en) Speech recognition method and device
CN110570838B (en) Voice stream processing method and device
CN102298927B (en) voice identifying system and method capable of adjusting use space of internal memory
CN111508481A (en) Training method and device of voice awakening model, electronic equipment and storage medium
CN111414748A (en) Traffic data processing method and device
TWI731921B (en) Speech recognition method and device
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
KR101482148B1 (en) Group mapping data building server, sound recognition server and method thereof by using personalized phoneme

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant