WO2021051514A1 - Speech identification method and apparatus, computer device and non-volatile storage medium - Google Patents

Speech identification method and apparatus, computer device and non-volatile storage medium

Info

Publication number
WO2021051514A1
Authority
WO
WIPO (PCT)
Prior art keywords
word graph
path
search result
model
word
Application number
PCT/CN2019/116920
Other languages
French (fr)
Chinese (zh)
Inventor
李秀丰
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051514A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/081 Search algorithms, e.g. Baum-Welch or Viterbi

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a voice recognition method, device, computer equipment and non-volatile storage medium.
  • N-Gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM).
  • The Chinese language model uses the collocation information between adjacent words in context to realize automatic conversion to Chinese characters.
  • When consecutive, unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e. a sentence), the Chinese language model can compute the sentence with the greatest probability, so the conversion to Chinese characters is automatic, requires no manual selection by the user, and avoids the ambiguity of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).
  • the model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words; the probability of the entire sentence is then the product of the probabilities of its words, i.e. an N-gram context.
  • at present, most mainstream speech recognition decoders use a decoding network based on finite state transducers (WFST), which integrates the language model, the dictionary, and the acoustic shared phonetic character set into one large decoding network; during path search, the acoustic decoding dimension must also be searched, so the search volume is large.
  • the purpose of the embodiments of the present application is to provide a speech recognition method, which can reduce the search dimension of the decoding network, increase the search speed of the decoding network, and thereby increase the speed of speech recognition.
  • the embodiments of the present application provide a voice recognition method, which adopts the following technical solutions:
  • the second search result includes the second path and the corresponding second path score.
  • the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
  • the corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  • an embodiment of the present application further provides a voice recognition device, including:
  • the acquiring module is used to acquire the voice information to be recognized
  • the first search module is used to input the to-be-recognized speech information into the local first word graph model for decoding search, and obtain the first search result.
  • the first search result includes the first path and the corresponding first path score.
  • the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space;
  • the second search module is used to input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
  • the output module is used to select the corresponding second path for output according to the second path score to obtain the voice recognition result.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • the computer device includes a non-volatile memory and a processor, and executable code is stored in the non-volatile memory.
  • when the processor executes the executable code, the processor implements the steps of the voice recognition method described in any one of the embodiments of the present application.
  • the embodiments of the present application also provide a computer-readable non-volatile storage medium, which adopts the following technical solutions:
  • Executable code is stored on the computer-readable non-volatile storage medium, and when the executable code is executed by a processor, the steps of a speech recognition method according to any one of the embodiments of the present application are implemented.
  • this application inputs the voice information to be recognized into a small word graph model for acoustic decoding and search, and then directly inputs the search results into a larger word graph model for search.
  • the second search process does not need to perform acoustic decoding, so the search dimensionality is reduced, effectively lowering the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
  • FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application can be applied;
  • FIG. 2 is a schematic flowchart of a speech recognition method of the present application;
  • FIG. 3 is a schematic flowchart of another speech recognition method of the present application;
  • FIG. 4 is a schematic flowchart of step 202 in the embodiment of FIG. 2 of the present application;
  • FIG. 5 is a schematic flowchart of the construction of a first word graph model of the present application;
  • FIG. 6 is a schematic flowchart of the construction of another first word graph model of the present application;
  • FIG. 7 is a schematic flowchart of step 203 in the embodiment of FIG. 2 of the present application;
  • FIG. 8 is a schematic flowchart of step 204 in the embodiment of FIG. 2 of the present application;
  • FIG. 9 is a schematic structural diagram of a speech recognition device of the present application;
  • FIG. 10 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 11 is a schematic diagram of a specific structure of the first search module 902;
  • FIG. 12 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 13 is a schematic diagram of the specific structure of the first word graph model construction module 907;
  • FIG. 14 is a schematic diagram of a specific structure of the second search module 903;
  • FIG. 15 is a schematic diagram of a specific structure of the output module 904;
  • FIG. 16 is a block diagram of the basic structure of a computer device of the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • a voice recognition method provided in the embodiments of the present application is generally executed by a terminal device.
  • a voice recognition device is generally provided in the terminal device.
  • terminal devices, networks, and servers in FIG. 1 are only illustrative, and any number of terminal devices, networks, and servers may be provided according to implementation needs.
  • FIG. 2 a flowchart of an embodiment of a voice recognition method according to the present application is shown.
  • the above-mentioned speech recognition method includes the following steps:
  • Step 201 Acquire voice information to be recognized.
  • an electronic device (such as the terminal device shown in FIG. 1) on which a voice recognition method runs can obtain the voice information to be recognized through a wired connection or a wireless connection.
  • the above wireless connection methods can include but are not limited to 3G/4G connection, WiFi (Wireless-Fidelity) connection, Bluetooth connection, WiMAX (Worldwide Interoperability for Microwave Access) connection, Zigbee connection, UWB (ultra wideband) connection, And other wireless connection methods that are currently known or developed in the future.
  • the aforementioned voice information to be recognized can be collected through a microphone.
  • the microphone can be an external device or built into the device, for example the microphone in a voice recorder, mobile phone, tablet, MP4 player, or notebook.
  • the aforementioned voice information to be recognized may also be obtained by uploading by the user, for example, storing the collected voice in a storage device, and obtaining the corresponding voice information by reading data in the storage device.
  • the aforementioned voice information to be recognized may also be the voice information of the other party obtained when the user communicates through social software.
  • the voice information to be recognized may also be voice information that has undergone domain conversion, for example, voice information that has been converted from the time domain into the frequency domain.
  • the above-mentioned voice information may also be referred to as voice signal or voice data.
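  • As an illustration only (not part of the application): the time-domain to frequency-domain conversion above can be realized with a conventional short-time Fourier transform. A minimal sketch, assuming 25 ms frames with a 10 ms hop at a 16 kHz sampling rate, all of which are illustrative parameter choices:

```python
import numpy as np

def to_frequency_domain(samples, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames and take the magnitude
    spectrum of each frame (400/160 samples = 25 ms/10 ms at 16 kHz)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # one spectral slice per frame
    return np.array(frames)

# Example: one second of audio at 16 kHz yields 98 frames of 201 frequency bins.
spectra = to_frequency_domain(np.zeros(16000))
print(spectra.shape)  # (98, 201)
```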
  • Step 202 Input the to-be-recognized speech information into a local first word graph model for decoding search, and obtain a first search result.
  • the first search result includes the first path and the corresponding first path score.
  • the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • the aforementioned local can be an offline environment under the Linux system, and offline speech tools in other scenarios can also be configured in the offline environment.
  • the speech information to be recognized is the speech information obtained in step 201, and the first word graph model is a local word graph model; because the first word graph model is configured locally, the speech information can be decoded without going through the network, which improves the speed of speech recognition.
  • the first word graph model can be a word graph model based on WFST, and includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • the acoustic model acoustically decodes the user's speech information into phoneme units, and the pronunciation dictionary combines the phoneme units into phoneme words.
  • in the first word graph space, the phoneme words are connected into paths, forming language units.
  • the first word graph model is used to decode and search the speech information to be recognized.
  • the first search result is the search result obtained in the first word graph space.
  • the first search result includes multiple first paths, and each path includes a corresponding path score.
  • the path score is used to indicate the credibility of the path, the higher the score, the more credible the path.
  • a path is the sequence of connections between phoneme words together with the connection weights. For example, for the path 今天 (weight 0.9) 天气 (weight 0.8) 怎么样 (weight 0.9), the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648; for 近天 (weight 0.3) 天气 (weight 0.2) 怎么样 (weight 0.8), the path score is 0.3*0.2*0.8 = 0.048.
  • the weights are obtained by training the first word graph model; the training corpus can be a corpus publicly available on the Internet, such as the complete People's Daily training corpus from 2000 to 2012.
  • Step 203 Input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, wherein the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space.
  • the first search result may be the first search result in step 202 or the nbest result.
  • the acoustic model and dictionary are not configured in the second word graph model, and the first search result of the first word graph model is used as input, which can save the process of acoustic decoding.
  • the second word graph model can be a local word graph model; because the second word graph model is configured locally, the voice information can be recognized without going through the network, thereby improving the speed of speech recognition.
  • the second word graph model can be a word graph model based on wfst, and the second word graph space in the second word graph model can be a static word graph space.
  • the static word graph space is a word graph space that has already been trained and whose phoneme word weights remain unchanged; the first search result is searched through this static word graph network.
  • the second search result is the search result obtained in the second word graph model.
  • the second search result includes multiple second paths, and each path has a corresponding path score; the path score indicates the credibility of the path, and the higher the score, the more credible the path.
  • the path score is the product of the weights of the phoneme words in the path; these weights can be obtained by training the second word graph model until the loss function converges.
  • the second word graph space in the second word graph model can be organized by the user, that is, it can be smaller than a traditional word graph network, reducing the complexity of the word graph network and thereby increasing the speed of the decoding search and the real-time rate of decoding.
  • Step 204 Select a corresponding second path for output according to the second path score in the second search result, and obtain a voice recognition result.
  • the second path includes a complete sentence composed of phoneme words and a corresponding path score.
  • the path score indicates the credibility of the sentence; the higher the path score, the higher the credibility that the sentence reflects the true content of the voice information.
  • the complete sentence corresponding to the second path with the highest path score can be selected for output, thereby obtaining a speech recognition result.
  • multiple complete sentences corresponding to second paths with higher path scores can also be selected for output, thereby obtaining multiple voice recognition results for output, and the user can select from multiple voice recognition results.
  • in summary, the voice information to be recognized is acquired; the voice information to be recognized is input into the local first word graph model for decoding and search to obtain a first search result, which includes the first paths and the corresponding first path scores, the first word graph model including an acoustic model, a pronunciation dictionary, and a first word graph space; the first search result is input into the local second word graph model for searching to obtain a second search result, which includes the second paths and the corresponding second path scores, the second word graph model including a second word graph space of which the first word graph space is a sub word graph space; and the corresponding second path is selected for output according to the second path score, giving the speech recognition result.
  • because the second search process does not require acoustic decoding, the search dimensionality becomes lower, which effectively reduces the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
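  • As an illustration only: the two-pass flow summarized above can be sketched in Python as below. The FirstWordGraphModel and SecondWordGraphModel interfaces are hypothetical stand-ins for the locally configured models (the first performs acoustic decoding plus search, the second only searches the paths it is given); this is a sketch of the idea, not the patented implementation.

```python
def recognize(speech, first_model, second_model):
    """Two-pass recognition: decode and search in a small word graph space,
    then re-search the surviving paths in the larger word graph space."""
    # Pass 1: acoustic decoding + search in the first (small) word graph space.
    first_result = first_model.decode_and_search(speech)   # [(path, score), ...]

    # Pass 2: search the first-pass paths in the second (larger) word graph
    # space; no acoustic decoding is performed, which lowers the search dimension.
    second_result = second_model.search(first_result)      # [(path, score), ...]

    # Select the second path with the highest second path score as the result.
    best_path, _best_score = max(second_result, key=lambda item: item[1])
    return best_path
```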
  • the above voice recognition method further includes:
  • Step 301 Acquire current context information of the user.
  • the current context information can be determined from the time: for example, if working hours are from 9:00 to 17:00, the context can be determined to be a work context; on weekends it can be determined to be a vacation context; and after 22:00 and before 8:00 it can be determined to be a rest context. It can also be determined from how the voice to be recognized was acquired: for example, if the voice to be recognized comes from a friend on WeChat, the context can be determined to be a friends' chat context, and if it comes from a user noted as a customer in WeChat or other social software, it can be determined to be a work context. In a possible implementation, the context may also be specified by the user, and context information selected by the user personally can be more accurate.
  • Step 302 Select the corresponding first word graph model according to the user's current context information to decode and search the voice information.
  • the first word graph model may be a first word graph model with context attributes, each first word graph model corresponding to one or more context attributes; the context information obtained in step 301 is matched to the corresponding first word graph model. Matching the context information to the corresponding first word graph model makes the results obtained by the first word graph model better suited to the context and improves accuracy.
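  • As an illustration only: one way to realize this context matching is a simple lookup from a context label to a locally configured first word graph model. The sketch below encodes the time and source rules described above; the label names, the "customer_chat"/"friend_chat" sources, and the dictionary-based lookup are assumptions made for illustration.

```python
from datetime import datetime

def infer_context(now: datetime, source: str) -> str:
    """Derive a coarse context label from the current time and the message source."""
    if source == "customer_chat":        # sender is noted as a customer -> work context
        return "work"
    if source == "friend_chat":          # voice from a friend on social software
        return "friends_chat"
    if now.weekday() >= 5:               # Saturday or Sunday -> vacation context
        return "vacation"
    if 9 <= now.hour < 17:               # working hours -> work context
        return "work"
    if now.hour >= 22 or now.hour < 8:   # late night / early morning -> rest context
        return "rest"
    return "default"

def select_first_model(models_by_context: dict, context: str):
    """Pick the first word graph model whose context attribute matches."""
    return models_by_context.get(context, models_by_context["default"])
```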
  • the above-mentioned first search result is a path result of at least one path
  • Step 401 Obtain the path result of the first path and the corresponding first path score through decoding search.
  • Step 402 According to the first path score from high to low, m path results from n path results are sequentially selected for output, to obtain a first search result, where m is less than or equal to n.
  • by decoding search, the scores of the search results (the first paths) under the first word graph model are obtained, that is, at least one first path is scored; n search results (first paths) correspond to n scores, and the n-best results sorted by these scores are obtained as the first search result.
  • the first search results may be sorted according to nbest scores, that is, the search results corresponding to the highest first path score are ranked first.
  • the input amount of the second word graph model can be reduced.
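  • As an illustration only: steps 401 and 402 amount to keeping the m best of the n decoded paths. A minimal sketch, assuming each path result is a (path, score) pair:

```python
def select_nbest(path_results, m):
    """Sort the n first-path results by score (high to low) and keep the top m,
    where m <= n; the returned n-best list is what is fed into the second word
    graph model, reducing the amount of input that model has to search."""
    ranked = sorted(path_results, key=lambda item: item[1], reverse=True)
    return ranked[:m]
```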
  • the construction of the above-mentioned first word graph model includes the following steps:
  • Step 501 Extract a word graph unit from the pre-built second word graph space, and construct a first word graph space according to the word graph unit.
  • Step 502 Construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • the second word graph space in the second word graph model may be configured through a local dictionary, or may be a word graph space pre-downloaded to the local.
  • the word graph unit may include a language unit and a corresponding weight; the language unit can be understood as a phoneme word in the first search result.
  • the word graph unit can also be understood as a word graph path.
  • word graph units with various context attributes can be extracted from the second word graph space to construct first word graph spaces for different contexts, so that the search and decoding range of the voice information in the first word graph model becomes smaller, thereby improving the speed at which the first word graph model decodes the speech information.
  • the above steps can be understood as pruning the second word graph space to obtain the first word graph space. It should be understood that the number of the aforementioned first word graph models may be one or more.
  • the first word graph space can be augmented to add word graph units with similar context attributes to expand the first word graph space into a second word graph space.
  • the weight of each language unit in the first word graph space obtained after pruning changes as the first word graph model is trained, so the weight of the same language unit differs between the first word graph space and the second word graph space; that is, searching the same path in the first word graph model and in the second word graph model yields different path scores.
  • the first word graph space is constructed by extracting word graph units with the same attributes from the second word graph space, which avoids a mismatch between the first search result and the second word graph model that would cause recognition errors.
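  • As an illustration only: a minimal sketch of this pruning/extraction step, assuming the second word graph space is represented as a list of word graph units, each carrying a set of context attributes and a weight (the application does not fix a data structure, so this representation is an assumption):

```python
def build_first_word_graph_space(second_space, context):
    """Extract the word graph units whose context attributes contain the given
    context, yielding a sub word graph space of the second word graph space."""
    return [unit for unit in second_space if context in unit["contexts"]]
```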
  • the construction of the above-mentioned first word graph model further includes the following steps:
  • Step 601 Train the first word graph model to fit the loss function to obtain the weight of the word graph unit in the first word graph space.
  • when the word graph unit is a language unit, the language units used to construct the first word graph model can be combined according to the word graph combination relationships in the second word graph model, and the first word graph model can be trained to adjust the weights of the language units; the new word graph space obtained in this way serves as the word graph space of the first word graph model.
  • the scoring result of the first word graph path can be adjusted by training the first word graph model.
  • when the first word graph space is constructed by extracting word graph units from the second word graph space, the first word graph model can additionally be trained to improve its recognition accuracy without being affected by the second word graph space.
  • step 203 specifically includes:
  • Step 701 Extract the word graph unit in the first search result.
  • Step 702 Input the word graph unit in the first search result into the second word graph model for searching.
  • when the word graph unit is a language unit, the language unit can be input into the second word graph model for search, obtaining the second word graph path of the corresponding word graph unit in the second word graph model and the corresponding path score.
  • when the word graph unit is a first word graph path, the first word graph path is decomposed in the second word graph model to obtain its language units, and the language units are then searched against the paths in the second word graph space to obtain the second word graph path and the corresponding path score.
  • when the word graph unit is a first word graph path, these first word graph paths are input into the second word graph model and matched against the second word graph paths in the second word graph space; since the same path may have different path scores in the first word graph space and the second word graph space, this amounts to a wide-area verification of the first search result in the second word graph space, ensuring the accuracy of the speech recognition result.
  • the first search result is searched in the second word graph space in the form of a word graph unit, without acoustically decoding the speech information to be recognized, the search dimension is reduced, and the speed of speech recognition is improved.
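  • As an illustration only: a minimal sketch of the second-pass search of steps 701 and 702, assuming the second word graph space is represented as a mapping from pairs of adjacent phoneme words to trained connection weights, with path scores computed as products of weights as described above; this representation and the fallback weight for unseen connections are illustrative assumptions.

```python
def second_pass_search(first_search_result, arc_weights):
    """Re-score each first-pass path in the (larger) second word graph space.

    first_search_result: list of (path, first_score) pairs, a path being a
                         sequence of phoneme words (language units).
    arc_weights:         dict mapping (previous_word, word) to the trained
                         weight of that connection in the second word graph space.
    """
    second_result = []
    for path, _first_score in first_search_result:
        score = 1.0
        for prev_word, word in zip(path, path[1:]):
            # Unseen connections get a very small weight instead of failing outright.
            score *= arc_weights.get((prev_word, word), 1e-9)
        second_result.append((path, score))
    return second_result
```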
  • step 204 specifically includes:
  • Step 801 Sort the second path according to the score of the second path.
  • Step 802 Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
  • the second path with a high score can be ranked first, and the second path with a low score can be ranked behind.
  • in this way, selecting the complete sentences corresponding to the second word graph paths for output is more intuitive: for example, if only one result is to be output, the complete sentence corresponding to the first-ranked second word graph path is extracted for output; if multiple results are to be output, the top-ranked ones are extracted for output so that the user can choose among them.
  • the second paths are sorted before output; outputting the complete sentences in sorted order makes the voice recognition result more convenient and intuitive.
  • the aforementioned non-volatile storage medium may be, for example, a magnetic disk, an optical disk, a read-only memory (ROM), and the like.
  • this application provides an embodiment of a speech recognition device.
  • this device embodiment corresponds to the method embodiment described above, and the device can be applied to various electronic devices.
  • a speech recognition device 900 of this embodiment includes a first acquisition module 901, a first search module 902, a second search module 903, and an output module 904, wherein:
  • the first acquiring module 901 is configured to acquire voice information to be recognized
  • the first search module 902 is configured to input the to-be-recognized voice information into the local first word graph model for decoding search to obtain a first search result, the first search result including the first path and the corresponding first path score,
  • the first word graph model includes acoustic model, pronunciation dictionary and first word graph space;
  • the second search module 903 is configured to input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
  • the output module 904 is configured to select a corresponding second path for output according to the second path score to obtain a voice recognition result.
  • the first word graph model is at least one first word graph model configured locally, and the speech recognition apparatus 900 further includes a second acquisition module 905 and a selection module 906, wherein:
  • the second obtaining module 905 is used to obtain the current context information of the user
  • the selection module 906 is configured to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
  • the first search result is a path result of at least one path
  • the first search module 902 includes a decoding search unit 9021 and a first output unit 9022, wherein:
  • the decoding search unit 9021 is configured to obtain the path result of the first path and the corresponding first path score through decoding search;
  • the first output unit 9022 is configured to sequentially select m path results among the n path results according to the first path score from high to low for output to obtain the first search result, where m is less than or equal to n.
  • the speech recognition device 900 further includes a first word graph model construction module 907, and the first word graph model construction module 907 includes a first extraction unit 9071 and a construction unit 9072, wherein:
  • the first extraction unit 9071 is configured to extract the word graph unit from the pre-built second word graph space, and construct the first word graph space according to the word graph unit;
  • the construction unit 9072 is configured to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • the first word graph model construction module 907 further includes a training unit 9073, wherein:
  • the training unit 9073 is configured to train the first word graph model until the loss function converges, obtaining the weights of the word graph units in the first word graph space.
  • the second search module 903 includes a second extraction unit 9031 and an input unit 9032, wherein:
  • the second extraction unit 9031 is used to extract the word map unit in the first search result
  • the input unit 9032 is configured to input the word graph unit in the first search result into the second word graph model for search.
  • the output module 904 includes a sorting unit 9041 and a second output unit 9042, wherein:
  • the sorting unit 9041 is configured to sort the second path according to the score of the second path
  • the second output unit 9042 is configured to output the voice recognition results corresponding to the y second paths in order, where y is greater than or equal to 1.
  • the voice recognition device provided in the embodiment of the present application can implement the various implementation manners in the method embodiments of FIG. 2 to FIG. 8 and the corresponding beneficial effects. To avoid repetition, details are not described herein again.
  • FIG. 16 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 16 includes a non-volatile memory 161, a processor 162, and a network interface 163 that are communicatively connected to each other through a system bus. It should be pointed out that only the computer device 16 with components 161-163 is shown in the figure, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and so on.
  • the computer equipment can be computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers.
  • the computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
  • the non-volatile memory 161 includes at least one type of readable non-volatile storage medium.
  • the readable non-volatile storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the non-volatile memory 161 may be an internal storage unit of the computer device 16, such as a hard disk or memory of the computer device 16.
  • the non-volatile memory 161 may also be an external storage device of the computer device 16, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 16.
  • the non-volatile memory 161 may also include both the internal storage unit of the computer device 16 and its external storage device.
  • the non-volatile memory 161 is generally used to store an operating system and various application software installed in the computer device 16, such as executable code of a voice recognition method.
  • the non-volatile memory 161 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 162 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 162 is generally used to control the overall operation of the computer device 16.
  • the processor 162 is configured to run executable codes or process data stored in the non-volatile memory 161, for example, run executable codes for a voice recognition method.
  • the network interface 163 may include a wireless network interface or a wired network interface, and the network interface 163 is generally used to establish a communication connection between the computer device 16 and other electronic devices.
  • This application also provides another implementation manner, that is, a computer-readable non-volatile storage medium is provided.
  • the computer-readable non-volatile storage medium stores executable code for speech recognition, and the executable code can be executed by at least one processor, so that the at least one processor executes the steps of the speech recognition method described above.
  • the technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium (such as a ROM, a magnetic disk, or an optical disc) and includes a number of instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the voice recognition method of the various embodiments of the present application.

Abstract

A speech identification method and apparatus, a computer device and a non-volatile storage medium, which relate to the technical field of artificial intelligence. The method comprises: acquiring speech information to be identified (201); inputting the speech information to be identified into a local first word graph model to perform decoding searching so as to obtain a first search result (202), the first search result comprising a first path and a corresponding first path score, and the first word graph model comprising an acoustic model, a pronunciation dictionary and a first word graph space; inputting the first search result into a local second word graph model for searching so as to obtain a second search result (203), the second search result comprising a second path and a corresponding second path score, the second word graph model comprising a second word graph space, and the first word graph space being a sub-word graph space of the second word graph space; and according to the second path score in the second search result, selecting the corresponding second path for output so as to obtain a speech identification result (204). By using the described method, the dimensions of searching are lowered and the amount of word graph searching is reduced, thus search time is shortened, and the speed of speech identification is increased.

Description

Speech recognition method, device, computer equipment and non-volatile storage medium
This application is based on, and claims priority from, the Chinese invention patent application No. 201910894996.8 filed on September 20, 2019, titled "A speech recognition method, device, computer equipment and storage medium".
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a speech recognition method, device, computer equipment and non-volatile storage medium.
Background
N-Gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context to realize automatic conversion to Chinese characters.
When consecutive, unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e. a sentence), the Chinese language model can compute the sentence with the greatest probability, realizing automatic conversion to Chinese characters without manual selection by the user and avoiding the ambiguity of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).
The model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words; the probability of the entire sentence is then the product of the probabilities of its words, i.e. an N-gram context.
At present, most mainstream speech recognition decoders use a decoding network based on finite state transducers (WFST), which integrates the language model, the dictionary and the acoustic shared phonetic character set into one large decoding network. During path search, the acoustic decoding dimension must also be searched, so the search volume is large.
Summary of the Invention
The purpose of the embodiments of this application is to provide a speech recognition method that can reduce the search dimension of the decoding network, increase the search speed of the decoding network, and thereby increase the speed of speech recognition.
To solve the above technical problem, an embodiment of this application provides a speech recognition method that adopts the following technical solution:
The method includes the following steps:
acquiring the speech information to be recognized;
inputting the speech information to be recognized into a local first word graph model for decoding search to obtain a first search result, where the first search result includes first paths and corresponding first path scores, and the first word graph model includes an acoustic model, a pronunciation dictionary and a first word graph space;
inputting the first search result into a local second word graph model for searching to obtain a second search result, where the second search result includes second paths and corresponding second path scores, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
selecting the corresponding second path for output according to the second path scores in the second search result, to obtain the speech recognition result.
To solve the above technical problem, an embodiment of this application further provides a speech recognition device, including:
an acquisition module, configured to acquire the speech information to be recognized;
a first search module, configured to input the speech information to be recognized into a local first word graph model for decoding search to obtain a first search result, where the first search result includes first paths and corresponding first path scores, and the first word graph model includes an acoustic model, a pronunciation dictionary and a first word graph space;
a second search module, configured to input the first search result into a local second word graph model for searching to obtain a second search result, where the second search result includes second paths and corresponding second path scores, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
an output module, configured to select the corresponding second path for output according to the second path scores, to obtain the speech recognition result.
To solve the above technical problem, an embodiment of this application further provides a computer device that adopts the following technical solution:
The computer device includes a non-volatile memory and a processor; executable code is stored in the non-volatile memory, and when the processor executes the executable code, the processor implements the steps of the speech recognition method described in any one of the embodiments of this application.
To solve the above technical problem, an embodiment of this application further provides a computer-readable non-volatile storage medium that adopts the following technical solution:
Executable code is stored on the computer-readable non-volatile storage medium, and when the executable code is executed by a processor, the steps of the speech recognition method described in any one of the embodiments of this application are implemented.
Compared with the prior art, this application inputs the speech information to be recognized into a small word graph model for acoustic decoding and search, and then inputs the search result directly into a larger word graph model for search. The second search process does not need to perform acoustic decoding again, so the search dimensionality becomes lower, which effectively reduces the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
Brief Description of the Drawings
In order to explain the solution of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic diagram of an exemplary system architecture to which this application can be applied;
FIG. 2 is a schematic flowchart of a speech recognition method of this application;
FIG. 3 is a schematic flowchart of another speech recognition method of this application;
FIG. 4 is a schematic flowchart of step 202 in the embodiment of FIG. 2 of this application;
FIG. 5 is a schematic flowchart of the construction of a first word graph model of this application;
FIG. 6 is a schematic flowchart of the construction of another first word graph model of this application;
FIG. 7 is a schematic flowchart of step 203 in the embodiment of FIG. 2 of this application;
FIG. 8 is a schematic flowchart of step 204 in the embodiment of FIG. 2 of this application;
FIG. 9 is a schematic structural diagram of a speech recognition device of this application;
FIG. 10 is a schematic structural diagram of another speech recognition device of this application;
FIG. 11 is a schematic diagram of a specific structure of the first search module 902;
FIG. 12 is a schematic structural diagram of another speech recognition device of this application;
FIG. 13 is a schematic diagram of the specific structure of the first word graph model construction module 907;
FIG. 14 is a schematic diagram of a specific structure of the second search module 903;
FIG. 15 is a schematic diagram of a specific structure of the output module 904;
FIG. 16 is a block diagram of the basic structure of a computer device of this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having" in the specification and claims of this application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims and drawings are used to distinguish different objects, not to describe a specific order.
Reference to an "embodiment" herein means that a specific feature, structure or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below in conjunction with the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types such as wired links, wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on. Various communication client applications, such as web browsers, shopping applications, search applications, instant messaging tools, mailbox clients and social platform software, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices with display screens that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the speech recognition method provided in the embodiments of this application is generally executed by a terminal device; correspondingly, the speech recognition device is generally provided in the terminal device.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are only illustrative; any number of terminal devices, networks and servers may be provided according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of a speech recognition method according to this application is shown. The speech recognition method includes the following steps:
Step 201: acquire the speech information to be recognized.
In this embodiment, the electronic device on which the speech recognition method runs (for example, the terminal device shown in FIG. 1) can acquire the speech information to be recognized through a wired connection or a wireless connection. It should be pointed out that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi (Wireless-Fidelity) connection, a Bluetooth connection, a WiMAX (Worldwide Interoperability for Microwave Access) connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection methods that are currently known or developed in the future.
The speech information to be recognized can be collected through a microphone; the microphone may be an external device or built into the device, for example the microphone in a voice recorder, mobile phone, tablet, MP4 player or notebook. Alternatively, the speech information to be recognized may be uploaded by the user, for example by storing the collected speech in a storage device and reading the data in the storage device to obtain the corresponding speech information. The speech information to be recognized may also be the other party's speech information obtained when the user communicates through social software.
In a possible implementation, the speech information to be recognized may also be speech information that has undergone domain conversion, for example speech information that has been converted from the time domain into the frequency domain.
The speech information may also be referred to as a speech signal or speech data.
步骤202,将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间。Step 202: Input the to-be-recognized speech information into a local first word graph model for decoding search, and obtain a first search result. The first search result includes the first path and the corresponding first path score. The first word graph model Including acoustic model, pronunciation dictionary and first word image space.
其中,上述的本地可以是Linux系统下的离线环境,该离线环境中还可以配置其他场景的离线语音工具,上述的待识别语音信息为步骤201中的获取的待识别语音信息,上述的第一词图模型为本地的词图模型,将第一词图模型配置在本地,可以不通过网络就可以对语音信息进行解码,从而提高了语音识别的速度。第一词图模型可以是基于wfst的词图模型, 第一词图模型中包括声学模型、发音词典及第一词图空间,上述的声学模型可以对用户语音信息进行声学解码,使语音信息解码形成音素单元,上述的发音词典用于将音素单元进行组合,形成音素词,上述的第一词图空间中,各音素词连接成路径,形成语言单元。通过第一词图模型对待识别语音信息进行解码搜索,第一搜索结果为第一词图空间中得到的搜索结果,第一搜索结果包括多个第一路径,每条路径包括对应的路径分数,路径分数用于表示该条路径的可信程度,分数越高,则表示该条路径越可信。Wherein, the aforementioned local can be an offline environment under the Linux system, and offline speech tools in other scenarios can also be configured in the offline environment. The aforementioned speech information to be recognized is the speech information to be recognized obtained in step 201, and the aforementioned first The word graph model is a local word graph model. If the first word graph model is configured locally, the speech information can be decoded without going through the network, thereby improving the speed of speech recognition. The first word graph model can be a word graph model based on wfst. The first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space. The above-mentioned acoustic model can acoustically decode the user's speech information, so that the speech information can be decoded. A phoneme unit is formed. The above pronunciation dictionary is used to combine phoneme units to form phoneme words. In the first word graph space mentioned above, each phoneme word is connected to form a path to form a language unit. The first word graph model is used to decode and search the speech information to be recognized. The first search result is the search result obtained in the first word graph space. The first search result includes multiple first paths, and each path includes a corresponding path score. The path score is used to indicate the credibility of the path, the higher the score, the more credible the path.
其中,路径为各音素词的连接及连接权重,比如:Among them, the path is the connection and connection weight of each phoneme word, such as:
今天(权重0.9)天气(权重0.8)怎么样(权重0.9)，该路径评分为所有权重的积，0.9*0.8*0.9=0.648。今天 "today" (weight 0.9), 天气 "weather" (weight 0.8), 怎么样 "how about it" (weight 0.9): the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648.
近天(权重0.3)天气(权重0.2)怎么样(权重0.8)，该路径评分为所有权重的积，0.3*0.2*0.8=0.048。近天 "recent days" (weight 0.3), 天气 "weather" (weight 0.2), 怎么样 "how about it" (weight 0.8): the path score is the product of all the weights, 0.3*0.2*0.8 = 0.048.
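The path-score arithmetic above is straightforward to write out. The following minimal sketch recomputes the two example scores; the helper function and the weight values are purely illustrative (the weights from the example above, not trained values):

    # Minimal sketch: a path's score is the product of the connection weights
    # along its phoneme-word sequence. The weights are the illustrative values
    # from the example above, not trained values.
    def path_score(weights):
        score = 1.0
        for w in weights:
            score *= w
        return score

    print(path_score([0.9, 0.8, 0.9]))  # 今天 天气 怎么样  -> ≈ 0.648
    print(path_score([0.3, 0.2, 0.8]))  # 近天 天气 怎么样  -> ≈ 0.048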
上述的权重由对第一词图模型进行训练得到,训练语料可以是网上公开的训练语料,比如《人民日报》2000年到2012年的全部训练语料。The above-mentioned weights are obtained by training the first word graph model, and the training corpus can be the training corpus publicly available on the Internet, such as all the training corpus of the People's Daily from 2000 to 2012.
步骤203，将第一搜索结果输入本地的第二词图模型中进行搜索，得到第二搜索结果，第二搜索结果包括第二路径以及对应第二路径分数，其中，第二词图模型包括第二词图空间，第一词图空间为第二词图空间的子词图空间。Step 203: Input the first search result into the local second word graph model for searching to obtain a second search result. The second search result includes second paths and corresponding second path scores, where the second word graph model includes a second word graph space and the first word graph space is a sub-word-graph space of the second word graph space.
在本实施例中,第一搜索结果可以是步骤202中的第一搜索结果,也可以是nbest结果。需要说明的是,第二词图模型中不配置声学模型与词典,使用第一词图模型的第一搜索结果做为输入,可以省去声学解码的过程,第二词图模型可以为本地的词图模型,将第二词图模型配置在本地,可以不用通过网络就可以对语音信息进行识别,从而提高了语音识别的速度。第二词图模型可以是基于wfst的词图模型,第二词图模型中的第二词图空间可以是静态词图空间,上述的静态词图空间表示已经训练好的,音素词权重不变的词图空间,通过静态词图网络对第一搜索结果进行搜索,第二搜索结果为第二词图模型中得到的搜索结果,第二搜索结果包括多个第二路径,每条路径包括对应的路径分数,路径分数用于表示该条路径的可信程度,分数越高,则表示该条路径越可信。路径分数为该路径中音素词权重的积,上述音素词权重可以通过对第二词图模型进行训练,直到损失函数拟合,即可得到音素词的权重。In this embodiment, the first search result may be the first search result in step 202 or the nbest result. It should be noted that the acoustic model and dictionary are not configured in the second word graph model, and the first search result of the first word graph model is used as input, which can save the process of acoustic decoding. The second word graph model can be local The word graph model, the second word graph model is configured locally, and the voice information can be recognized without going through the network, thereby improving the speed of speech recognition. The second word graph model can be a word graph model based on wfst, and the second word graph space in the second word graph model can be a static word graph space. The above static word graph space means that it has been trained and the phoneme word weight remains unchanged In the word graph space, the first search result is searched through the static word graph network. The second search result is the search result obtained in the second word graph model. The second search result includes multiple second paths, and each path includes the corresponding The path score is used to indicate the credibility of the path. The higher the score, the more credible the path. The path score is the product of the weights of phonemes in the path. The weights of the phonemes can be obtained by training the second word graph model until the loss function is fitted.
可选的，第二词图模型中的第二词图空间可以由用户进行整理得到，即第二词图模型中的第二词图空间可以小于传统的词图网络，降低词图网络的复杂度，从而提高解码搜索的速度，提高解码的实时率。Optionally, the second word graph space in the second word graph model can be obtained through curation by the user; that is, the second word graph space in the second word graph model can be smaller than a traditional word graph network, reducing the complexity of the word graph network, thereby increasing the speed of the decoding search and improving the real-time rate of decoding.
步骤204,根据第二搜索结果中的第二路径分数选择对应的第二路径进行输出,得到语音识别结果。Step 204: Select a corresponding second path for output according to the second path score in the second search result, and obtain a voice recognition result.
在本实施例中，第二路径包括音素词组成的完整语句以及对应的路径分数，路径分数用来表示该语句的可信度，路径分数越高，则语句为语音信息的真实内容的可信度越高。可以选取路径分数最高的第二路径所对应的完整语句进行输出，从而得到一个语音识别结果。另外，也可以选取多个路径分数较高的第二路径所对应的完整语句进行输出，从而得到多个语音识别结果进行输出，用户可以从多个语音识别结果中进行选取。In this embodiment, a second path includes a complete sentence composed of phoneme words and a corresponding path score. The path score indicates the credibility of the sentence: the higher the path score, the more credible it is that the sentence is the true content of the speech information. The complete sentence corresponding to the second path with the highest path score can be selected for output, yielding one speech recognition result. Alternatively, the complete sentences corresponding to several second paths with higher path scores can be output, yielding multiple speech recognition results from which the user can choose.
在本实施例中,获取待识别语音信息;将待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。通过将待识别语音信息输入一个小的词图模型中进行声学解码以及搜索,再将搜索结果直接输入到较大的词图模型中进行搜索,二次搜索过程无需再进行声学解码,可以使搜索的维度变低,有效降低词图搜索的量,从而降低搜索的时间,提高语音识别的速度。In this embodiment, the voice information to be recognized is acquired; the voice information to be recognized is input into the local first word graph model for decoding and search, and the first search result is obtained. The first search result includes the first path and the corresponding first path Score, the first word graph model includes acoustic model, pronunciation dictionary and first word graph space; input the first search result into the local second word graph model to search, get the second search result, the second search result includes the second Paths and corresponding second path scores, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space; the corresponding second path is selected according to the second path score Output, get the result of speech recognition. By inputting the speech information to be recognized into a small word graph model for acoustic decoding and searching, and then directly inputting the search results into a larger word graph model for searching, the second search process does not require acoustic decoding, which can make the search The dimensionality becomes lower, which effectively reduces the amount of word map search, thereby reducing the search time and improving the speed of speech recognition.
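To keep the two-pass structure of steps 201 to 204 in view, a schematic sketch is given below. The first_model and second_model objects and their decode and rescore methods are placeholders assumed for illustration only; they do not name the API of any actual WFST toolkit.

    # Schematic sketch of the two-pass flow of steps 201-204. The model objects
    # and their decode()/rescore() methods are assumed placeholders for a
    # WFST-style decoder, not a real toolkit API.
    def recognize(audio, first_model, second_model, m=100, y=1):
        # Pass 1: acoustic decoding + search in the small local first word graph
        # model; each hypothesis is a (word_sequence, path_score) pair.
        first_results = first_model.decode(audio)
        nbest = sorted(first_results, key=lambda h: h[1], reverse=True)[:m]

        # Pass 2: search the m kept hypotheses in the larger second word graph
        # space; no acoustic decoding happens here, only path scoring.
        second_results = [(path, second_model.rescore(path)) for path, _ in nbest]
        second_results.sort(key=lambda h: h[1], reverse=True)

        # Step 204: output the y highest-scoring second paths as the result.
        return [path for path, _ in second_results[:y]]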
进一步的,如图3所示,在步骤202之前,上述语音识别方法还包括:Further, as shown in FIG. 3, before step 202, the above voice recognition method further includes:
步骤301,获取用户当前的语境信息。Step 301: Acquire current context information of the user.
上述当前的语境信息可以根据时间进行确定，比如在9点到17点为工作时间，则可以确定语境为工作语境，在周末，则可以确定为休假语境，在22点以后8点之前，则可以确定为休息语境。也可以根据待识别语音的获取来进行确定，比如待识别语音由微信好友处获取，则可以确定为朋友聊天语境，待识别语音由微信或其他社交软件中备注为客户的用户处获取，则可以确定为工作语境。在一种可能的实施方式中，用户的语境也可以由用户自行进行确定，通过用户自行选取语境，得到的语境信息更精确。The above current context information can be determined according to time. For example, during working hours from 9:00 to 17:00 the context can be determined to be a work context, on weekends it can be determined to be a vacation context, and after 22:00 and before 8:00 it can be determined to be a rest context. It can also be determined according to how the speech to be recognized was obtained. For example, if the speech to be recognized is obtained from a WeChat friend, the context can be determined to be a friend-chat context; if it is obtained from a user noted as a customer in WeChat or other social software, it can be determined to be a work context. In a possible implementation, the context may also be determined by the user; context information selected by the user directly is more accurate.
步骤302,根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。Step 302: Select the corresponding first word graph model according to the user's current context information to decode and search the voice information.
在本实施例中，上述的第一词图模型可以是具有语境属性的第一词图模型，每个第一词图模型对应一个或多个语境属性，可以通过步骤301中获取到的语境信息匹配对应的第一词图模型。通过语境信息匹配到对应的第一词图模型，可以使第一词图模型所得到的结果更贴合语境，提高精准度。In this embodiment, the above first word graph model may be a first word graph model with context attributes. Each first word graph model corresponds to one or more context attributes, and the corresponding first word graph model can be matched using the context information obtained in step 301. Matching the context information to the corresponding first word graph model makes the results produced by the first word graph model fit the context better and improves accuracy.
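As a concrete illustration of how the context lookup of steps 301 and 302 might be organized, the sketch below maps a time of day or a message source to a context label and then to a model name. The context labels, time windows, and registry contents are assumptions made for illustration and are not values given in this application.

    from datetime import datetime

    # Illustrative sketch of steps 301-302: infer the current context and pick
    # the matching first word graph model. All labels and thresholds here are
    # assumptions for illustration only.
    CONTEXT_TO_MODEL = {
        "work": "first_wordgraph_work",
        "chat": "first_wordgraph_chat",
        "vacation": "first_wordgraph_vacation",
        "rest": "first_wordgraph_rest",
    }

    def infer_context(now=None, source=None):
        now = now or datetime.now()
        if source == "customer_contact":      # speech from a contact noted as a customer
            return "work"
        if source == "friend_contact":        # speech from a friend on social software
            return "chat"
        if now.weekday() >= 5:                # weekend
            return "vacation"
        if 9 <= now.hour < 17:                # working hours
            return "work"
        if now.hour >= 22 or now.hour < 8:    # late night / early morning
            return "rest"
        return "chat"

    def select_first_model(now=None, source=None):
        return CONTEXT_TO_MODEL[infer_context(now, source)]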
进一步的，如图4所示，上述第一搜索结果为至少一个路径的路径结果，将所述待识别语音信息输入本地的第一词图模型中进行解码搜索，得到第一搜索结果的步骤具体包括：Further, as shown in FIG. 4, the above first search result is a path result of at least one path, and the step of inputting the to-be-recognized speech information into the local first word graph model for decoding search to obtain the first search result specifically includes:
步骤401,通过解码搜索获取第一路径的路径结果以及对应的第一路径分数。Step 401: Obtain the path result of the first path and the corresponding first path score through decoding search.
步骤402,根据第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。Step 402: According to the first path score from high to low, m path results from n path results are sequentially selected for output, to obtain a first search result, where m is less than or equal to n.
本实施例中，通过对语音信息在第一词图模型中进行解码搜索，可以得到第一词图模型下的搜索结果（第一路径）的评分，即是对至少一个第一路径评分，具体的，n个搜索结果（第一路径）对应有n个评分，得到根据评分排序的nbest结果做为第一搜索结果。In this embodiment, by decoding and searching the speech information in the first word graph model, the scores of the search results (first paths) under the first word graph model can be obtained, that is, at least one first path is scored. Specifically, n search results (first paths) correspond to n scores, and the nbest results sorted by score are obtained as the first search result.
例如:对待识别语音信息为“今天天气怎么样”在第一词图模型中的进行搜索,这样经过第一词图模型解码后会给出200个nbest的解码结果:For example: search the first word graph model for the speech information to be recognized as "what's the weather today", so that after the first word graph model is decoded, it will give 200 nbest decoding results:
今天 天气 怎么样How's the weather today
近天 天气 怎么样How is the weather in recent days
今天 填起 怎么样How about filling in today
假设一共200个nbest结果;Assuming a total of 200 nbest results;
通过第一词图模型得到200个nbest(200best)结果,则可以选取100个或者全部200个nbest结果做为第一搜索结果。此时,n为200,m为100。If 200 nbest (200best) results are obtained through the first word graph model, 100 or all 200 nbest results can be selected as the first search result. At this time, n is 200 and m is 100.
在一种可能的实现方式,可以将第一搜索结果按nbest打分进行排序,即将第一路径分数最高对应的搜索结果排在前面。In a possible implementation manner, the first search results may be sorted according to nbest scores, that is, the search results corresponding to the highest first path score are ranked first.
本实施例中,通过取nbest结果中的m个第一搜索结果,作为第二词图模型的输入,可以减少第二词图模型的输入量。In this embodiment, by taking the m first search results in the nbest results as the input of the second word graph model, the input amount of the second word graph model can be reduced.
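A small sketch of step 402 is given below: it keeps only the m best-scoring of the n first-pass hypotheses, using heapq.nlargest to avoid a full sort. The hypothesis sentences and scores are illustrative values only.

    import heapq

    # Sketch of step 402: keep the m highest-scoring of the n first-pass
    # hypotheses; hypotheses are (sentence, score) pairs.
    def top_m(nbest, m):
        return heapq.nlargest(m, nbest, key=lambda h: h[1])

    nbest = [
        ("今天 天气 怎么样", 0.648),
        ("近天 天气 怎么样", 0.048),
        ("今天 填起 怎么样", 0.031),
    ]
    print(top_m(nbest, 2))  # the two most credible first paths, best first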
进一步的,如图5所示,上述第一词图模型的构建包括以下步骤:Further, as shown in FIG. 5, the construction of the above-mentioned first word graph model includes the following steps:
步骤501,从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间。Step 501: Extract a word graph unit from the pre-built second word graph space, and construct a first word graph space according to the word graph unit.
步骤502,根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。Step 502: Construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
其中,第二词图模型中的第二词图空间可以是通过本地词典进行配置,也可以是预先下载到本地的词图空间。词图单元可以包括语言单元及对应的权重,语言单元可以理解为第一搜索结果中的音素词。在一种可能的实现方式中,词图单元还可以理解为词图路径。具体的,可以根据第二词图中的语境属性,在第二词图空间中提取出具有各语境属性词图单元构建不同语境的第一词图空间,可以使语音信息在第一词图模型中进行搜索解码的范围变小,从而提高第一词图模型对语音信息解码的速度。以上步骤可以理解为对第二词图空间进行剪枝得到第一词图空间。需要理解的是,上述第一词图模型的数量可以是一个或多个。Wherein, the second word graph space in the second word graph model may be configured through a local dictionary, or may be a word graph space pre-downloaded to the local. The word map unit may include a language unit and a corresponding weight, and the language unit may be understood as a phoneme word in the first search result. In a possible implementation, the word graph unit can also be understood as a word graph path. Specifically, according to the context attributes in the second word graph, word graph units with various context attributes can be extracted from the second word graph space to construct the first word graph space of different contexts, so that the voice information can be in the first word graph space. The search and decoding range in the word graph model becomes smaller, thereby improving the speed of the first word graph model to decode speech information. The above steps can be understood as pruning the second word graph space to obtain the first word graph space. It should be understood that the number of the aforementioned first word graph models may be one or more.
另外,在另一种可能的实现方式中,可以对第一词图空间进行增枝,增加相近语境属性的词图单元,使第一词图空间扩展成为第二词图空间。In addition, in another possible implementation manner, the first word graph space can be augmented to add word graph units with similar context attributes to expand the first word graph space into a second word graph space.
另外，需要说明的是，剪枝后得到的第一词图空间中的各语言单元的权重会随着模型训练而发生变化，相同的语言单元的权重，在第一词图空间与第二词图空间是不相同的。同样的，增枝后得到第二词图空间中的各语言单元的权重与第一词图空间中相同的语言单元的权重也是不相同的。即是，相同的路径在第一词图模型与第二词图模型进行搜索，得到的路径评分不同。比如：In addition, it should be noted that the weights of the language units in the first word graph space obtained after pruning change as the model is trained, so the weight of the same language unit differs between the first word graph space and the second word graph space. Likewise, the weights of the language units in the second word graph space obtained after augmentation differ from the weights of the same language units in the first word graph space. That is, searching the same path in the first word graph model and in the second word graph model yields different path scores. For example:
第一词图模型中，今天(权重0.9) 天气(权重0.8) 怎么样(权重0.9)，该路径评分为所有权重的积，0.9*0.8*0.9=0.648。In the first word graph model: 今天 (weight 0.9) 天气 (weight 0.8) 怎么样 (weight 0.9); the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648.
第二词图模型中，今天(权重0.99) 天气(权重0.98) 怎么样(权重0.99)，该路径评分为所有权重的积，0.99*0.98*0.99=0.960498。In the second word graph model: 今天 (weight 0.99) 天气 (weight 0.98) 怎么样 (weight 0.99); the path score is the product of all the weights, 0.99*0.98*0.99 = 0.960498.
在本实施例中，通过从第二词图空间中提取出具有相同属性的词图单元来对第一词图空间进行构建，可以避免第一搜索结果与第二词图模型不匹配，造成误识别。In this embodiment, the first word graph space is constructed by extracting word graph units with the same attributes from the second word graph space, which avoids a mismatch between the first search result and the second word graph model that would cause misrecognition.
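The pruning described in step 501 can be pictured as filtering the units of the larger word graph space by context attribute. The sketch below assumes a simplified edge list in which each connection carries a set of context tags; a real WFST lattice is stored differently, so this layout is an assumption made only for illustration.

    # Simplified sketch of step 501: build the first word graph space by keeping
    # only those units of the second word graph space that carry the wanted
    # context attribute ("pruning"). The edge-list layout is assumed for
    # illustration; a real WFST lattice is stored differently.
    def prune_by_context(second_space, context):
        # second_space: iterable of (src_word, dst_word, weight, context_tags)
        return [
            (src, dst, weight, tags)
            for (src, dst, weight, tags) in second_space
            if context in tags
        ]

    second_space = [
        ("今天", "天气", 0.8, {"chat", "work"}),
        ("天气", "怎么样", 0.9, {"chat", "work"}),
        ("报表", "汇总", 0.7, {"work"}),
    ]
    first_space = prune_by_context(second_space, "chat")  # keeps chat-context units only
    print(first_space)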
进一步的,如图6所示,上述第一词图模型的构建还包括以下步骤:Further, as shown in FIG. 6, the construction of the above-mentioned first word graph model further includes the following steps:
步骤601,对第一词图模型进行训练,训练至损失函数拟合,得到第一词图空间中词图单元的权重。Step 601: Train the first word graph model to fit the loss function to obtain the weight of the word graph unit in the first word graph space.
其中，当词图单元为语言单元时，可以将构建第一词图模型的语言单元按第二词图模型中的词图组合关系进行组合，可以通过训练第一词图模型来调整语言单元间的权重，得到新的词图空间做为第一词图模型的词图空间。当词图单元为第二路径时，可以通过训练第一词图模型来调整第一词图路径的评分结果。Here, when the word graph units are language units, the language units used to construct the first word graph model can be combined according to the word graph combination relations in the second word graph model, and the weights between language units can be adjusted by training the first word graph model; the new word graph space thus obtained serves as the word graph space of the first word graph model. When the word graph unit is the second path, the scoring results of the first word graph paths can be adjusted by training the first word graph model.
在本实施例中,在通过提取第二词图空间中的词图单元构建第一词图空间时,可以通过对第一词图模型进行训练,从而使第一词图模型的识别精确度提高,另外,不会受到第二词图空间影响。In this embodiment, when constructing the first word graph space by extracting word graph units in the second word graph space, the first word graph model can be trained to improve the recognition accuracy of the first word graph model In addition, it will not be affected by the second word graph space.
进一步的,如图7所示,步骤203具体包括:Further, as shown in FIG. 7, step 203 specifically includes:
步骤701,提取第一搜索结果中的词图单元。Step 701: Extract the word graph unit in the first search result.
步骤702,将第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Step 702: Input the word graph unit in the first search result into the second word graph model for searching.
在本实施例中,当词图单元为语言单元时,可以将语言单元输入到第二词图模型中进行搜索,得到第二词图模型中对应词图单元的第二词图路径及对应的路径评分。当词图单元为第一词图路径时,在第二词图模型中对第一词图路径进行分解,得到语言单元,再将语言单元在第二词图空间中进行路径搜索,得到第二词图路径及对应的路径评分。另外,在词图单元为第一词图路径时,将这些第一词图路径输入到第二词图模型中,与第二词图模型的第二词图空间中的第二词图路径进行匹配,由于第一词图空间与第二词图空间中相同路径可能拥有不同的路径评分,相当于在第二词图空间中对第一搜索结果进行广域验证,保证了语音识别结果的精度。In this embodiment, when the word map unit is a language unit, the language unit can be input into the second word map model for search, and the second word map path of the corresponding word map unit in the second word map model and the corresponding Path score. When the word graph unit is the first word graph path, the first word graph path is decomposed in the second word graph model to obtain the language unit, and then the language unit is searched for the path in the second word graph space to obtain the second word graph space. The word map path and the corresponding path score. In addition, when the word graph unit is the first word graph path, these first word graph paths are input into the second word graph model, and the second word graph path in the second word graph space of the second word graph model is performed. Matching, since the same path in the first word map space and the second word map space may have different path scores, it is equivalent to wide-area verification of the first search result in the second word map space, ensuring the accuracy of the speech recognition results .
在本实施例中,以词图单元的形式将第一搜索结果在第二词图空间中进行搜索,不用再对待识别语音信息进行声学解码,降低了搜索维度,从而提高了语音识别的速度。In this embodiment, the first search result is searched in the second word graph space in the form of a word graph unit, without acoustically decoding the speech information to be recognized, the search dimension is reduced, and the speed of speech recognition is improved.
更进一步的,如图8所示,上述步骤204具体包括:Furthermore, as shown in FIG. 8, the above step 204 specifically includes:
步骤801,根据所述第二路径分数的高低对第二路径进行排序。Step 801: Sort the second path according to the score of the second path.
步骤802,按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。Step 802: Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
其中,可以将评分高的第二路径排在前面,将评分低的排在后面。这样,在选择输出的第二词图路径对应的完整语句会比较直观,比如只选择一个进行输出的情况下,可以将排在最前面的一个第二词图路径对应的完整语句提取出来进行输出,在选择多个进行输出的情况下,可以将排在前面的几个提取出来进行输出,以供用户对输出结果进行选取。Among them, the second path with a high score can be ranked first, and the second path with a low score can be ranked behind. In this way, the complete sentence corresponding to the second word map path selected for output will be more intuitive. For example, if only one is selected for output, the complete sentence corresponding to the first second word map path can be extracted for output. , In the case of selecting multiple output, the top ones can be extracted for output, so that the user can select the output result.
在本实施例中,对第二路径进行排序后再输出,根据排序输出的完整语句,可以使输出的语音识别结果更方便直观。In this embodiment, the second path is sorted and then output. According to the complete sentences output by sorting, the output voice recognition result can be more convenient and intuitive.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过可执行代码来指令相关的硬件来完成，该可执行代码可存储于一计算机可读取非易失性存储介质中，该可执行代码在执行时，可包括如上述各方法的实施例的流程。其中，前述的非易失性存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through executable code. The executable code can be stored in a computer-readable non-volatile storage medium, and when executed it may include the processes of the embodiments of the above methods. The aforementioned non-volatile storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or another non-volatile storage medium.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the various steps in the flowchart of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated in this article, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the order of execution is also It is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.
进一步参考图9，作为对上述图2所示方法的实现，本申请提供了一种语音识别装置的一个实施例，该装置实施例与图2所示的方法实施例相对应，该装置具体可以应用于各种电子设备中。With further reference to FIG. 9, as an implementation of the method shown in FIG. 2, this application provides an embodiment of a speech recognition apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can specifically be applied to various electronic devices.
如图9所示,本实施例的一种语音识别装置900包括:第一获取模块901、第一搜索模块902、第二搜索模块903、输出模块904。其中:As shown in FIG. 9, a speech recognition device 900 of this embodiment includes: a first acquisition module 901, a first search module 902, a second search module 903, and an output module 904. among them:
第一获取模块901,用于获取待识别语音信息;The first acquiring module 901 is configured to acquire voice information to be recognized;
第一搜索模块902,用于将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;The first search module 902 is configured to input the to-be-recognized voice information into the local first word graph model for decoding search to obtain a first search result, the first search result including the first path and the corresponding first path score, The first word graph model includes acoustic model, pronunciation dictionary and first word graph space;
第二搜索模块903,用于将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;The second search module 903 is configured to input the first search result into the local second word graph model for searching, and obtain the second search result. The second search result includes the second path and the corresponding second path score, where the second The word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
输出模块904,用于根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The output module 904 is configured to select a corresponding second path for output according to the second path score to obtain a voice recognition result.
进一步的,参阅图10,所述第一词图模型为配置在本地的至少一个第一词图模型,所述语音识别装置900还包括:第二获取模块905和选择模块906。其中,Further, referring to FIG. 10, the first word graph model is at least one first word graph model configured locally, and the speech recognition apparatus 900 further includes: a second acquisition module 905 and a selection module 906. among them,
第二获取模块905,用于获取用户当前的语境信息;The second obtaining module 905 is used to obtain the current context information of the user;
选择模块906,用于根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。The selection module 906 is configured to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
进一步的,参阅图11,所述第一搜索结果为至少一个路径的路径结果,所述第一搜索模块902包括:解码搜索单元9021、输出单元9022。其中,Further, referring to FIG. 11, the first search result is a path result of at least one path, and the first search module 902 includes: a decoding search unit 9021, an output unit 9022. among them,
解码搜索单元9021,用于通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;The decoding search unit 9021 is configured to obtain the path result of the first path and the corresponding first path score through decoding search;
第一输出单元9022,用于根据所述第一路径分数由高到低依次选取n个路径结果中的 m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。The first output unit 9022 is configured to sequentially select m path results among the n path results according to the first path score from high to low for output to obtain the first search result, where m is less than or equal to n.
进一步的,参阅图12,语音识别装置900还包括第一词图模型构建模块907,所述第一词图模型构建模块907包括;第一提取单元9071、构建单元9072。其中:Further, referring to FIG. 12, the speech recognition device 900 further includes a first word graph model construction module 907, and the first word graph model construction module 907 includes; a first extraction unit 9071, a construction unit 9072. among them:
第一提取单元9071,用于从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;The first extraction unit 9071 is configured to extract the word graph unit from the pre-built second word graph space, and construct the first word graph space according to the word graph unit;
构建单元9072,用于根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The construction unit 9072 is configured to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
进一步的,参阅图13,所述第一词图模型构建模块907还包括训练单元9073。其中:Further, referring to FIG. 13, the first word graph model construction module 907 further includes a training unit 9073. among them:
训练单元9073,对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The training unit 9073 trains the first word graph model, trains to fit the loss function, and obtains the weight of the word graph unit in the first word graph space.
进一步的,参阅图14,所述第二搜索模块903包括:第二提取单元9031、输入单元9032。其中:Further, referring to FIG. 14, the second search module 903 includes: a second extraction unit 9031, an input unit 9032. among them:
第二提取单元9031,用于提取第一搜索结果中的词图单元;The second extraction unit 9031 is used to extract the word map unit in the first search result;
输入单元9032,用于将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。The input unit 9032 is configured to input the word graph unit in the first search result into the second word graph model for search.
进一步的,参阅图15,所述输出模块904包括:排序单元9041、第二输出单元9042。其中:Further, referring to FIG. 15, the output module 904 includes: a sorting unit 9041, a second output unit 9042. among them:
排序单元9041,用于根据所述第二路径分数的高低对第二路径进行排序;The sorting unit 9041 is configured to sort the second path according to the score of the second path;
第二输出单元9042,用于按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。The second output unit 9042 is configured to output the voice recognition results corresponding to the y second paths in order, where y is greater than or equal to 1.
本申请实施例提供的一种语音识别装置能够实现图2至图8的方法实施例中的各个实施方式,以及相应有益效果,为避免重复,这里不再赘述。The voice recognition device provided in the embodiment of the present application can implement the various implementation manners in the method embodiments of FIG. 2 to FIG. 8 and the corresponding beneficial effects. To avoid repetition, details are not described herein again.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图16,图16为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 16 for details. FIG. 16 is a block diagram of the basic structure of the computer device in this embodiment.
计算机设备16括通过系统总线相互通信连接非易失性存储器161、处理器162、网络接口163。需要指出的是,图中仅示出了具有组件161-163的计算机设备16,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable GateArray,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 16 includes a non-volatile memory 161, a processor 162, and a network interface 163 that are communicatively connected to each other through a system bus. It should be pointed out that only the computer device 16 with components 161-163 is shown in the figure, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor, a dedicated Integrated Circuit (Application Specific Integrated Circuit, ASIC), Programmable Gate Array (Field-Programmable GateArray, FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.
计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment can be computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers. The computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
非易失性存储器161至少包括一种类型的可读非易失性存储介质,可读非易失性存储介 质包括闪存、硬盘、多媒体卡、卡型非易失性存储器(例如,SD或DX非易失性存储器等)、只读非易失性存储器(ROM)、电可擦除可编程只读非易失性存储器(EEPROM)、可编程只读非易失性存储器(PROM)、磁性非易失性存储器、磁盘、光盘等。在一些实施例中,非易失性存储器161可以是计算机设备16的内部存储单元,例如该计算机设备16的硬盘或内存。在另一些实施例中,非易失性存储器161也可以是计算机设备16的外部存储设备,例如该计算机设备16上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,非易失性存储器161还可以既包括计算机设备16的内部存储单元也包括其外部存储设备。本实施例中,非易失性存储器161通常用于存储安装于计算机设备16的操作系统和各类应用软件,例如一种语音识别方法的可执行代码等。此外,非易失性存储器161还可以用于暂时地存储已经输出或者将要输出的各类数据。The non-volatile memory 161 includes at least one type of readable non-volatile storage medium. The readable non-volatile storage medium includes flash memory, hard disk, multimedia card, card-type non-volatile memory (for example, SD or DX). Non-volatile memory, etc.), read-only non-volatile memory (ROM), electrically erasable programmable read-only non-volatile memory (EEPROM), programmable read-only non-volatile memory (PROM), magnetic Non-volatile memory, magnetic disks, optical disks, etc. In some embodiments, the non-volatile memory 161 may be an internal storage unit of the computer device 16, such as a hard disk or memory of the computer device 16. In other embodiments, the non-volatile memory 161 may also be an external storage device of the computer device 16, such as a plug-in hard disk, a smart media card (SMC), and a secure digital device equipped on the computer device 16. (Secure Digital, SD) card, flash card (Flash Card), etc. Of course, the non-volatile memory 161 may also include both the internal storage unit of the computer device 16 and its external storage device. In this embodiment, the non-volatile memory 161 is generally used to store an operating system and various application software installed in the computer device 16, such as executable code of a voice recognition method. In addition, the non-volatile memory 161 can also be used to temporarily store various types of data that have been output or will be output.
处理器162在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器162通常用于控制计算机设备16的总体操作。本实施例中,处理器162用于运行非易失性存储器161中存储的可执行代码或者处理数据,例如运行一种语音识别方法的可执行代码。In some embodiments, the processor 162 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 162 is generally used to control the overall operation of the computer device 16. In this embodiment, the processor 162 is configured to run executable codes or process data stored in the non-volatile memory 161, for example, run executable codes for a voice recognition method.
网络接口163可包括无线网络接口或有线网络接口,该网络接口163通常用于在计算机设备16与其他电子设备之间建立通信连接。The network interface 163 may include a wireless network interface or a wired network interface, and the network interface 163 is generally used to establish a communication connection between the computer device 16 and other electronic devices.
本申请还提供了另一种实施方式,即提供一种计算机可读非易失性存储介质,计算机可读非易失性存储介质存储有一种语音识别可执行代码,上述一种语音识别可执行代码可被至少一个处理器执行,以使至少一个处理器执行如上述的一种语音识别方法的步骤。This application also provides another implementation manner, that is, a computer-readable non-volatile storage medium is provided. The computer-readable non-volatile storage medium stores a type of speech recognition executable code, and the above-mentioned type of speech recognition executable code is The code may be executed by at least one processor, so that the at least one processor executes the steps of a speech recognition method as described above.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个非易失性存储介质(如ROM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例的一种语音识别方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a non-volatile storage medium (such as a ROM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the speech recognition method of each embodiment of this application.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the embodiments described above are only a part of the embodiments of the present application, rather than all of the embodiments. The drawings show preferred embodiments of the present application, but do not limit the patent scope of the present application. This application can be implemented in many different forms. On the contrary, the purpose of providing these examples is to make the understanding of the disclosure of this application more thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, for those skilled in the art, it is still possible for those skilled in the art to modify the technical solutions described in each of the foregoing specific embodiments, or equivalently replace some of the technical features. . All equivalent structures made using the contents of the description and drawings of this application, directly or indirectly used in other related technical fields, are similarly within the scope of patent protection of this application.

Claims (19)

  1. 一种语音识别方法,其特征在于,包括下述步骤:A speech recognition method, characterized in that it comprises the following steps:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  2. 根据权利要求1所述的语音识别方法,其特征在于,所述第一词图模型为配置在本地的至少一个第一词图模型,所述第一词图模型对应训练有语境属性,在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前,所述方法还包括:The speech recognition method according to claim 1, wherein the first word graph model is at least one first word graph model configured locally, and the first word graph model is correspondingly trained with context attributes, Before the step of inputting the voice information to be recognized into the first local word graph model for decoding and searching, the method further includes:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  3. 根据权利要求1所述的语音识别方法,其特征在于,所述第一搜索结果包括至少一个第一路径的路径结果,所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果的步骤包括:The speech recognition method according to claim 1, wherein the first search result includes at least one path result of a first path, and the input of the speech information to be recognized into a local first word graph model is performed. The steps of decoding the search to obtain the first search result include:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  4. 根据权利要求1所述的语音识别方法,其特征在于,所述第一词图模型,其构建方法包括以下步骤:The speech recognition method according to claim 1, wherein the method for constructing the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The first word graph model is constructed according to the acoustic model, pronunciation dictionary, and first word graph space.
  5. 根据权利要求4所述的语音识别方法,其特征在于,所述第一词图模型的构建还包括以下步骤:The speech recognition method according to claim 4, wherein the construction of the first word graph model further comprises the following steps:
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  6. 根据权利要求4所述的语音识别方法,其特征在于,所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的步骤包括:The speech recognition method according to claim 4, wherein the step of inputting the first search result into a local second word graph model for searching comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
  7. 根据权利要求1至6中任一所述的语音识别方法,其特征在于,所述根据所述第二路径分数选择对应的第二路径进行输出,得到语音识别结果的步骤具体包括:The speech recognition method according to any one of claims 1 to 6, wherein the step of selecting a corresponding second path for output according to the second path score, and obtaining a speech recognition result specifically includes:
    根据所述第二路径分数的高低对第二路径进行排序;Sort the second path according to the score of the second path;
    按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
  8. 一种语音识别装置,其特征在于,包括:A speech recognition device is characterized in that it comprises:
    获取模块,用于获取待识别语音信息;The acquiring module is used to acquire the voice information to be recognized;
    第一搜索模块,用于将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;The first search module is used to input the to-be-recognized speech information into the local first word graph model for decoding search, and obtain the first search result. The first search result includes the first path and the corresponding first path score. A word graph model includes acoustic model, pronunciation dictionary and first word graph space;
    第二搜索模块,用于将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;The second search module is used to input the first search result into the local second word graph model for searching, and obtain the second search result. The second search result includes the second path and the corresponding second path score, where the second word The graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    输出模块,用于根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The output module is used to select the corresponding second path for output according to the second path score to obtain the voice recognition result.
  9. 根据权利要求8所述的语音识别装置,其特征在于,还包括:The speech recognition device according to claim 8, further comprising:
    第二获取模块,用于获取用户当前的语境信息;The second acquisition module is used to acquire the current context information of the user;
    选择模块,用于根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。The selection module is used to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
  10. 根据权利要求8所述的语音识别装置,其特征在于,还包括:The speech recognition device according to claim 8, further comprising:
    第一提取单元,用于从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;The first extraction unit is configured to extract the word graph unit from the second word graph space constructed in advance, and construct the first word graph space according to the word graph unit;
    构建单元,用于根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建;The construction unit is used to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space;
    训练单元,对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The training unit trains the first word graph model, trains it to fit the loss function, and obtains the weight of the word graph unit in the first word graph space.
  11. 一种计算机设备,包括非易失性存储器和处理器,所述非易失性存储器中存储有计算机可读指令,其特征在于:所述处理器执行所述计算机可读指令时实现以下步骤:A computer device includes a non-volatile memory and a processor. The non-volatile memory stores computer-readable instructions, and is characterized in that: when the processor executes the computer-readable instructions, the following steps are implemented:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  12. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一词图模型为配置在本地的至少一个第一词图模型,所述第一词图模型对应训练有语境属性,在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前,所述处理器执行所述计算机可读指令时还包括以下步骤:The computer device according to claim 11, wherein the first word graph model is at least one first word graph model configured locally, and the first word graph model is correspondingly trained with context attributes, Before the step of inputting the voice information to be recognized into the first local word graph model for decoding and searching, the processor further includes the following steps when executing the computer-readable instruction:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  13. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一搜索结果包括至少一个第一路径的路径结果,所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果的步骤包括:The computer device according to claim 11, wherein the first search result includes at least one path result of a first path, and the voice information to be recognized is input into a local first word graph model The steps of performing a decoding search to obtain the first search result include:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  14. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一词图模型的构建包括以下步骤:The computer device according to claim 11, wherein the construction of the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The first word graph model is constructed according to the acoustic model, pronunciation dictionary, and first word graph space.
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  15. 根据权利要求11所述的一种计算机设备,其特征在于,所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的步骤包括:The computer device according to claim 11, wherein the step of inputting the first search result into a local second word graph model for searching comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
  16. 一种非易失性的计算机可读存储介质，其特征在于，所述计算机可读非易失性存储介质上存储有计算机可读指令，所述计算机可读指令被处理器执行时实现如下语音识别方法的步骤：A non-volatile computer-readable storage medium, wherein the computer-readable non-volatile storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the following speech recognition method:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  17. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一词图模型为配置在本地的至少一个第一词图模型，所述第一词图模型对应训练有语境属性，在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前，所述计算机可读指令被处理器执行时还实现如下步骤：The computer-readable non-volatile storage medium according to claim 16, wherein the first word graph model is at least one first word graph model configured locally and is correspondingly trained with context attributes, and before the step of inputting the to-be-recognized speech information into the local first word graph model for decoding and searching, the computer-readable instructions, when executed by the processor, further implement the following steps:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  18. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一搜索结果包括至少一个第一路径的路径结果，所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索，得到第一搜索结果的步骤包括：The computer-readable non-volatile storage medium according to claim 16, wherein the first search result includes a path result of at least one first path, and the step of inputting the to-be-recognized speech information into the local first word graph model for decoding search to obtain the first search result includes:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  19. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一词图模型的构建包括以下步骤：The computer-readable non-volatile storage medium according to claim 16, wherein the construction of the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建;Constructing the first word graph model according to the acoustic model, the pronunciation dictionary, and the first word graph space;
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  20. 根据权利要求19所述的一种计算机可读非易失性存储介质，所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的具体包括：The computer-readable non-volatile storage medium according to claim 19, wherein said inputting the first search result into the local second word graph model for searching specifically comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
PCT/CN2019/116920 2019-09-20 2019-11-10 Speech identification method and apparatus, computer device and non-volatile storage medium WO2021051514A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910894996.8 2019-09-20
CN201910894996.8A CN110808032B (en) 2019-09-20 2019-09-20 Voice recognition method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021051514A1 true WO2021051514A1 (en) 2021-03-25

Family

ID=69487614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116920 WO2021051514A1 (en) 2019-09-20 2019-11-10 Speech identification method and apparatus, computer device and non-volatile storage medium

Country Status (2)

Country Link
CN (1) CN110808032B (en)
WO (1) WO2021051514A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111341305B (en) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111681661B (en) * 2020-06-08 2023-08-08 北京有竹居网络技术有限公司 Speech recognition method, apparatus, electronic device and computer readable medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN112560496B (en) * 2020-12-09 2024-02-02 北京百度网讯科技有限公司 Training method and device of semantic analysis model, electronic equipment and storage medium
CN113223495B (en) * 2021-04-25 2022-08-26 北京三快在线科技有限公司 Abnormity detection method and device based on voice recognition
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113643706B (en) * 2021-07-14 2023-09-26 深圳市声扬科技有限公司 Speech recognition method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (en) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Voice recognition method and system
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100612839B1 (en) * 2004-02-18 2006-08-18 삼성전자주식회사 Method and apparatus for domain-based dialog speech recognition
KR20140028174A (en) * 2012-07-13 2014-03-10 삼성전자주식회사 Method for recognizing speech and electronic device thereof
CN106856092B (en) * 2015-12-09 2019-11-15 中国科学院声学研究所 Chinese speech keyword retrieval method based on feedforward neural network language model
CN106328147B (en) * 2016-08-31 2022-02-01 中国科学技术大学 Speech recognition method and device
CN106782513B (en) * 2017-01-25 2019-08-23 上海交通大学 Speech recognition realization method and system based on confidence level
CN110070859B (en) * 2018-01-23 2023-07-14 阿里巴巴集团控股有限公司 Voice recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (en) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Voice recognition method and system
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium

Also Published As

Publication number Publication date
CN110808032B (en) 2023-12-22
CN110808032A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
WO2021051514A1 (en) Speech identification method and apparatus, computer device and non-volatile storage medium
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
EP3832519A1 (en) Method and apparatus for evaluating translation quality
CN107430859B (en) Mapping input to form fields
US8620658B2 (en) Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program for speech recognition
WO2021139108A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
US11217236B2 (en) Method and apparatus for extracting information
WO2020001458A1 (en) Speech recognition method, device, and system
US10290299B2 (en) Speech recognition using a foreign word grammar
WO2021135438A1 (en) Multilingual speech recognition model training method, apparatus, device, and storage medium
EP3405912A1 (en) Analyzing textual data
US20140372119A1 (en) Compounded Text Segmentation
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
JPWO2005101235A1 (en) Dialogue support device
WO2020238045A1 (en) Intelligent speech recognition method and apparatus, and computer-readable storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
US20230004798A1 (en) Intent recognition model training and intent recognition method and apparatus
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN114706973A (en) Extraction type text abstract generation method and device, computer equipment and storage medium
CN113850291A (en) Text processing and model training method, device, equipment and storage medium
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1