WO2021051514A1 - Speech recognition method and apparatus, computer device, and non-volatile storage medium - Google Patents

Speech recognition method and apparatus, computer device, and non-volatile storage medium

Info

Publication number
WO2021051514A1
WO2021051514A1 (PCT/CN2019/116920; CN2019116920W)
Authority
WO
WIPO (PCT)
Prior art keywords
word graph
path
search result
model
word
Prior art date
Application number
PCT/CN2019/116920
Other languages
English (en)
Chinese (zh)
Inventor
李秀丰
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051514A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08: Speech classification or search
    • G10L 15/26: Speech to text systems
    • G10L 2015/081: Search algorithms, e.g. Baum-Welch or Viterbi

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a speech recognition method and apparatus, a computer device, and a non-volatile storage medium.
  • N-Gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is also called the Chinese Language Model (CLM).
  • The Chinese language model uses the collocation information between adjacent words in the context: when consecutive pinyin, strokes, or numbers representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), the sentence with the greatest probability can be computed, realizing automatic conversion to Chinese characters without manual selection by the user and avoiding the problem of many Chinese characters corresponding to the same pinyin (or stroke string or number string).
  • The model is based on the assumption that the appearance of the N-th word is related only to the preceding N-1 words and to no other word, so the probability of the entire sentence is the product of the conditional probabilities of its words, each taken over its N-gram context.
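  • For reference, this standard N-gram factorization (a textbook formulation, not specific to this application) can be written as:

        P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})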
  • The decoding network integrates the language model, the dictionary, and the acoustic shared phonetic character set into one large decoding network.
  • In such a network, the path search must also cover the acoustic decoding dimension, so the search volume is large.
  • the purpose of the embodiments of the present application is to provide a speech recognition method, which can reduce the search dimension of the decoding network, increase the search speed of the decoding network, and thereby increase the speed of speech recognition.
  • To this end, the embodiments of the present application provide a speech recognition method, which adopts the following technical solution: acquire the voice information to be recognized; input the voice information to be recognized into a local first word graph model for decoding search to obtain a first search result, where the first search result includes a first path and a corresponding first path score, and the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space;
  • input the first search result into a local second word graph model for searching to obtain a second search result, where the second search result includes a second path and a corresponding second path score, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
  • select and output the corresponding second path according to the second path score in the second search result to obtain the speech recognition result.
  • An embodiment of the present application further provides a speech recognition device, including:
  • an acquiring module, used to acquire the voice information to be recognized;
  • a first search module, used to input the voice information to be recognized into the local first word graph model for decoding search to obtain a first search result, where the first search result includes the first path and the corresponding first path score, and the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space;
  • a second search module, used to input the first search result into the local second word graph model for searching to obtain a second search result, where the second search result includes the second path and the corresponding second path score, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
  • an output module, used to select the corresponding second path for output according to the second path score to obtain the speech recognition result.
  • The embodiments of the present application also provide a computer device, which adopts the following technical solution:
  • the computer device includes a non-volatile memory and a processor, and executable code is stored in the non-volatile memory;
  • when the processor executes the executable code, the processor implements the steps of the speech recognition method described in any one of the embodiments of the present application.
  • the embodiments of the present application also provide a computer-readable non-volatile storage medium, which adopts the following technical solutions:
  • Executable code is stored on the computer-readable non-volatile storage medium, and when the executable code is executed by a processor, the steps of a speech recognition method according to any one of the embodiments of the present application are implemented.
  • In summary, this application inputs the voice information to be recognized into a small word graph model for acoustic decoding and search, and then inputs the search results directly into a larger word graph model for a second search.
  • The second search requires no acoustic decoding, so the search dimensionality is lowered and the amount of word graph search is effectively reduced, thereby shortening the search time and improving the speed of speech recognition.
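  • As an illustrative sketch only, the two-pass flow described above might look like the following Python; the function names and data shapes are hypothetical and are not prescribed by this application:

        from dataclasses import dataclass
        from typing import Callable, List, Tuple

        @dataclass
        class ScoredPath:
            words: Tuple[str, ...]  # phoneme words along the path
            score: float            # path credibility score

        def recognize(audio_frames,
                      first_pass: Callable[[object], List[ScoredPath]],
                      second_pass: Callable[[List[ScoredPath]], List[ScoredPath]],
                      top_y: int = 1) -> List[str]:
            # First pass: acoustic decoding + search in the small first word
            # graph space, returning an n-best list of scored paths.
            nbest = first_pass(audio_frames)
            # Second pass: pure word graph search over those paths in the
            # larger second word graph space -- no acoustic decoding, hence
            # a lower search dimensionality.
            rescored = sorted(second_pass(nbest),
                              key=lambda p: p.score, reverse=True)
            # Output the complete sentences for the top-y second paths.
            return [" ".join(p.words) for p in rescored[:top_y]]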
  • FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application can be applied;
  • FIG. 2 is a schematic flowchart of a speech recognition method of the present application;
  • FIG. 3 is a schematic flowchart of another speech recognition method of the present application;
  • FIG. 4 is a schematic diagram of a specific flow of step 202 in the embodiment of FIG. 2 of the present application;
  • FIG. 5 is a schematic diagram of the construction process of a first word graph model of the present application;
  • FIG. 6 is a schematic diagram of the construction process of another first word graph model of the present application;
  • FIG. 7 is a schematic diagram of a specific flow of step 203 in the embodiment of FIG. 2 of the present application;
  • FIG. 8 is a schematic diagram of a specific flow of step 204 in the embodiment of FIG. 2 of the present application;
  • FIG. 9 is a schematic structural diagram of a speech recognition device of the present application;
  • FIG. 10 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 11 is a schematic diagram of a specific structure of the first search module 902;
  • FIG. 12 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 13 is a schematic diagram of the specific structure of the first word graph model construction module 907;
  • FIG. 14 is a schematic diagram of a specific structure of the second search module 903;
  • FIG. 15 is a schematic diagram of a specific structure of the output module 904;
  • FIG. 16 is a block diagram of the basic structure of a computer device of the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, and 103.
  • The terminal devices 101, 102, and 103 may be various electronic devices that have display screens and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • a voice recognition method provided in the embodiments of the present application is generally executed by a terminal device.
  • a voice recognition device is generally provided in the terminal device.
  • terminal devices, networks, and servers in FIG. 1 are only illustrative, and any number of terminal devices, networks, and servers may be provided according to implementation needs.
  • FIG. 2 a flowchart of an embodiment of a voice recognition method according to the present application is shown.
  • the above-mentioned speech recognition method includes the following steps:
  • Step 201 Acquire voice information to be recognized.
  • an electronic device (such as the terminal device shown in FIG. 1) on which a voice recognition method runs can obtain the voice information to be recognized through a wired connection or a wireless connection.
  • the above wireless connection methods can include but are not limited to 3G/4G connection, WiFi (Wireless-Fidelity) connection, Bluetooth connection, WiMAX (Worldwide Interoperability for Microwave Access) connection, Zigbee connection, UWB (ultra wideband) connection, And other wireless connection methods that are currently known or developed in the future.
  • the aforementioned voice information to be recognized can be collected through a microphone.
  • The microphone can be an external device or built into the device, for example, the microphone built into a voice recorder, mobile phone, tablet, MP4 player, or notebook computer.
  • the aforementioned voice information to be recognized may also be obtained by uploading by the user, for example, storing the collected voice in a storage device, and obtaining the corresponding voice information by reading data in the storage device.
  • the aforementioned voice information to be recognized may also be the voice information of the other party obtained when the user communicates through social software.
  • The voice information to be recognized may also be voice information that has undergone domain conversion, for example, voice information that has been converted from the time domain to the frequency domain.
  • the above-mentioned voice information may also be referred to as voice signal or voice data.
  • Step 202 Input the to-be-recognized speech information into a local first word graph model for decoding search, and obtain a first search result.
  • The first search result includes the first path and the corresponding first path score, and the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • The aforementioned "local" can be an offline environment under a Linux system; offline speech tools for other scenarios can also be configured in the offline environment.
  • The aforementioned speech information to be recognized is the speech information obtained in step 201, and the aforementioned first word graph model is a local word graph model; since the first word graph model is configured locally, the speech information can be decoded without going through the network, thereby improving the speed of speech recognition.
  • The first word graph model can be a word graph model based on a WFST (weighted finite-state transducer); it includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • The acoustic model acoustically decodes the user's speech information into phoneme units.
  • The pronunciation dictionary is used to combine phoneme units into phoneme words.
  • The phoneme words are then connected into paths to form language units.
  • the first word graph model is used to decode and search the speech information to be recognized.
  • the first search result is the search result obtained in the first word graph space.
  • the first search result includes multiple first paths, and each path includes a corresponding path score.
  • The path score is used to indicate the credibility of the path: the higher the score, the more credible the path.
  • A path is a sequence of phoneme words together with the connection weight of each link between them.
  • The weights are obtained by training the first word graph model; the training corpus can be corpora publicly available on the Internet, such as the full People's Daily corpus from 2000 to 2012.
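  • As a minimal sketch of such a weighted path, the phoneme words and weights below are invented for illustration, not taken from any trained model:

        # Hypothetical arc weights between phoneme words in a first word
        # graph space; real weights come from training on a corpus such as
        # the one mentioned above.
        arcs = {
            ("<s>", "今天"): 0.4,
            ("今天", "天气"): 0.5,
            ("天气", "很好"): 0.6,
        }

        def path_score(words):
            """A path's score is the product of the weights of its arcs."""
            score = 1.0
            for prev, curr in zip(words, words[1:]):
                score *= arcs.get((prev, curr), 1e-9)  # floor for unseen arcs
            return score

        print(path_score(("<s>", "今天", "天气", "很好")))  # 0.4 * 0.5 * 0.6 ≈ 0.12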
  • Step 203 Input the first search result into the local second word graph model for searching, and obtain the second search result.
  • The second search result includes the second path and the corresponding second path score, where the second word graph model includes the second word graph space, and the first word graph space is a sub word graph space of the second word graph space.
  • The first search result may be the first search result of step 202 or the n-best result.
  • No acoustic model or dictionary is configured in the second word graph model; the first search result of the first word graph model is used as its input, which saves the acoustic decoding step.
  • The second word graph model can be a local word graph model; since it is configured locally, the voice information can be recognized without going through the network, thereby improving the speed of speech recognition.
  • The second word graph model can be a word graph model based on a WFST, and the second word graph space in the second word graph model can be a static word graph space.
  • A static word graph space is one that has already been trained and whose phoneme word weights remain unchanged.
  • The first search result is searched through this static word graph network.
  • The second search result is the search result obtained in the second word graph model; it includes multiple second paths, and each path includes a corresponding path score.
  • The path score is used to indicate the credibility of the path: the higher the score, the more credible the path.
  • The path score is the product of the weights of the phoneme words along the path; these weights can be obtained by training the second word graph model until the loss function converges.
  • The second word graph space in the second word graph model can be tailored for the user, that is, it can be smaller than a traditional word graph network, reducing the complexity of the word graph network and thereby increasing decoding search speed and the real-time rate of decoding.
  • Step 204 Select a corresponding second path for output according to the second path score in the second search result, and obtain a voice recognition result.
  • the second path includes a complete sentence composed of phoneme words and a corresponding path score.
  • The path score is used to indicate the credibility of the sentence: the higher the path score, the higher the credibility that the sentence reflects the true content of the voice information.
  • the complete sentence corresponding to the second path with the highest path score can be selected for output, thereby obtaining a speech recognition result.
  • Alternatively, multiple complete sentences corresponding to second paths with higher path scores can be selected for output, thereby obtaining multiple speech recognition results from which the user can choose.
  • In the embodiments of this application, the voice information to be recognized is acquired; the voice information to be recognized is input into the local first word graph model for decoding search to obtain a first search result, where the first search result includes the first path and the corresponding first path score, and the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space; the first search result is input into the local second word graph model for searching to obtain a second search result, where the second search result includes the second path and the corresponding second path score, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space; the corresponding second path is selected and output according to the second path score to obtain the speech recognition result.
  • The second search process does not require acoustic decoding, so the search dimensionality becomes lower, which effectively reduces the amount of word graph search, thereby reducing search time and improving the speed of speech recognition.
  • the above voice recognition method further includes:
  • Step 301 Acquire current context information of the user.
  • The current context information can be determined according to the time: for example, working hours from 9:00 to 17:00 can be determined as a working context; the weekend can be determined as a vacation context; and after 22:00 or before 8:00 can be determined as a resting context. It can also be determined based on where the voice to be recognized comes from: for example, a voice obtained from a friend on WeChat can be determined as a friend-chat context, while a voice obtained from a user annotated as a customer in WeChat or other social software can be determined as a working context. In a possible implementation manner, the context may also be selected manually by the user; selecting the context himself makes the obtained context information more accurate.
  • Step 302 Select the corresponding first word graph model according to the user's current context information to decode and search the voice information.
  • The above-mentioned first word graph model may be a first word graph model with context attributes, where each first word graph model corresponds to one or more context attributes, so the context information obtained through step 301 can be matched to the corresponding first word graph model. Matching the context information to the corresponding first word graph model makes the results of the first word graph model fit the context better and improves accuracy.
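  • As an illustration only, the time-based rules and per-context models described above could be sketched in Python as follows; the context labels, time thresholds, and model placeholders are hypothetical, not prescribed by this application:

        from datetime import datetime

        def infer_context(now: datetime) -> str:
            """Map the current time to a context label per the rules above."""
            if now.weekday() >= 5:              # Saturday or Sunday
                return "vacation"
            if now.hour >= 22 or now.hour < 8:  # late night / early morning
                return "rest"
            if 9 <= now.hour < 17:              # working hours
                return "work"
            return "default"

        # One first word graph model per context attribute (placeholders).
        first_models = {"work": ..., "vacation": ..., "rest": ..., "default": ...}
        model = first_models[infer_context(datetime.now())]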
  • the above-mentioned first search result is a path result of at least one path
  • Step 401 Obtain the path result of the first path and the corresponding first path score through decoding search.
  • Step 402 According to the first path score from high to low, m path results from n path results are sequentially selected for output, to obtain a first search result, where m is less than or equal to n.
  • Through decoding search, a score can be obtained for each search result (first path) under the first word graph model; that is, the n search results (first paths) correspond to n scores, and the n-best results sorted by score are taken as the first search result.
  • The first search results may be sorted by n-best score, that is, the search result corresponding to the highest first path score is ranked first.
  • By outputting only the top m of the n path results, the input volume of the second word graph model can be reduced.
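  • A minimal sketch of this n-best truncation follows; the path labels and scores are invented for illustration:

        def select_top_m(paths_with_scores, m):
            """Sort n first paths by score, high to low, and keep the top m (m <= n)."""
            ranked = sorted(paths_with_scores, key=lambda ps: ps[1], reverse=True)
            return ranked[:m]

        first_search_result = select_top_m(
            [("path_a", 0.12), ("path_b", 0.30), ("path_c", 0.07)], m=2)
        # -> [("path_b", 0.30), ("path_a", 0.12)]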
  • the construction of the above-mentioned first word graph model includes the following steps:
  • Step 501 Extract a word graph unit from the pre-built second word graph space, and construct a first word graph space according to the word graph unit.
  • Step 502 Construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • the second word graph space in the second word graph model may be configured through a local dictionary, or may be a word graph space pre-downloaded to the local.
  • A word graph unit may include a language unit and a corresponding weight; a language unit can be understood as a phoneme word in the first search result.
  • A word graph unit can also be understood as a word graph path.
  • Word graph units with various context attributes can be extracted from the second word graph space to construct first word graph spaces for different contexts, so that the search and decoding range of the voice information in the first word graph model becomes smaller, thereby improving the speed at which the first word graph model decodes the speech information.
  • the above steps can be understood as pruning the second word graph space to obtain the first word graph space. It should be understood that the number of the aforementioned first word graph models may be one or more.
  • Conversely, the first word graph space can be augmented with word graph units having similar context attributes, expanding the first word graph space toward the second word graph space.
  • After pruning, the weight of each language unit in the first word graph space will change as the model is trained, so the weight of the same language unit is not the same in the first word graph space and the second word graph space.
  • That is, if the same path is searched in the first word graph model and the second word graph model, the path scores obtained are different.
  • Constructing the first word graph space by extracting word graph units with the same attributes from the second word graph space avoids mismatches between the first search result and the second word graph model that would cause recognition errors.
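  • A minimal sketch of this pruning step, assuming a hypothetical representation in which each word graph unit carries a context attribute:

        # Hypothetical second word graph space: each word graph unit is an
        # arc between language units, with a weight and a context attribute.
        second_space = [
            {"arc": ("会议", "纪要"), "weight": 0.7, "context": "work"},
            {"arc": ("周末", "度假"), "weight": 0.6, "context": "vacation"},
            {"arc": ("晚安", "休息"), "weight": 0.8, "context": "rest"},
        ]

        def build_first_space(second_space, context):
            """Prune the second word graph space down to the units matching
            one context attribute, yielding a smaller first word graph
            (sub-)space. The kept weights are then re-trained, so they end
            up differing from the same units' weights in the second space."""
            return [dict(unit) for unit in second_space
                    if unit["context"] == context]

        work_space = build_first_space(second_space, "work")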
  • the construction of the above-mentioned first word graph model further includes the following steps:
  • Step 601 Train the first word graph model to fit the loss function to obtain the weight of the word graph unit in the first word graph space.
  • When the word graph unit is a language unit, the language units that construct the first word graph model can be combined according to the word graph combination relationships in the second word graph model, and the first word graph model can be trained to adjust the weights of the language units; the resulting new word graph space is taken as the word graph space of the first word graph model.
  • the scoring result of the first word graph path can be adjusted by training the first word graph model.
  • When the first word graph space is constructed by extracting word graph units from the second word graph space, the first word graph model can be trained to improve its recognition accuracy without being affected by the second word graph space.
  • step 203 specifically includes:
  • Step 701 Extract the word graph unit in the first search result.
  • Step 702 Input the word graph unit in the first search result into the second word graph model for searching.
  • When the word graph unit is a language unit, the language unit can be input into the second word graph model for search to obtain the second word graph path of the corresponding word graph unit in the second word graph model and the corresponding path score.
  • When the word graph unit is a first word graph path, the first word graph path can be decomposed into language units in the second word graph model, and the language units are then path-searched in the second word graph space to obtain the second word graph path and the corresponding path score.
  • Alternatively, these first word graph paths are input into the second word graph model and matched against the second word graph paths in the second word graph space; since the same path may have different path scores in the first word graph space and the second word graph space, this is equivalent to a wide-area verification of the first search result in the second word graph space, ensuring the accuracy of the speech recognition results.
  • The first search result is searched in the second word graph space in the form of word graph units, without acoustically decoding the speech information to be recognized; the search dimensionality is reduced, and the speed of speech recognition is improved.
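  • A minimal sketch of this second-pass search, reusing the path-score convention from the earlier sketch; the second-space arc weights passed in are hypothetical:

        def rescore_in_second_space(first_results, second_arcs):
            """Re-score first-pass paths against the second word graph space.
            The same arc can carry a different weight there, so each path
            gets a new, wide-area verified score without any acoustic
            decoding."""
            rescored = []
            for words, _first_score in first_results:
                score = 1.0
                for prev, curr in zip(words, words[1:]):
                    score *= second_arcs.get((prev, curr), 1e-9)
                rescored.append((words, score))
            return rescored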
  • step 204 specifically includes:
  • Step 801 Sort the second path according to the score of the second path.
  • Step 802 Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
  • The second paths with high scores can be ranked first, and those with low scores ranked behind.
  • Selecting the complete sentences corresponding to the sorted second word graph paths for output is more intuitive: if only one result is output, the complete sentence corresponding to the first-ranked second word graph path can be extracted; if multiple results are output, the top-ranked ones can be extracted so that the user can select among the output results.
  • By sorting the second paths before output, the complete sentences are output in order, making the speech recognition results more convenient and intuitive.
  • The aforementioned non-volatile storage medium may be, for example, a magnetic disk, an optical disc, or a read-only memory (Read-Only Memory, ROM).
  • This application provides an embodiment of a speech recognition device.
  • The device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.
  • The speech recognition device 900 of this embodiment includes: a first acquisition module 901, a first search module 902, a second search module 903, and an output module 904, wherein:
  • the first acquiring module 901 is configured to acquire the voice information to be recognized;
  • the first search module 902 is configured to input the voice information to be recognized into the local first word graph model for decoding search to obtain a first search result, where the first search result includes the first path and the corresponding first path score, and the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space;
  • the second search module 903 is configured to input the first search result into the local second word graph model for searching to obtain a second search result, where the second search result includes the second path and the corresponding second path score, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
  • the output module 904 is configured to select a corresponding second path for output according to the second path score to obtain a voice recognition result.
  • In some embodiments, the first word graph model is at least one first word graph model configured locally, and the speech recognition device 900 further includes a second acquisition module 905 and a selection module 906, wherein:
  • the second obtaining module 905 is used to obtain the current context information of the user;
  • the selection module 906 is configured to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
  • In some embodiments, the first search result is a path result of at least one path, and the first search module 902 includes a decoding search unit 9021 and a first output unit 9022, wherein:
  • the decoding search unit 9021 is configured to obtain the path result of the first path and the corresponding first path score through decoding search;
  • the first output unit 9022 is configured to sequentially select m path results among the n path results according to the first path score from high to low for output to obtain the first search result, where m is less than or equal to n.
  • In some embodiments, the speech recognition device 900 further includes a first word graph model construction module 907, and the first word graph model construction module 907 includes a first extraction unit 9071 and a construction unit 9072, wherein:
  • the first extraction unit 9071 is configured to extract the word graph unit from the pre-built second word graph space, and construct the first word graph space according to the word graph unit;
  • the construction unit 9072 is configured to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • In some embodiments, the first word graph model construction module 907 further includes a training unit 9073, wherein:
  • the training unit 9073 is configured to train the first word graph model to fit the loss function and obtain the weights of the word graph units in the first word graph space.
  • In some embodiments, the second search module 903 includes a second extraction unit 9031 and an input unit 9032, wherein:
  • the second extraction unit 9031 is used to extract the word graph units in the first search result;
  • the input unit 9032 is configured to input the word graph unit in the first search result into the second word graph model for search.
  • In some embodiments, the output module 904 includes a sorting unit 9041 and a second output unit 9042, wherein:
  • the sorting unit 9041 is configured to sort the second paths according to the second path scores;
  • the second output unit 9042 is configured to output the voice recognition results corresponding to the y second paths in order, where y is greater than or equal to 1.
  • the voice recognition device provided in the embodiment of the present application can implement the various implementation manners in the method embodiments of FIG. 2 to FIG. 8 and the corresponding beneficial effects. To avoid repetition, details are not described herein again.
  • FIG. 16 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 16 includes a non-volatile memory 161, a processor 162, and a network interface 163 that are communicatively connected to each other through a system bus. It should be pointed out that only a computer device 16 with components 161-163 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, and so on.
  • the computer equipment can be computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers.
  • the computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
  • The non-volatile memory 161 includes at least one type of readable non-volatile storage medium.
  • The readable non-volatile storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and so on.
  • the non-volatile memory 161 may be an internal storage unit of the computer device 16, such as a hard disk or memory of the computer device 16.
  • In other embodiments, the non-volatile memory 161 may also be an external storage device of the computer device 16, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the computer device 16.
  • the non-volatile memory 161 may also include both the internal storage unit of the computer device 16 and its external storage device.
  • the non-volatile memory 161 is generally used to store an operating system and various application software installed in the computer device 16, such as executable code of a voice recognition method.
  • the non-volatile memory 161 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 162 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 162 is generally used to control the overall operation of the computer device 16.
  • the processor 162 is configured to run executable codes or process data stored in the non-volatile memory 161, for example, run executable codes for a voice recognition method.
  • the network interface 163 may include a wireless network interface or a wired network interface, and the network interface 163 is generally used to establish a communication connection between the computer device 16 and other electronic devices.
  • This application also provides another implementation manner, that is, a computer-readable non-volatile storage medium is provided.
  • The computer-readable non-volatile storage medium stores speech recognition executable code, and the speech recognition executable code may be executed by at least one processor, so that the at least one processor executes the steps of the speech recognition method described above.
  • The technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium (such as a ROM, a magnetic disk, or an optical disc) and includes a number of instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the speech recognition method of the various embodiments of the present application.

Abstract

Disclosed are a speech recognition method and apparatus, a computer device, and a non-volatile storage medium, which belong to the technical field of artificial intelligence. The method comprises the steps of: acquiring voice information to be recognized (201); inputting the voice information to be recognized into a local first word graph model for decoding search so as to obtain a first search result (202), the first search result comprising a first path and a corresponding first path score, and the first word graph model comprising an acoustic model, a pronunciation dictionary, and a first word graph space; inputting the first search result into a local second word graph model for searching so as to obtain a second search result (203), the second search result comprising a second path and a corresponding second path score, the second word graph model comprising a second word graph space, and the first word graph space being a sub word graph space of the second word graph space; and, according to the second path score in the second search result, selecting the corresponding second path for output so as to obtain a speech recognition result (204). By using the described method, the search dimensionality is lowered and the amount of word graph search is reduced, thereby shortening the search time and increasing the speed of speech recognition.
PCT/CN2019/116920 2019-09-20 2019-11-10 Speech recognition method and apparatus, computer device, and non-volatile storage medium WO2021051514A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910894996.8 2019-09-20
CN201910894996.8A CN110808032B (zh) 2019-09-20 Speech recognition method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021051514A1 (fr)

Family

ID=69487614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116920 WO2021051514A1 (fr) 2019-09-20 2019-11-10 Speech recognition method and apparatus, computer device, and non-volatile storage medium

Country Status (2)

Country Link
CN (1) CN110808032B (fr)
WO (1) WO2021051514A1 (fr)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111341305B (zh) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, apparatus, and system
CN111681661B (zh) * 2020-06-08 2023-08-08 北京有竹居网络技术有限公司 Speech recognition method and apparatus, electronic device, and computer-readable medium
CN111916058A (zh) * 2020-06-24 2020-11-10 西安交通大学 Speech recognition method and system based on incremental word graph rescoring
CN112560496B (zh) * 2020-12-09 2024-02-02 北京百度网讯科技有限公司 Training method and apparatus for a semantic analysis model, electronic device, and storage medium
CN113223495B (zh) * 2021-04-25 2022-08-26 北京三快在线科技有限公司 Anomaly detection method and apparatus based on speech recognition
CN112863489B (zh) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device, and medium
CN113643706B (zh) * 2021-07-14 2023-09-26 深圳市声扬科技有限公司 Speech recognition method and apparatus, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (zh) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
CN107195296A (zh) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 Speech recognition method, apparatus, terminal, and system
CN108305634A (zh) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Decoding method, decoder, and storage medium
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108510990A (zh) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Speech recognition method, apparatus, user equipment, and storage medium
CN109036391A (zh) * 2018-06-26 2018-12-18 华为技术有限公司 Speech recognition method, apparatus, and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100612839B1 (ko) * 2004-02-18 2006-08-18 삼성전자주식회사 Domain-based dialogue speech recognition method and apparatus
KR20140028174A (ko) * 2012-07-13 2014-03-10 삼성전자주식회사 Speech recognition method and electronic device applying the same
CN106856092B (zh) * 2015-12-09 2019-11-15 中国科学院声学研究所 Chinese speech keyword retrieval method based on a feedforward neural network language model
CN106328147B (zh) * 2016-08-31 2022-02-01 中国科学技术大学 Speech recognition method and apparatus
CN106782513B (zh) * 2017-01-25 2019-08-23 上海交通大学 Confidence-based speech recognition implementation method and system
CN110070859B (zh) * 2018-01-23 2023-07-14 阿里巴巴集团控股有限公司 Speech recognition method and apparatus

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (zh) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
CN107195296A (zh) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 Speech recognition method, apparatus, terminal, and system
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108305634A (zh) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Decoding method, decoder, and storage medium
CN109036391A (zh) * 2018-06-26 2018-12-18 华为技术有限公司 Speech recognition method, apparatus, and system
CN108510990A (zh) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Speech recognition method, apparatus, user equipment, and storage medium

Also Published As

Publication number Publication date
CN110808032A (zh) 2020-02-18
CN110808032B (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
WO2021051514A1 (fr) Speech recognition method and apparatus, computer device, and non-volatile storage medium
WO2021232725A1 (fr) Voice interaction-based information verification method and apparatus, device, and computer storage medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
EP3832519A1 (fr) Method and apparatus for evaluating translation quality
CN107430859B (zh) Mapping input to form fields
US10176804B2 (en) Analyzing textual data
US8620658B2 (en) Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program for speech recognition
WO2021139108A1 (fr) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
US11217236B2 (en) Method and apparatus for extracting information
WO2020001458A1 (fr) Speech recognition method, device, and system
US10290299B2 (en) Speech recognition using a foreign word grammar
WO2021135438A1 (fr) Multilingual speech recognition model training method, apparatus, device, and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
US20140372119A1 (en) Compounded Text Segmentation
JPWO2005101235A1 (ja) Dialogue support device
WO2020238045A1 (fr) Intelligent speech recognition method and apparatus, and computer-readable storage medium
WO2021218028A1 (fr) Artificial intelligence-based interview content refinement method, apparatus, device, and medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN111126084B (zh) Data processing method and apparatus, electronic device, and storage medium
US20230004798A1 (en) Intent recognition model training and intent recognition method and apparatus
CN110852075B (zh) Speech transcription method and apparatus for automatically adding punctuation, and readable storage medium
CN110503956B (zh) Speech recognition method, apparatus, medium, and electronic device
CN114706973A (zh) Extractive text summary generation method, apparatus, computer device, and storage medium
CN113850291A (zh) Text processing and model training method, apparatus, device, and storage medium
CN112489634A (zh) Language acoustic model training method, apparatus, electronic device, and computer medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1