WO2021179701A1 - Multilingual speech recognition method, device and electronic equipment - Google Patents

Multilingual speech recognition method, device and electronic equipment

Info

Publication number
WO2021179701A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition result
target
probability
speech
recognition
Prior art date
Application number
PCT/CN2020/134543
Other languages
English (en)
French (fr)
Inventor
刘博卿
王健宗
张之勇
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021179701A1 publication Critical patent/WO2021179701A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise

Definitions

  • The present disclosure relates to the field of speech recognition technology, and in particular to a multilingual speech recognition method, device and electronic equipment.
  • The present disclosure provides a multilingual speech recognition method, device and electronic equipment, whose main purpose is to reduce the difficulty of multilingual speech recognition.
  • A multilingual speech recognition method includes: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, each screening out one candidate recognition result of the corresponding language, and respectively determining the probability that each candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
  • The present disclosure also provides a multilingual speech recognition device. The device includes: an acquisition module configured to obtain a target speech to be recognized; a decoding module configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; a rescoring module configured to call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech; and a screening module configured to sort the candidate recognition results in descending order of the probabilities and screen the target recognition result from the top preset number of candidate recognition results.
  • The present disclosure also provides an electronic device, which includes: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the following method: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
  • The present disclosure also provides a computer-readable storage medium in which at least one instruction is stored, the at least one instruction being executed by a processor in an electronic device to implement the same method: obtaining the target speech to be recognized; decoding it with the pre-trained acoustic model and the pre-trained multilingual language model to obtain the recognition result search grid; rescoring the search grid with the multiple pre-trained monolingual language models to screen out one candidate recognition result per language, together with the probability that each candidate is the target recognition result; and sorting the candidates in descending order of probability and screening the target recognition result from the top preset number of candidates.
  • The embodiments of the present disclosure circumvent the step of detecting the language category when performing multilingual speech recognition, thereby avoiding the difficulties of training a language detection model and reducing the difficulty of multilingual speech recognition. The solution can be applied in smart government affairs, smart healthcare, or other fields with speech recognition needs, thus promoting the construction of smart cities. Likewise, this solution can be applied to the online consultation link in digital healthcare.
  • FIG. 1 is a schematic flowchart of a multilingual speech recognition method provided by an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of modules of a multilingual speech recognition device provided by an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device for implementing a multilingual speech recognition method provided by an embodiment of the present disclosure.
  • The technical solution of this application can be applied to the fields of artificial intelligence, smart cities, digital healthcare, blockchain and/or big data technology, for example to realize smart government affairs, smart healthcare, and the like.
  • The data involved in this application, such as speech, training text and/or recognition results, can be stored in a database or in a blockchain, for example through distributed blockchain storage, which is not limited in this application.
  • The present disclosure provides a multilingual speech recognition method.
  • Referring to FIG. 1, it is a schematic flowchart of a multilingual speech recognition method provided by an embodiment of the present disclosure.
  • The method can be executed by a device, and the device can be implemented by software and/or hardware.
  • In this embodiment, the multilingual speech recognition method includes the following steps (a minimal sketch of the overall flow follows the steps).
  • Step S1: Obtain a target speech to be recognized.
  • Step S2: Call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtain a recognition result search grid of the target speech.
  • Step S3: Call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, each screening out one candidate recognition result of the corresponding language, and determine the probability that each candidate recognition result is the target recognition result of the target speech.
  • Step S4: Sort the candidate recognition results in descending order of the probabilities, and screen the target recognition result from the top preset number of candidate recognition results.
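  • For orientation only, steps S1-S4 can be summarized in the following minimal Python sketch. The `decode` callable, the `rescore` method and the `Candidate` type are illustrative placeholders standing in for the models described below, not an API defined by this disclosure.

```python
# Minimal sketch of steps S1-S4. All names are illustrative placeholders;
# the disclosure does not define a concrete programming interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str           # candidate recognition result
    probability: float  # probability of being the target recognition result


def recognize(target_speech,
              decode: Callable,       # acoustic model + multilingual LM (S2)
              monolingual_lms: List,  # one rescoring model per language (S3)
              top_m: int = 2) -> str:
    # S2: decode to obtain the recognition result search grid (lattice).
    lattice = decode(target_speech)
    # S3: each monolingual model rescores the lattice and returns one
    # candidate in its own language together with a probability.
    candidates = [lm.rescore(lattice) for lm in monolingual_lms]
    # S4: sort in descending order of probability and select the target
    # result from the top `top_m` candidates (here simply the best one).
    candidates.sort(key=lambda c: c.probability, reverse=True)
    return candidates[0].text
```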
  • The multilingual speech recognition method provided by the embodiments of the present disclosure can be applied in various application scenarios with speech recognition needs, such as intelligent customer service and intelligent robot consultation.
  • In one embodiment, the multilingual speech recognition method provided by the embodiment of the present disclosure is applied to intelligent customer service.
  • The intelligent customer service system trains and deploys in advance on a cloud server an acoustic model, a multilingual language model, a Chinese language model, an English language model and a Japanese language model.
  • The multilingual language model is a language model trained on text that mixes Chinese, English and Japanese;
  • the Chinese language model is a language model trained using only Chinese training text;
  • the English language model is a language model trained using only English training text;
  • the Japanese language model is a language model trained using only Japanese training text.
  • When the user establishes a call with the intelligent customer service system through a client (for example, a mobile phone or a personal computer), the system uploads the user's speech collected through the client to the cloud server, and calls the acoustic model and the multilingual language model to decode the user's speech and obtain the recognition result search grid. The Chinese language model is then called to rescore the search grid, screening out the Chinese candidate recognition result R1 and determining the probability P1 that R1 is the user's true semantics; in parallel, rescoring with the English language model screens out the English candidate R2 with probability P2, and rescoring with the Japanese language model screens out the Japanese candidate R3 with probability P3.
  • The intelligent customer service system then sorts the above three candidate recognition results in descending order of P1, P2 and P3, screens the target recognition result from the top two candidates, and determines the user's true semantics.
  • If the intelligent customer service system determines that candidate recognition result R2 is the user's true semantics, it can further determine the response to R2 according to a preset natural language processing model, convert the response from text into speech, and feed it back to the user in speech form. Thus, even if the user mixes English or Japanese into Chinese during the call, the intelligent customer service system can accurately understand the user's true semantics and respond accurately.
  • In the embodiment of the present disclosure, the target speech to be recognized is obtained.
  • Specifically, the user's chat speech can be collected through a microphone and used as the target speech to be recognized; the semantics of the chat speech are then determined through the subsequent processing of the embodiment, that is, the target recognition result of the target speech is obtained.
  • Further, on the basis of the obtained target recognition result, the chat speech can be translated into speech of a specific language, or converted into text.
  • The obtained target speech may contain multiple languages.
  • For example, the target speech contains Chinese and English; or Chinese and Japanese; or Chinese, English and Japanese.
  • In the embodiment of the present disclosure, at least one acoustic model and at least one multilingual language model are pre-trained.
  • After the target speech is obtained, the acoustic model and the multilingual language model are called to decode it and obtain the recognition result search grid of the target speech.
  • The recognition result search grid (a lattice) can be regarded as the search space used to find the correct target recognition result; it is a compact data structure containing multiple alternative paths.
  • Specifically, calling the acoustic model and the multilingual language model to decode the target speech means searching the state network of the target speech, scoring the searched paths, and then pruning and filtering the state network according to the scores, so as to obtain a more compact recognition result search grid whose paths are closer to the target recognition result.
  • The acoustic model is a model that processes the raw audio data and extracts the phoneme information in it.
  • Specifically, a Gaussian mixture model combined with a hidden Markov model (GMM-HMM) can be used as the acoustic model,
  • and a deep neural network (DNN) can also be used as the acoustic model.
  • A multilingual language model is a language model whose training text contains multiple different languages. Since training the multilingual language model is mainly limited to the finite number of languages and the finite number of specific sentences in the training text, when the model is put into actual speech recognition, the recognition result search grid obtained by decoding has relatively high perplexity.
  • In one embodiment, before calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further includes: performing noise reduction on the target speech to obtain a denoised target speech; and performing feature extraction on the denoised target speech to obtain the speech frame sequence of the target speech used as input to the acoustic model.
  • In this embodiment, noise reduction is performed on the target speech to enhance it. Feature extraction is then performed on the denoised target speech, converting it from a time-domain signal to a frequency-domain signal and obtaining the speech frame sequence used as acoustic model input (for example, the frame-level FBANK features of the target speech).
  • Before decoding, distortion removal or other speech preprocessing may also be performed on the target speech to further improve the overall accuracy of multilingual speech recognition (a preprocessing sketch follows).
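  • As a concrete illustration of this preprocessing step, the following is a minimal sketch assuming the librosa library; the spectral-subtraction denoising and all parameter values (16 kHz sampling, 25 ms / 10 ms framing, 40 Mel bands) are illustrative choices, not values fixed by the disclosure.

```python
# Minimal sketch: simple noise reduction followed by log-Mel filterbank
# (FBANK) features per frame. librosa is assumed to be available; the
# denoising scheme and parameters are illustrative only.
import numpy as np
import librosa


def fbank_features(path: str, n_mels: int = 40) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)

    # Crude spectral subtraction: estimate the noise floor from the quietest
    # spectrogram bins and subtract it from the magnitude spectrogram.
    stft = librosa.stft(y, n_fft=400, hop_length=160)      # 25 ms / 10 ms frames
    mag = np.abs(stft)
    noise = np.percentile(mag, 10, axis=1, keepdims=True)  # noise estimate
    mag = np.maximum(mag - noise, 0.0)

    # Time domain -> frequency domain features: Mel filterbank, then log.
    mel = librosa.feature.melspectrogram(S=mag**2, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-10).T    # shape: (num_frames, n_mels)
```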
  • In one embodiment, the multilingual language model is pre-trained by the following method.
  • The method includes: obtaining a first training text corresponding to a first language and a second training text corresponding to a second language; inputting the first training text and the second training text together into the multilingual language model to obtain the first recognition result of the first training text and the second recognition result of the second training text output by the model; determining the recognition error of the multilingual language model based on the first and second recognition results; and adjusting the parameters of the multilingual language model by backpropagating the recognition error until the recognition error is less than a preset error threshold.
  • In this embodiment, training text containing two languages is used to pre-train the multilingual language model.
  • Specifically, the first training text is in the first language, the second training text is in the second language, and the first and second languages are different. On the premise of keeping the semantics of each sentence intact, the first and second training texts are mixed together and input into the multilingual language model; the model then processes the input text with its current parameters and outputs the first recognition result of the first training text and the second recognition result of the second training text.
  • At this point, the first recognition result output by the multilingual language model may be wrong, and so may the second.
  • Since the correct recognition results of the first and second training texts are both determined in advance, the recognition error of the multilingual language model can be determined based on the first and second recognition results it outputs. A backpropagation algorithm is then used to adjust the model's parameters in order from back to front.
  • The first and second training texts are mixed together again and input into the multilingual language model, looping continuously until the recognition error of the model is less than the preset error threshold.
  • For example, the preset error threshold is 5%.
  • A certain amount of Chinese training text and a certain amount of English training text are collected in advance, and the sentences of the Chinese training text are randomly inserted, sentence by sentence, into the English training text to obtain mixed Chinese-English training text.
  • The mixed training text is input into a language model A to be trained, yielding the first recognition result of the Chinese part and the second recognition result of the English part output by language model A; the recognition error of language model A is determined on this basis, and its parameters are adjusted by backpropagating that error. The mixed text is then fed into the adjusted model again, looping until the recognition error of language model A is below 5%, yielding a pre-trained multilingual language model A able to recognize mixed Chinese-English text with a certain accuracy (see the sketch below).
  • It should be noted that this embodiment only exemplarily shows the process of pre-training a multilingual language model. Understandably, according to the application requirements of the scenario, training texts of three or more languages can be mixed together when pre-training the multilingual language model. This embodiment should not limit the function and scope of use of the present disclosure.
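  • For illustration, a minimal PyTorch-style sketch of this pretraining loop follows. Only the sentence mixing, the backpropagated recognition error and the 5%-style stopping threshold come from the text above; the token-level model interface, error metric and data layout are assumptions.

```python
# Minimal sketch of the pretraining loop described above (PyTorch assumed).
# `model` is any token-level language model returning per-token logits; each
# corpus element is an (input_ids, target_ids) sentence pair.
import random
import torch.nn.functional as F


def pretrain_multilingual_lm(model, first_text, second_text, optimizer,
                             error_threshold=0.05, max_epochs=100):
    mixed = first_text + second_text    # mix corpora, keeping sentences intact
    for _ in range(max_epochs):
        random.shuffle(mixed)
        errors, total = 0, 0
        for inputs, targets in mixed:
            logits = model(inputs)      # forward pass with current parameters
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
            optimizer.zero_grad()
            loss.backward()             # backpropagate the recognition error
            optimizer.step()
            errors += (logits.argmax(dim=-1) != targets).sum().item()
            total += targets.numel()
        if errors / total < error_threshold:  # stop once error < threshold
            return
```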
  • In one embodiment, calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech includes: calling the acoustic model with the speech frames of the target speech as input and, for each speech frame, outputting the first probability of the frame corresponding to each state and the second probability of mutual transitions between the states; obtaining the third probability, produced by the pre-training of the multilingual language model, that describes the statistical regularities of word order; and decoding the target speech based on the first, second and third probabilities to obtain the recognition result search grid.
  • In this embodiment, decoding mainly refers to obtaining, through search, the probability of each path in a search space composed of knowledge sources such as the acoustic model, the acoustic context and the language model, so that the best path, or the top-ranked paths, can be selected according to the probabilities. Each path corresponds to one recognition result.
  • Specifically, the acoustic model integrates knowledge of acoustics and phonetics, takes the speech frame sequence of the target speech as input, and, for each speech frame in the sequence, outputs the first probability of the frame corresponding to each state and the second probability of mutual transitions between the frame's states.
  • That is, the first probability describes the probability between frame and state; the second probability describes the probability between state and state.
  • A state is a speech unit of finer granularity than a phoneme in speech processing; one phoneme corresponds to multiple states.
  • The essence of speech recognition is to determine the state corresponding to each speech frame, then the phonemes, and then the word order; the determined word order is the recognition result.
  • After pre-training, the multilingual language model possesses prior knowledge of word-order statistics matching the pre-training process, and uses this prior knowledge to output the third probability describing the statistical regularities of word order.
  • The target speech is then decoded based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
  • In one embodiment, decoding the target speech based on the first, second and third probabilities to obtain the recognition result search grid includes: establishing the state network of the target speech according to the first, second and third probabilities, where the nodes in the state network describe speech frames in single states and the edges describe the transition probabilities between speech frames in those single states; and searching the state network with the Viterbi algorithm to obtain the recognition result search grid.
  • In this embodiment, the established state network of the target speech is graph-structured data.
  • Specifically, each speech frame is expanded over the states according to the first probability to obtain the corresponding nodes in the state network.
  • For example, the first probability of speech frame t being in the first state s1 is a1,
  • the first probability of it being in the second state s2 is a2,
  • and the first probability of it being in the third state s3 is a3.
  • Expanding speech frame t over the states then yields three nodes in the state network: t-s1, t-s2 and t-s3.
  • Edges with specific weights are then established between the nodes according to the second probability of mutual transitions between states and the third probability describing word-order statistics, yielding the state network.
  • The Viterbi algorithm is then used to search this state network and score the searched paths; based on the scores, the best path as well as sub-optimal paths can be determined. The top-N paths are selected and saved in a DAG (directed acyclic graph) structure, yielding the graph-structured recognition result search grid. Specifically, after the N paths are selected, all the nodes they contain are picked out, each node is restored to its original position in the state network, and edges with specific weights are likewise re-established between the selected, restored nodes according to the first, second and third probabilities, yielding the recognition result search grid.
  • When searching with the Viterbi algorithm, the following three rules are mainly followed (a sketch follows the rules): 1. If the path with the highest probability passes through a certain point of the state network, then the sub-path from the start to that point must also be the highest-probability path from the start to that point. 2. Assuming there are k states at time i, there are k shortest paths from the start to those k states, and the final shortest path must pass through one of them. 3. From the two points above, when computing the shortest path to the state at time i+1, only the shortest paths from the start to the current k state values and the transitions from the current state values to the state value at time i+1 need to be considered.
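  • The three rules above are the standard dynamic-programming property that the Viterbi algorithm exploits. The following is a minimal sketch over a toy state network; `emit` and `trans` stand in for the first and second probabilities (the third, word-order probability can be folded into the edge weights), and only the single best path is returned rather than the top-N stored in the search grid.

```python
# Minimal Viterbi sketch over the state network described above. emit[t][s]
# plays the role of the first probability, trans[p][s] the second; log
# probabilities are used so that path scores add. Keeping the top-N partial
# paths instead of one would yield the lattice described in the text.
import math


def viterbi(emit, trans, states):
    T = len(emit)
    best = {s: math.log(emit[0][s]) for s in states}   # scores at t = 0
    back = [{} for _ in range(T)]                      # back-pointers
    for t in range(1, T):
        new = {}
        for s in states:
            # Rules 1-3: the best path to (t, s) extends the best path to
            # one of the k states at t-1; only those candidates matter.
            p = max(states, key=lambda q: best[q] + math.log(trans[q][s]))
            new[s] = best[p] + math.log(trans[p][s]) + math.log(emit[t][s])
            back[t][s] = p
        best = new
    s = max(best, key=best.get)                        # best final state
    path = [s]
    for t in range(T - 1, 0, -1):                      # trace back
        s = back[t][s]
        path.append(s)
    return list(reversed(path))
```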
  • In the embodiment of the present disclosure, multiple monolingual language models are pre-trained, and each monolingual language model performs speech recognition only for its corresponding language.
  • After the recognition result search grid is obtained, the multiple monolingual language models are called to rescore it respectively. Rescoring can be seen as searching the search grid again on the basis of a monolingual model's prior knowledge; it proceeds much like decoding, except that the multilingual model's prior knowledge describes word-order statistics in a mixed-language environment, while each monolingual model's prior knowledge describes word-order statistics in the environment of a specific single language.
  • After rescoring, each monolingual language model screens out the most likely recognition result in its corresponding language, i.e., the candidate recognition result of that language. At the same time, the probability that the candidate recognition result is the target recognition result in that language is also determined.
  • For example, three monolingual language models are pre-trained: language model B for Chinese speech recognition,
  • language model C for English speech recognition,
  • and language model D for Japanese speech recognition.
  • After the recognition result search grid is obtained, language model B is called to rescore it, screening out the Chinese candidate recognition result R1 with the highest probability P1; language model C is called to rescore it, screening out the English candidate R2 with the highest probability P2; and language model D is called to rescore it, screening out the Japanese candidate R3 with the highest probability P3.
  • In one embodiment, the monolingual language model is a statistics-based n-gram language model.
  • The advantage of this embodiment is that using a statistics-based n-gram language model as the monolingual model avoids the hardware constraints that a neural-network monolingual model would impose, improving processing speed and thus the recognition efficiency of online, real-time multilingual speech recognition (a rescoring sketch follows).
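  • As an illustration of how a statistics-based n-gram model can rescore paths, here is a minimal sketch assuming the search grid has already been expanded into an N-best list of (words, acoustic log-probability) pairs; the interpolated-bigram smoothing and all weights are illustrative assumptions, not choices prescribed by the disclosure.

```python
# Minimal n-gram rescoring sketch. `nbest` is a list of (words, acoustic_lp)
# pairs expanded from the search grid; `bigram` and `unigram` are probability
# tables of one monolingual model. All weights are illustrative.
import math


def ngram_logprob(words, bigram, unigram, alpha=0.4):
    # Interpolated bigram: P(w2 | w1) = a * P_bi(w2 | w1) + (1 - a) * P_uni(w2)
    lp = 0.0
    for w1, w2 in zip(words, words[1:]):
        p = alpha * bigram.get((w1, w2), 0.0) + (1 - alpha) * unigram.get(w2, 1e-9)
        lp += math.log(p)
    return lp


def rescore(nbest, bigram, unigram, lm_weight=0.8):
    # Replace the multilingual LM score with this monolingual model's score;
    # the best-scoring path becomes this language's candidate result.
    best_words, best_score = None, -math.inf
    for words, acoustic_lp in nbest:
        score = acoustic_lp + lm_weight * ngram_logprob(words, bigram, unigram)
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score
```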
  • In the embodiment of the present disclosure, after the multiple candidate recognition results and their probabilities of being the target recognition result are obtained, the candidates are sorted in descending order of the corresponding probabilities, and the target recognition result is then screened from the top preset number of candidates.
  • The top preset number generally does not include the last-ranked candidate.
  • In one embodiment, screening the target recognition result from the top preset number of candidate recognition results includes: determining the first-ranked candidate recognition result as the target recognition result.
  • In this embodiment, the first-ranked candidate among the top M is determined as the target recognition result; that is,
  • the candidate corresponding to the maximum probability is determined as the target recognition result.
  • M is a preset positive integer greater than 1.
  • For example, the preset M is 2.
  • The obtained candidate recognition results are R1 with probability P1, R2 with probability P2, and R3 with probability P3.
  • P1 is less than P2 and greater than P3, so the top two are R2 followed by R1, and candidate recognition result R2 is determined as the target recognition result.
  • In one embodiment, screening the target recognition result from the top preset number of candidates includes: randomly selecting one candidate from the top preset number of candidates as the target recognition result.
  • In this embodiment, one candidate recognition result is randomly selected from the top M as the target recognition result.
  • M is a preset positive integer greater than 1.
  • For example, the preset M is 3.
  • The obtained candidate recognition results are R1 with probability P1, R2 with probability P2, R3 with probability P3, and R4 with probability P4.
  • P1 is greater than P2, P2 is greater than P3, and P3 is greater than P4, so the top three are R1, R2 and R3, and one of them is randomly selected as the target recognition result.
  • The advantage of this embodiment is that introducing a controlled degree of randomness avoids the repeated occurrence of the same kind of bias while multilingual speech recognition is not yet sufficiently perfected and does not yet cover all kinds of application scenarios (both strategies are sketched below).
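  • Both selection strategies reduce to a few lines; the following sketch shows them side by side, with the `strategy` switch being an illustrative convenience rather than part of the disclosure.

```python
# Minimal sketch of the two selection strategies described above.
# `candidates` is a list of (result, probability) pairs; M > 1 as in the text.
import random


def select_target(candidates, m=2, strategy="top1"):
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_m = ranked[:m]                  # the top preset number of candidates
    if strategy == "top1":
        return top_m[0][0]              # highest-probability candidate
    return random.choice(top_m)[0]      # controlled randomness within top M
```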
  • Referring to FIG. 2, it is a functional block diagram of the multilingual speech recognition device of the present disclosure.
  • The multilingual speech recognition device 100 of the present disclosure may be installed in an electronic device.
  • According to the implemented functions, the multilingual speech recognition device may include an acquisition module 101, a decoding module 102, a rescoring module 103 and a screening module 104.
  • The modules of the present disclosure may also be called units, referring to a series of computer program segments that can be executed by the processor of the electronic device to complete fixed functions, and that are stored in the memory of the electronic device.
  • In this embodiment, the functions of the modules/units are as follows.
  • The acquisition module 101 is configured to obtain the target speech to be recognized.
  • The decoding module 102 is configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain the recognition result search grid of the target speech.
  • The rescoring module 103 is configured to call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech.
  • The screening module 104 is configured to sort the candidate recognition results in descending order of the probabilities and screen the target recognition result from the top preset number of candidate recognition results.
  • Referring to FIG. 3, it is a schematic structural diagram of an electronic device implementing the multilingual speech recognition method of the present disclosure.
  • The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and runnable on the processor 10, such as a multilingual speech recognition program 12.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disks, multimedia cards, card-type memories (for example, SD or DX memories), magnetic memories, magnetic disks, optical discs, etc.
  • In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a mobile hard disk of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1.
  • The memory 11 can be used not only to store application software installed in the electronic device 1 and various kinds of data, such as the code of the multilingual speech recognition program, but also to temporarily store data that has been output or will be output.
  • In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more combinations of central processing units (CPUs), microprocessors, digital processing chips, graphics processors and various control chips.
  • The processor 10 is the control unit of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (such as the multilingual speech recognition program) and calling the data stored in the memory 11.
  • The bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others.
  • The bus can be divided into an address bus, a data bus, a control bus and so on.
  • The bus is configured to implement connection and communication between the memory 11, the at least one processor 10 and other components.
  • FIG. 3 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • For example, although not shown, the electronic device 1 may also include a power source (such as a battery) for supplying power to the components.
  • Preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management and power-consumption management.
  • The power source may also include any components such as one or more DC or AC power supplies, recharging devices, power-failure detection circuits, power converters or inverters and power status indicators.
  • the electronic device 1 may also include a variety of sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • Further, the electronic device 1 may also include a network interface.
  • Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • Optionally, the electronic device 1 may also include a user interface.
  • The user interface may be a display and an input unit (such as a keyboard).
  • Optionally, the user interface may also be a standard wired interface or a wireless interface.
  • Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, etc.
  • The display may also be appropriately called a display screen or a display unit, used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • The multilingual speech recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions. When run in the processor 10, it can implement: obtaining the target speech to be recognized; calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech and obtain the recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the search grid, respectively screening out one candidate recognition result of the corresponding language and determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidates.
  • If the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
  • The embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method in the above embodiments, or the functions of the modules/units of the device in the above embodiments.
  • Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
  • Modules described as separate components may or may not be physically separated, and components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional modules in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A multilingual speech recognition method, device and electronic equipment, relating to the field of speech recognition technology. The multilingual speech recognition method includes: obtaining a target speech to be recognized (S1); calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech (S2); calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech (S3); and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results (S4). This solution can reduce the difficulty of multilingual speech recognition. Likewise, this solution can be applied to the online consultation link in digital healthcare.

Description

Multilingual speech recognition method, device and electronic equipment
This application claims priority to the Chinese patent application with application number 202011119841.6, entitled "Multilingual speech recognition method, device and electronic equipment", filed with the China Patent Office on October 19, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of speech recognition technology, and in particular to a multilingual speech recognition method, device and electronic equipment.
Background
With frequent cultural exchange between countries, users often mix several languages in everyday spoken communication. For example, the utterance "你看到我的pen了吗?" ("Have you seen my pen?") mixes Chinese and English. The inventors found that in the prior art, when recognizing multilingual speech, a language detection model is first used to detect the language categories contained in the speech, and language recognition is then performed based on the detected categories to obtain a recognition result. The accuracy of multilingual speech recognition in the prior art is thus constrained by the accuracy of the language detection model: once the language detection model errs, the speech recognition result errs as well. Therefore, to guarantee the accuracy of speech recognition results, the prior art must train a highly accurate language detection model, which places heavy demands on training data and training time and is quite difficult.
Technical Problem
The present disclosure provides a multilingual speech recognition method, device and electronic equipment, whose main purpose is to reduce the difficulty of multilingual speech recognition.
Technical Solution
To achieve the above purpose, the present disclosure provides a multilingual speech recognition method, including: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
To solve the above problem, the present disclosure also provides a multilingual speech recognition device, the device including: an acquisition module configured to obtain a target speech to be recognized; a decoding module configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; a rescoring module configured to call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech; and a screening module configured to sort the candidate recognition results in descending order of the probabilities and screen the target recognition result from the top preset number of candidate recognition results.
To solve the above problem, the present disclosure also provides an electronic device, the electronic device including: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the following method: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
To solve the above problem, the present disclosure also provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following method: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
Beneficial Effects
The embodiments of the present disclosure circumvent the step of detecting the language category when performing multilingual speech recognition, thereby avoiding the difficulties of training a language detection model and reducing the difficulty of multilingual speech recognition. The solution can be applied in smart government affairs, smart healthcare, or other fields with speech recognition needs, thus promoting the construction of smart cities. Likewise, this solution can be applied to the online consultation link in digital healthcare.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a multilingual speech recognition method provided by an embodiment of the present disclosure.
FIG. 2 is a schematic diagram of the modules of a multilingual speech recognition device provided by an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a multilingual speech recognition method provided by an embodiment of the present disclosure.
The realization of the objectives, functional features and advantages of the present disclosure will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here are only used to explain the present disclosure and are not intended to limit it.
The technical solution of this application can be applied to the fields of artificial intelligence, smart cities, digital healthcare, blockchain and/or big data technology, for example to realize smart government affairs, smart healthcare, and the like. Optionally, the data involved in this application, such as speech, training text and/or recognition results, can be stored in a database or in a blockchain, for example through distributed blockchain storage, which is not limited in this application.
The present disclosure provides a multilingual speech recognition method. Referring to FIG. 1, it is a schematic flowchart of a multilingual speech recognition method provided by an embodiment of the present disclosure. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the multilingual speech recognition method includes the following steps.
Step S1: Obtain a target speech to be recognized.
Step S2: Call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtain a recognition result search grid of the target speech.
Step S3: Call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech.
Step S4: Sort the candidate recognition results in descending order of the probabilities, and screen the target recognition result from the top preset number of candidate recognition results.
The multilingual speech recognition method provided by the embodiments of the present disclosure can be applied in various application scenarios with speech recognition needs, such as intelligent customer service and intelligent robot consultation.
In one embodiment, the multilingual speech recognition method provided by the embodiment of the present disclosure is applied to intelligent customer service. The intelligent customer service system trains and deploys in advance on a cloud server an acoustic model, a multilingual language model, a Chinese language model, an English language model and a Japanese language model. The multilingual language model is a language model trained on text that mixes Chinese, English and Japanese; the Chinese language model is trained using only Chinese training text; the English language model using only English training text; and the Japanese language model using only Japanese training text.
While the user establishes a call with the intelligent customer service system through a client (for example, a mobile phone or a personal computer), the system uploads the user speech collected through the client to the cloud server and calls the acoustic model and the multilingual language model to decode the user speech and obtain the recognition result search grid. The Chinese language model is called to rescore the search grid, screening out the Chinese candidate recognition result R1 according to the search performed during rescoring and determining the probability P1 that R1 is the user's true semantics; in parallel, the English language model is called to rescore the search grid, screening out the English candidate R2 and determining the probability P2 that R2 is the user's true semantics; in parallel, the Japanese language model is called to rescore the search grid, screening out the Japanese candidate R3 and determining the probability P3 that R3 is the user's true semantics.
The intelligent customer service system then sorts the above three candidate recognition results in descending order of P1, P2 and P3, screens the target recognition result from the top two candidates, and determines the user's true semantics.
If the intelligent customer service system determines that candidate recognition result R2 is the user's true semantics, it can further determine the response to R2 according to a preset natural language processing model, convert the response from text to speech, and feed it back to the user in speech form. Thus, even if the user mixes English or Japanese into Chinese during the call, the intelligent customer service system can accurately understand the user's true semantics and respond accurately.
In the embodiment of the present disclosure, the target speech to be recognized is obtained. Specifically, the user's chat speech can be collected through a microphone and used as the target speech to be recognized; the semantics of the chat speech are determined through the subsequent processing of the embodiment, that is, the target recognition result of the target speech is obtained. Further, on the basis of the obtained target recognition result, the chat speech can be translated into speech of a specific language, or converted into text.
The obtained target speech may contain multiple languages. For example, the target speech contains Chinese and English; or Chinese and Japanese; or Chinese, English and Japanese.
In the embodiment of the present disclosure, at least one acoustic model and at least one multilingual language model are pre-trained. After the target speech is obtained, the acoustic model and the multilingual language model are called to decode it and obtain the recognition result search grid of the target speech. The search grid can be regarded as the search space for finding the correct target recognition result, and is a compact data structure containing multiple alternative paths. Specifically, calling the acoustic model and the multilingual language model to decode the target speech means searching the state network of the target speech, scoring the searched paths, and then pruning and filtering the state network according to the scores, so as to obtain a more compact recognition result search grid whose paths are closer to the target recognition result.
The acoustic model is a model that processes the raw audio data and extracts the phoneme information in it. Specifically, in the embodiment of the present disclosure, a Gaussian mixture model with a hidden Markov model (GMM-HMM) can be used as the acoustic model; a deep neural network (DNN) can also be used as the acoustic model.
A multilingual language model is a language model whose training text contains multiple different languages. Since training the multilingual language model is mainly limited to the finite number of languages and the finite number of specific sentences in the training text, when the model is put into actual speech recognition, the recognition result search grid obtained by decoding has relatively high perplexity.
In one embodiment, before calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further includes: performing noise reduction on the target speech to obtain a denoised target speech; and performing feature extraction on the denoised target speech to obtain the speech frame sequence of the target speech used as input to the acoustic model.
In this embodiment, after the target speech is obtained and before it is decoded, noise reduction is performed on it to enhance it. Feature extraction is then performed on the denoised target speech, converting it from a time-domain signal to a frequency-domain signal and obtaining the speech frame sequence used as acoustic model input (for example, the frame-level FBANK features of the target speech).
Understandably, before decoding the target speech, in addition to noise reduction, distortion removal or other speech preprocessing can also be performed on it, to further improve the overall accuracy of multilingual speech recognition.
In one embodiment, the multilingual language model is pre-trained by the following method, including: obtaining a first training text corresponding to a first language and a second training text corresponding to a second language; inputting the first and second training texts together into the multilingual language model to obtain the first recognition result of the first training text and the second recognition result of the second training text output by the model; determining the recognition error of the multilingual language model based on the first and second recognition results; and adjusting the parameters of the multilingual language model by backpropagating the recognition error until the recognition error is less than a preset error threshold.
In this embodiment, training text containing two languages is used to pre-train the multilingual language model.
Specifically, the first training text is in the first language, the second training text is in the second language, and the first and second languages are different. On the premise of keeping the semantics of each sentence intact, the first and second training texts are mixed together and input into the multilingual language model; the model then processes the input text with its current parameters and outputs the first recognition result of the first training text and the second recognition result of the second training text. At this point, both the first and the second recognition results output by the multilingual language model may contain errors.
Since the correct recognition results of the first and second training texts are both determined in advance, the recognition error of the multilingual language model can be determined based on the first and second recognition results it outputs. A backpropagation algorithm is then used to adjust the parameters of the model in order from back to front. The first and second training texts are mixed together again and input into the model, looping continuously until the recognition error of the model is less than the preset error threshold.
For example, the preset error threshold is 5%. A certain amount of Chinese training text and a certain amount of English training text are collected in advance, and the sentences of the Chinese training text are randomly inserted, sentence by sentence, into the English training text, yielding mixed Chinese-English training text.
The mixed training text is input into a language model A to be trained, yielding the first recognition result of the Chinese part and the second recognition result of the English part output by language model A; the recognition error of language model A is determined on this basis, and its parameters are adjusted by backpropagating the recognition error.
The mixed training text is then input into the adjusted language model A again, looping continuously until the recognition error of language model A is less than 5%, yielding a pre-trained multilingual language model A able to recognize mixed Chinese-English text with a certain accuracy.
It should be noted that this embodiment only exemplarily shows the process of pre-training a multilingual language model. Understandably, according to the application requirements of the scenario, training texts of three or more languages can be mixed together when pre-training the multilingual language model. This embodiment should not limit the function and scope of use of the present disclosure.
In one embodiment, calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech includes: calling the acoustic model with the speech frame sequence of the target speech as input and, for each speech frame, outputting the first probability of the frame corresponding to each state and the second probability of mutual transitions between the states; obtaining the third probability, produced by the pre-trained multilingual language model, that describes the statistical regularities of word order; and decoding the target speech based on the first, second and third probabilities to obtain the recognition result search grid.
In this embodiment, decoding mainly refers to obtaining, through search, the probability of each path in a search space composed of knowledge sources such as the acoustic model, the acoustic context and the language model, so that the best path or the top-N paths can be selected according to the probabilities. Each path corresponds to one recognition result; N is a preset positive integer.
Specifically, the acoustic model integrates knowledge of acoustics and phonetics, takes the speech frame sequence of the target speech as input and, for each speech frame in the sequence, outputs the first probability of the frame corresponding to each state and the second probability of mutual transitions between the frame's states. That is, the first probability describes the probability between frame and state; the second probability describes the probability between state and state. A state is a speech unit of finer granularity than a phoneme in speech processing; one phoneme corresponds to multiple states. The essence of speech recognition is to determine the state corresponding to each speech frame, then the phonemes, then the word order; the determined word order is the recognition result.
After pre-training, the multilingual language model possesses prior knowledge of word-order statistics matching the pre-training process, and uses this prior knowledge to output the third probability describing the statistical regularities of word order.
The target speech is then decoded based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
In one embodiment, decoding the target speech based on the first, second and third probabilities to obtain the recognition result search grid includes: establishing the state network of the target speech according to the first, second and third probabilities, where the nodes in the state network describe speech frames in single states and the edges describe the transitions between speech frames in the single states; and searching the state network with the Viterbi algorithm to obtain the recognition result search grid.
In this embodiment, the established state network of the target speech is graph-structured data. Specifically, each speech frame is expanded over the states according to the first probability, yielding the corresponding nodes in the state network. For example, the first probability of speech frame t being in the first state s1 is a1, in the second state s2 is a2, and in the third state s3 is a3. Expanding frame t over the states then yields three nodes in the state network: t-s1, t-s2 and t-s3.
Edges with specific weights are then established between the nodes according to the second probability of mutual transitions between states and the third probability describing the statistical regularities of word order, yielding the state network.
The Viterbi algorithm is then used to search this state network and score the searched paths. Based on the scores, the best path, and also sub-optimal paths, can be determined. The top-N paths are then selected and saved in a DAG (directed acyclic graph) structure, yielding the graph-structured recognition result search grid. Specifically, after the N paths are selected, all the nodes they contain are picked out, each node is restored to its original position in the state network, and edges with specific weights are likewise re-established between the selected and restored nodes according to the first, second and third probabilities, yielding the recognition result search grid.
When searching with the Viterbi algorithm, the following three rules are mainly followed: 1. If the path with the highest probability passes through a certain point of the state network, then the sub-path from the start to that point must also be the highest-probability path from the start to that point. 2. Assuming there are k states at time i, there are k shortest paths from the start to the k states at time i, and the final shortest path must pass through one of them. 3. According to the above two points, when calculating the shortest path to the state at time i+1, only the shortest paths from the start to the current k state values and the shortest path from the current state values to the state value at time i+1 need to be considered.
In the embodiment of the present disclosure, multiple monolingual language models are pre-trained, each performing speech recognition only for its corresponding language. After the recognition result search grid is obtained, the monolingual language models are called to rescore it respectively. From the above description, decoding the target speech searches the state network of the target speech on the basis of the multilingual language model's prior knowledge to obtain the search grid; rescoring by a monolingual language model can thus be seen as searching the search grid again on the basis of that monolingual model's prior knowledge to obtain the corresponding candidate recognition result. The implementation of rescoring is roughly similar to that of decoding; the main difference is that the multilingual model's prior knowledge describes word-order statistics in an environment where multiple languages are mixed together, while a monolingual model's prior knowledge describes word-order statistics in the environment of a specific single language.
After rescoring, each monolingual language model screens out the most likely recognition result in its corresponding language, i.e., the candidate recognition result of that language. At the same time, the probability that the candidate recognition result is the target recognition result in that language is also determined.
For example, three monolingual language models are pre-trained: language model B specifically for Chinese speech recognition, language model C specifically for English speech recognition, and language model D specifically for Japanese speech recognition. After the recognition result search grid is obtained, language model B is called to rescore it, screening out the Chinese candidate recognition result R1 with the highest probability P1; language model C is called to rescore it, screening out the English candidate R2 with the highest probability P2; and language model D is called to rescore it, screening out the Japanese candidate R3 with the highest probability P3.
In one embodiment, the monolingual language model is a statistics-based n-gram language model.
The advantage of this embodiment is that using a statistics-based n-gram language model as the monolingual model avoids the constraints that a neural-network monolingual model would place on the machine, improving processing speed and thus the recognition efficiency of online, real-time multilingual speech recognition.
In the embodiment of the present disclosure, after the multiple candidate recognition results and the corresponding probabilities of being the target recognition result are obtained, the candidates are sorted in descending order of the corresponding probabilities, and the target recognition result is then screened from the top preset number of candidates. The top preset number generally does not include the last-ranked position.
In one embodiment, screening the target recognition result from the top preset number of candidate recognition results includes: determining the first-ranked candidate recognition result as the target recognition result.
In this embodiment, after the candidates are sorted in descending order of probability, the first-ranked candidate among the top M is determined as the target recognition result; that is, the candidate corresponding to the maximum probability is determined as the target recognition result. M is a preset positive integer greater than 1.
For example, the preset M is 2. The obtained candidate recognition results are R1 with probability P1, R2 with probability P2 and R3 with probability P3, where P1 is less than P2 and P1 is greater than P3. The top two are then R2 followed by R1, and candidate recognition result R2 is determined as the target recognition result.
In one embodiment, screening the target recognition result from the top preset number of candidates includes: randomly selecting one candidate from the top preset number of candidates as the target recognition result.
In this embodiment, after the candidates are sorted in descending order of probability, one candidate is randomly selected from the top M as the target recognition result. M is a preset positive integer greater than 1.
For example, the preset M is 3. The obtained candidate recognition results are R1 with probability P1, R2 with probability P2, R3 with probability P3 and R4 with probability P4, where P1 is greater than P2, P2 is greater than P3, and P3 is greater than P4. The top three are then R1, R2 and R3, and one of them is randomly selected as the target recognition result.
The advantage of this embodiment is that introducing a controlled degree of randomness avoids the continued occurrence of the same kind of bias while multilingual speech recognition is not yet sufficiently perfected and does not yet cover all kinds of application scenarios.
As shown in FIG. 2, it is a functional block diagram of the multilingual speech recognition device of the present disclosure.
The multilingual speech recognition device 100 of the present disclosure may be installed in an electronic device. According to the implemented functions, the multilingual speech recognition device may include an acquisition module 101, a decoding module 102, a rescoring module 103 and a screening module 104. The modules of the present disclosure may also be called units, referring to a series of computer program segments that can be executed by the processor of the electronic device to complete fixed functions, and that are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows.
The acquisition module 101 is configured to obtain the target speech to be recognized.
The decoding module 102 is configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain the recognition result search grid of the target speech.
The rescoring module 103 is configured to call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech.
The screening module 104 is configured to sort the candidate recognition results in descending order of the probabilities and screen the target recognition result from the top preset number of candidate recognition results.
Specifically, for the functions implemented by the functional modules of the multilingual speech recognition device 100, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which will not be repeated here.
As shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing the multilingual speech recognition method of the present disclosure.
The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and runnable on the processor 10, such as a multilingual speech recognition program 12.
The memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disks, multimedia cards, card-type memories (for example, SD or DX memories), magnetic memories, magnetic disks, optical discs, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed in the electronic device 1 and various kinds of data, such as the code of the multilingual speech recognition program, but also to temporarily store data that has been output or will be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more combinations of central processing units (CPUs), microprocessors, digital processing chips, graphics processors and various control chips. The processor 10 is the control unit of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the multilingual speech recognition program) and calling the data stored in the memory 11.
The bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others. The bus can be divided into an address bus, a data bus, a control bus, etc. The bus is configured to implement connection and communication between the memory 11, the at least one processor 10 and other components.
FIG. 3 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may also include a power source (such as a battery) supplying power to the components; preferably, the power source may be logically connected to the at least one processor 10 through a power management device, which implements functions such as charge management, discharge management and power-consumption management. The power source may also include any components such as one or more DC or AC power supplies, recharging devices, power-failure detection circuits, power converters or inverters and power status indicators. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, etc., which will not be repeated here.
Further, the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), usually used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may also include a user interface; the user interface may be a display and an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, etc. The display may also be appropriately called a display screen or a display unit, used to display the information processed in the electronic device 1 and to display a visualized user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
The multilingual speech recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions; when run in the processor 10, it can implement: obtaining a target speech to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech and obtain a recognition result search grid of the target speech; calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech; and sorting the candidate recognition results in descending order of the probabilities and screening the target recognition result from the top preset number of candidate recognition results.
Specifically, for the specific implementation method of the above instructions by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which will not be repeated here.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
The embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method in the above embodiments, or implements the functions of the modules/units of the device in the above embodiments, which will not be repeated here. Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed equipment, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the modules is only a logical functional division, and there may be other ways of division in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
For those skilled in the art, it is obvious that the present disclosure is not limited to the details of the above exemplary embodiments, and that the present disclosure can be realized in other specific forms without departing from its spirit or basic features.
Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive. The scope of the present disclosure is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalent elements of the claims be included in the present disclosure. No reference sign in the claims should be regarded as limiting the claim concerned.
Furthermore, it is clear that the word "comprise" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in the system claims may also be implemented by one unit or device through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present disclosure and not to limit it. Although the present disclosure has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present disclosure can be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present disclosure.

Claims (20)

  1. A multilingual speech recognition method, wherein the method comprises:
    obtaining a target speech to be recognized;
    calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtaining a recognition result search grid of the target speech;
    calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech;
    sorting the candidate recognition results in descending order of the probabilities, and screening the target recognition result from the top preset number of candidate recognition results.
  2. The method according to claim 1, wherein before calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further comprises:
    performing noise reduction on the target speech to obtain a denoised target speech;
    performing feature extraction on the denoised target speech to obtain a speech frame sequence of the target speech used as input to the acoustic model.
  3. The method according to claim 1, wherein the multilingual language model is pre-trained by the following method, the method comprising:
    obtaining a first training text corresponding to a first language and a second training text corresponding to a second language;
    inputting the first training text and the second training text together into the multilingual language model to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multilingual language model;
    determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
    adjusting parameters of the multilingual language model by backpropagating the recognition error until the recognition error is less than a preset error threshold.
  4. The method according to claim 1, wherein calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech comprises:
    calling the acoustic model with a speech frame sequence of the target speech as input and, for each speech frame, outputting a first probability of the speech frame corresponding to each state and a second probability of mutual transitions between the states;
    obtaining a third probability, obtained by the multilingual language model through pre-training, for describing statistical regularities of word order;
    decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
  5. The method according to claim 4, wherein decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid comprises:
    establishing a state network of the target speech according to the first probability, the second probability and the third probability, wherein nodes in the state network are used to describe speech frames in single states, and edges in the state network are used to describe transition probabilities between speech frames in the single states;
    searching the state network by using the Viterbi algorithm to obtain the recognition result search grid.
  6. The method according to claim 1, wherein screening the target recognition result from the top preset number of candidate recognition results comprises: determining the first-ranked candidate recognition result as the target recognition result.
  7. The method according to claim 1, wherein screening the target recognition result from the top preset number of candidate recognition results comprises: randomly selecting one candidate recognition result from the top preset number of candidate recognition results as the target recognition result.
  8. A multilingual speech recognition device, wherein the device comprises:
    an acquisition module configured to obtain a target speech to be recognized;
    a decoding module configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtain a recognition result search grid of the target speech;
    a rescoring module configured to call multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screen out one candidate recognition result of the corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target speech;
    a screening module configured to sort the candidate recognition results in descending order of the probabilities, and screen the target recognition result from the top preset number of candidate recognition results.
  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and,
    a memory communicatively connected to the at least one processor; wherein,
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following method:
    obtaining a target speech to be recognized;
    calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtaining a recognition result search grid of the target speech;
    calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech;
    sorting the candidate recognition results in descending order of the probabilities, and screening the target recognition result from the top preset number of candidate recognition results.
  10. The electronic device according to claim 9, wherein before calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the processor is further configured to perform:
    performing noise reduction on the target speech to obtain a denoised target speech;
    performing feature extraction on the denoised target speech to obtain a speech frame sequence of the target speech used as input to the acoustic model.
  11. The electronic device according to claim 9, wherein the multilingual language model is pre-trained by the following method, the processor being further configured to perform:
    obtaining a first training text corresponding to a first language and a second training text corresponding to a second language;
    inputting the first training text and the second training text together into the multilingual language model to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multilingual language model;
    determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
    adjusting parameters of the multilingual language model by backpropagating the recognition error until the recognition error is less than a preset error threshold.
  12. The electronic device according to claim 9, wherein when calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the following is specifically performed:
    calling the acoustic model with a speech frame sequence of the target speech as input and, for each speech frame, outputting a first probability of the speech frame corresponding to each state and a second probability of mutual transitions between the states;
    obtaining a third probability, obtained by the multilingual language model through pre-training, for describing statistical regularities of word order;
    decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
  13. The electronic device according to claim 12, wherein when decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid, the following is specifically performed:
    establishing a state network of the target speech according to the first probability, the second probability and the third probability, wherein nodes in the state network are used to describe speech frames in single states, and edges in the state network are used to describe transition probabilities between speech frames in the single states;
    searching the state network by using the Viterbi algorithm to obtain the recognition result search grid.
  14. The electronic device according to claim 9, wherein when screening the target recognition result from the top preset number of candidate recognition results, the following is specifically performed:
    determining the first-ranked candidate recognition result as the target recognition result; or,
    randomly selecting one candidate recognition result from the top preset number of candidate recognition results as the target recognition result.
  15. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the following method is implemented:
    obtaining a target speech to be recognized;
    calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech, and obtaining a recognition result search grid of the target speech;
    calling multiple pre-trained monolingual language models to respectively rescore the recognition result search grid, respectively screening out one candidate recognition result of the corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target speech;
    sorting the candidate recognition results in descending order of the probabilities, and screening the target recognition result from the top preset number of candidate recognition results.
  16. The computer-readable storage medium according to claim 15, wherein before calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the computer program, when executed by the processor, is further used to implement:
    performing noise reduction on the target speech to obtain a denoised target speech;
    performing feature extraction on the denoised target speech to obtain a speech frame sequence of the target speech used as input to the acoustic model.
  17. The computer-readable storage medium according to claim 15, wherein the multilingual language model is pre-trained by the following method, the computer program, when executed by the processor, being further used to implement:
    obtaining a first training text corresponding to a first language and a second training text corresponding to a second language;
    inputting the first training text and the second training text together into the multilingual language model to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multilingual language model;
    determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
    adjusting parameters of the multilingual language model by backpropagating the recognition error until the recognition error is less than a preset error threshold.
  18. The computer-readable storage medium according to claim 15, wherein when calling the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the following is specifically implemented:
    calling the acoustic model with a speech frame sequence of the target speech as input and, for each speech frame, outputting a first probability of the speech frame corresponding to each state and a second probability of mutual transitions between the states;
    obtaining a third probability, obtained by the multilingual language model through pre-training, for describing statistical regularities of word order;
    decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
  19. The computer-readable storage medium according to claim 18, wherein when decoding the target speech based on the first probability, the second probability and the third probability to obtain the recognition result search grid, the following is specifically implemented:
    establishing a state network of the target speech according to the first probability, the second probability and the third probability, wherein nodes in the state network are used to describe speech frames in single states, and edges in the state network are used to describe transition probabilities between speech frames in the single states;
    searching the state network by using the Viterbi algorithm to obtain the recognition result search grid.
  20. The computer-readable storage medium according to claim 15, wherein when screening the target recognition result from the top preset number of candidate recognition results, the following is specifically implemented:
    determining the first-ranked candidate recognition result as the target recognition result; or,
    randomly selecting one candidate recognition result from the top preset number of candidate recognition results as the target recognition result.
PCT/CN2020/134543 2020-10-19 2020-12-08 Multilingual speech recognition method, device and electronic equipment WO2021179701A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011119841.6A CN112185348B (zh) 2020-10-19 2020-10-19 Multilingual speech recognition method, device and electronic equipment
CN202011119841.6 2020-10-19

Publications (1)

Publication Number Publication Date
WO2021179701A1 true WO2021179701A1 (zh) 2021-09-16

Family

ID=73949726

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134543 WO2021179701A1 (zh) 2020-10-19 2020-12-08 Multilingual speech recognition method, device and electronic equipment

Country Status (2)

Country Link
CN (1) CN112185348B (zh)
WO (1) WO2021179701A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114927135A (zh) * 2022-07-22 2022-08-19 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN115132182A (zh) * 2022-05-24 2022-09-30 腾讯科技(深圳)有限公司 Data identification method, apparatus and device, and readable storage medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885336B (zh) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition methods and devices for a speech recognition system, and electronic device
CN113077793B (zh) * 2021-03-24 2023-06-13 北京如布科技有限公司 Speech recognition method, apparatus, device and storage medium
CN113223506B (zh) * 2021-05-28 2022-05-20 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method
CN113380224A (zh) * 2021-06-04 2021-09-10 北京字跳网络技术有限公司 Language determination method, apparatus, electronic device and storage medium
WO2022266825A1 (zh) * 2021-06-22 2022-12-29 华为技术有限公司 Speech processing method, apparatus and system
CN113380227A (zh) * 2021-07-23 2021-09-10 上海才历网络有限公司 Neural-network-based language identification method, apparatus and electronic device
CN113689888A (zh) * 2021-07-30 2021-11-23 浙江大华技术股份有限公司 Abnormal sound classification method, system, apparatus and storage medium
CN114078468B (zh) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Multilingual speech recognition method, apparatus, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018087935A (ja) * 2016-11-30 2018-06-07 日本電信電話株式会社 Spoken language identification device, method, and program
CN109817213A (zh) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 Method, apparatus and device for speech recognition with language adaptation
CN110895932A (zh) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multilingual speech recognition method based on collaborative classification of language type and speech content
CN111369978A (zh) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and apparatus, and apparatus for data processing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2845019B2 (ja) * 1992-03-31 1999-01-13 三菱電機株式会社 Similarity calculation device
CN109117484B (zh) * 2018-08-13 2019-08-06 北京帝派智能科技有限公司 Speech translation method and speech translation device
CN110569830B (zh) * 2019-08-01 2023-08-22 平安科技(深圳)有限公司 Multilingual text recognition method and apparatus, computer device and storage medium
CN111667817A (zh) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Speech recognition method, apparatus, computer system and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018087935A (ja) * 2016-11-30 2018-06-07 日本電信電話株式会社 Spoken language identification device, method, and program
CN110895932A (zh) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multilingual speech recognition method based on collaborative classification of language type and speech content
CN111369978A (zh) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and apparatus, and apparatus for data processing
CN109817213A (zh) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 Method, apparatus and device for speech recognition with language adaptation
CN110491382A (zh) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Artificial-intelligence-based speech recognition method and apparatus, and voice interaction device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132182A (zh) * 2022-05-24 2022-09-30 腾讯科技(深圳)有限公司 Data identification method, apparatus and device, and readable storage medium
CN115132182B (zh) * 2022-05-24 2024-02-23 腾讯科技(深圳)有限公司 Data identification method, apparatus and device, and readable storage medium
CN114927135A (zh) * 2022-07-22 2022-08-19 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN114927135B (zh) * 2022-07-22 2022-12-13 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium

Also Published As

Publication number Publication date
CN112185348A (zh) 2021-01-05
CN112185348B (zh) 2024-05-03

Similar Documents

Publication Publication Date Title
WO2021179701A1 (zh) 多语种语音识别方法、装置及电子设备
US11740863B2 (en) Search and knowledge base question answering for a voice user interface
US10176804B2 (en) Analyzing textual data
CN108847241B (zh) 将会议语音识别为文本的方法、电子设备及存储介质
CN106649742B (zh) 数据库维护方法和装置
US9672817B2 (en) Method and apparatus for optimizing a speech recognition result
US11823678B2 (en) Proactive command framework
TWI666558B (zh) 語意分析方法、語意分析系統及非暫態電腦可讀取媒體
CN108806671B (zh) 语义分析方法、装置及电子设备
TWI662425B (zh) 一種自動生成語義相近句子樣本的方法
US20110137653A1 (en) System and method for restricting large language models
WO2022121251A1 (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
CN112001175A (zh) 流程自动化方法、装置、电子设备及存储介质
US10504512B1 (en) Natural language speech processing application selection
CN107451119A (zh) 基于语音交互的语义识别方法及装置、存储介质、计算机设备
CN110473527B (zh) 一种语音识别的方法和系统
CN111192570A (zh) 语言模型训练方法、系统、移动终端及存储介质
CN111241820A (zh) 不良用语识别方法、装置、电子装置及存储介质
JP2020086332A (ja) キーワード抽出装置、キーワード抽出方法、およびプログラム
CN113450805B (zh) 基于神经网络的自动语音识别方法、设备及可读存储介质
KR101095864B1 (ko) 연속 숫자의 음성 인식에 있어서 혼동행렬과 신뢰도치 기반의 다중 인식후보 생성 장치 및 방법
CN116842168B (zh) 跨领域问题处理方法、装置、电子设备及存储介质
TWI837596B (zh) 中文相似音別字校正方法及系統
CN113763939B (zh) 基于端到端模型的混合语音识别系统及方法
CN110942775B (zh) 数据处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20924686

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20924686

Country of ref document: EP

Kind code of ref document: A1