CN113223522A - Speech recognition method, apparatus, device and storage medium - Google Patents

Speech recognition method, apparatus, device and storage medium Download PDF

Info

Publication number
CN113223522A
Authority
CN
China
Prior art keywords
information
text content
region
language model
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110455740.4A
Other languages
Chinese (zh)
Other versions
CN113223522B (en)
Inventor
赵银楼
张辽
蒋正翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110455740.4A priority Critical patent/CN113223522B/en
Publication of CN113223522A publication Critical patent/CN113223522A/en
Application granted granted Critical
Publication of CN113223522B publication Critical patent/CN113223522B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142: Hidden Markov Models [HMMs]
    • G10L15/144: Training of HMMs
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a speech recognition method, apparatus, device, and storage medium, relating to the field of computer technology, and in particular to speech recognition, deep learning, and the like. The speech recognition method includes: acquiring input information, where the input information includes voice information and current region information; obtaining recognition content to be recognized based on the voice information; and processing the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information. The method and apparatus can improve the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and storage medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to speech recognition and deep learning, and more particularly to a speech recognition method, apparatus, device, and storage medium.
Background
With the development of science and technology, speech recognition is gradually being applied across industries. Recognizing addresses in map applications is an important application scenario for speech recognition. Map applications face the problem of homophonic but distinct addresses: for example, the address "Lingjing Hutong" in Beijing and a homophonic address in Shenyang are pronounced identically but written differently, and denote different places.
In the related art, speech recognition is performed based on the voice information alone, and the several homophonic text contents are all output as candidate results (for example, both the Beijing and the Shenyang written forms of "lingjinghutong" are offered), leaving the user to select one of the candidates manually.
Disclosure of Invention
The present disclosure provides a voice recognition method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a speech recognition method, including: acquiring input information, where the input information includes voice information and current region information; obtaining recognition content to be recognized based on the voice information; and processing the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
According to another aspect of the present disclosure, there is provided a speech recognition apparatus, including: an obtaining module configured to obtain input information, where the input information includes voice information and current region information; a recognition module configured to obtain recognition content to be recognized based on the voice information; and a first processing module configured to process the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
The technical solution of the present disclosure can improve the accuracy of speech recognition.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
fig. 7 is a schematic diagram of an electronic device for implementing any one of the speech recognition methods of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, multiple candidate results can be generated for the user to choose from when addresses are homophonic, but the candidates are numerous and insufficiently precise, which adds work for the user and degrades the user experience.
In order to improve the accuracy of speech recognition, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a speech recognition method including:
101. Acquire input information, where the input information includes: voice information and current region information.
102. Obtain recognition content to be recognized based on the voice information.
103. Process the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
The voice information may be input by the user in spoken form. Referring to fig. 2 and taking a map application as an example, the user may speak an utterance into the map application, for example one pronounced "lingjinghutong", which could correspond to either of the homophonic written addresses.
The speech recognition method of this embodiment is implemented on the server side; the server is taken to be the cloud in fig. 2. It is understood that a terminal with sufficiently high processing capability may also perform the method itself.
As shown in fig. 2, after the user inputs the voice information into the map application, the map application sends the voice information to the cloud for speech recognition. In addition, the terminal hosting the map application, such as a mobile phone, may use a positioning device, such as the Global Positioning System (GPS), to obtain the geographic location where the user is currently located, that is, the current region information, and send it to the cloud; the current region information is, for example, "Beijing".
After receiving the voice information and the current region information sent by the client, the cloud end can perform voice recognition processing based on the voice information and the current region information.
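For illustration only, the following Python sketch shows how a client terminal might package the voice information and the current region information into one request to the cloud. The endpoint URL and the JSON field names are hypothetical assumptions, not something specified by the present disclosure.

```python
import base64
import json
import urllib.request

def send_recognition_request(audio_bytes: bytes, region: str, url: str) -> dict:
    """Bundle voice information and current region information into one
    request, mirroring the input information described above. The URL and
    field names are illustrative assumptions."""
    payload = json.dumps({
        "voice": base64.b64encode(audio_bytes).decode("ascii"),
        "region": region,  # e.g. "Beijing", obtained via GPS on the terminal
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # e.g. recognized text returned by the cloud
```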
In some embodiments, obtaining the recognition content to be recognized based on the voice information may include: processing the voice information with an acoustic model to obtain acoustic model output information, and determining the acoustic model output information as the recognition content to be recognized.
Speech recognition generally involves an acoustic model and a language model. The acoustic model converts voice information into acoustic model output information; depending on the acoustic model's modeling unit, the output information may be syllables, phonemes, and so on. Taking Chinese as an example, the acoustic model output information is, for example, the pinyin of Chinese characters. The language model converts the acoustic model output information into text content, for example "lingjinghutong" into the written address "Lingjing Hutong".
The acoustic Model is, for example, a Hidden Markov Model (HMM) acoustic Model, and the language Model is, for example, an N-gram language Model, a Recurrent Neural Network (RNN) language Model, or the like.
Processing the voice information with an acoustic model to obtain the acoustic model output information follows the general speech recognition flow, which helps ensure recognition accuracy.
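As a minimal sketch of this two-stage flow (under assumed "transcribe" and "decode" interfaces that the disclosure does not define), the pipeline can be expressed as:

```python
from typing import List, Protocol, Tuple

class AcousticModel(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LanguageModel(Protocol):
    def decode(self, units: str) -> List[Tuple[str, float]]: ...

def recognize(audio: bytes, acoustic_model: AcousticModel,
              language_model: LanguageModel) -> List[Tuple[str, float]]:
    """Two-stage flow described above: the acoustic model maps speech to
    its modeling units (e.g. pinyin for Chinese), and the language model
    maps those units to candidate text contents with probability values."""
    units = acoustic_model.transcribe(audio)  # e.g. "lingjinghutong"
    return language_model.decode(units)       # e.g. [("灵境胡同", 0.9), ...]
```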
The language models employed in general speech recognition techniques may be referred to as generic language models.
The language models in the embodiments of the present disclosure include a plurality of region language models. Specifically, a region language model may be constructed for each major city; for example, the region language models may include one corresponding to Beijing and one corresponding to Shenyang. Each region language model may use region-related information, for example the addresses of the corresponding city, as a training corpus, and the model for each city is generated by training on that corpus.
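A toy sketch of such per-city training is given below; it uses a character-bigram count model as a stand-in for whatever n-gram or RNN language model a real system would train, and the two small address lists are illustrative only.

```python
from collections import Counter, defaultdict

def train_region_lm(addresses):
    """Train a toy character-bigram model on one city's address corpus.
    A stand-in for the region language models described above; production
    systems would use an n-gram toolkit or an RNN language model."""
    counts = defaultdict(Counter)
    for address in addresses:
        chars = ["<s>"] + list(address) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    # Normalize counts into conditional probabilities P(cur | prev).
    return {prev: {cur: n / sum(c.values()) for cur, n in c.items()}
            for prev, c in counts.items()}

# Tiny illustrative corpora; real training would use each city's full address list.
region_lms = {
    "Beijing": train_region_lm(["灵境胡同", "西单北大街"]),
    "Shenyang": train_region_lm(["中街路", "太原街"]),
}
```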
Assuming the current region information is Beijing, processing may be performed with the region language model corresponding to Beijing; for example, "lingjinghutong" is processed to obtain the corresponding text content "Lingjing Hutong".
In order to improve generalization capability, the language models of the embodiments of the present disclosure may further include a general language model, that is, in the embodiments of the present disclosure, the language models may include a general language model and a region language model.
In some embodiments, the text content obtained with the region language model is a first text content, and the method may further include: processing the recognition content with a general language model to obtain second text content corresponding to the voice information.
By also processing the recognition content with the general language model, generalization capability can be improved.
Further, the method may also include: determining a first score corresponding to the first text content and a second score corresponding to the second text content; and selecting the first text content or the second text content based on the first score and the second score, and determining the selected text content as the speech recognition result. The speech recognition result may also be called the final result; corresponding operations may be performed according to it. For example, if the voice information corresponds to the pronunciation "lingjinghutong" and the current region information is Beijing, the information for the Beijing address "Lingjing Hutong" is displayed.
As shown in fig. 3, the speech information is first processed by the acoustic model to obtain the acoustic model output information, such as the pinyin of Chinese characters. The acoustic model output information is then input into the region language model and the general language model respectively. The text content output by the region language model is referred to as the first text content, and the text content output by the general language model as the second text content.
After the first text content and the second text content are obtained, a selection module may be used to choose between them to obtain the speech recognition result.
In some embodiments, a first score corresponding to the first text content and a second score corresponding to the second text content may be determined; and selecting the first text content or the second text content based on the first score and the second score, and determining the selected text content as a voice recognition result. Specifically, the text content with the highest score may be determined as the voice recognition result according to the score.
For example, the acoustic model output information is "lingjinghutong"; the first text content obtained from the region language model corresponding to Beijing is the Beijing address "Lingjing Hutong", and the second text content obtained from the general language model includes both the Beijing "Lingjing Hutong" and its homophone. Scores of the respective text contents may then be determined; assuming the first text content "Lingjing Hutong" scores 0.9, the matching second text content scores 0.8, and the homophone scores 0.1, the speech recognition result is the Beijing "Lingjing Hutong".
By determining the final speech recognition result based on the score of the text content, the accuracy of the speech recognition result may be improved.
The score of each text content may be determined from the corresponding language model. When a language model processes the acoustic model output information, it obtains not only the text content corresponding to that output information but also a probability value for the text content, indicating the probability that the obtained text content is the true transcription.
For the second text content, that is, the text content output by the general language model, the probability value determined by the general language model may be used directly as the score. For example, if the general language model assigns a probability of 0.8 to the Beijing "Lingjing Hutong" and 0.1 to its homophone, their scores are 0.8 and 0.1 respectively.
For the first text content, that is, the text content output by the region language model, in addition to the probability value corresponding to the text content determined by the region language model, the score of the text content may also be determined by combining the weight of the region language model. That is, the determining the first score corresponding to the first text content may include: and determining a first score corresponding to the first text content based on the weight of the region language model corresponding to the current region information and the probability value corresponding to the first text content determined by the region language model corresponding to the current region information.
In particular, the first score may be the product of the weight and the probability value. For example, if the region language model corresponding to Beijing assigns a probability of 0.9 to the first text content "Lingjing Hutong" and the weight of that model is 1, the first score is 0.9; alternatively, if the weight of the Beijing model is 0.9, the first score is 0.81.
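The selection described here can be sketched as follows; the candidate labels and the scores reuse the example above, and the function signature is an illustrative assumption rather than the patent's actual selection module.

```python
from typing import List, Tuple

def select_result(regional: List[Tuple[str, float]],
                  general: List[Tuple[str, float]],
                  region_weight: float) -> Tuple[str, float]:
    """Score each regional candidate as weight * probability, use the
    general language model's probability directly as the score for general
    candidates, and return the highest-scoring text content."""
    scored = [(text, region_weight * prob) for text, prob in regional]
    scored += general
    return max(scored, key=lambda item: item[1])

# Numbers from the example above; the regional candidate wins.
best = select_result(
    regional=[("Lingjing Hutong (Beijing)", 0.9)],
    general=[("Lingjing Hutong (Beijing)", 0.8), ("homophone variant", 0.1)],
    region_weight=1.0,
)
print(best)  # ('Lingjing Hutong (Beijing)', 0.9)
```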
Determining the score of the first text content based on the weight of the region language model allows more accurate text content to be selected. In addition, since the highest-scoring text content is generally unique, a single accurate result can be provided to the user; compared with offering several candidate results, this reduces the user's burden and improves the user experience.
Further, there may be a plurality of region language models, and weights may be assigned to the different region language models based on the current region information.
In some embodiments, the region language models further include region language models corresponding to region information other than the current region information, and the method further includes: assigning, based on the current region information, a first weight to the region language model corresponding to the current region information, and a second weight to the region language models corresponding to the other region information, where the first weight is greater than the second weight.
By assigning the weight to the region language model based on the current region information, a higher weight can be assigned to the region language model corresponding to the current region information to obtain the text content corresponding to the current region.
For example, the region language models include one corresponding to Beijing and one corresponding to Shenyang; if the current region information is Beijing, the Beijing model may be assigned a weight of 1 and the Shenyang model a weight of 0.
It is understood that the weights are not limited to 0 and 1. In general, the weight assigned to each region language model is a value in [0, 1], and the weights of all region language models sum to 1. For example, if the current region information is Beijing, the Beijing model may be assigned a weight of 0.9 and the Shenyang model a weight of 0.1.
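A minimal sketch of such a weight assignment is shown below, assuming the simple policy of giving the current region a fixed first weight and splitting the remainder evenly; the 0.9 default is illustrative.

```python
def assign_region_weights(regions, current_region, first_weight=0.9):
    """Give the region language model matching the current region
    information a high first weight and split the remainder evenly over
    the other region language models, so that all weights sum to 1."""
    others = [r for r in regions if r != current_region]
    if not others:
        return {current_region: 1.0}
    # Round to avoid floating-point noise in the evenly split remainder.
    second_weight = round((1.0 - first_weight) / len(others), 6)
    weights = {r: second_weight for r in others}
    weights[current_region] = first_weight
    return weights

print(assign_region_weights(["Beijing", "Shenyang"], "Beijing"))
# {'Shenyang': 0.1, 'Beijing': 0.9}, matching the example above
```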
In this embodiment, the accuracy of the speech recognition result can be improved by including the region information in the input information and processing the input information by using the region language model corresponding to the current region information.
Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. This embodiment provides a speech recognition method, described in combination with the architecture diagram shown in fig. 5. The method includes:
401. Acquire input information, where the input information includes: voice information and current region information.
402. Assign a weight to each of a plurality of preconfigured region language models based on the current region information.
Assuming that the weight of the region language model corresponding to the current region information is referred to as a first weight, and the weights of the region language models corresponding to the other region information are referred to as a second weight, the first weight is greater than the second weight.
Referring to fig. 5, a weight adapter may be used to assign the corresponding weight to each region language model. For example, the preconfigured region language models include one corresponding to Beijing and one corresponding to Shenyang; if the current region information in the input information is Beijing, the weight of the Beijing model may be set to 1 and the weights of the other region language models, such as the Shenyang model, to 0.
The weight adapter may be specified according to prior experience, or a weight regression model may be trained on relevant corpora so that the weight adapter can determine the weight of each region language model based on the current region information.
403. Process the voice information with the acoustic model to obtain acoustic model output information.
404. Process the acoustic model output information with the region language models to obtain first text content, and determine a first score for the first text content.
405. Process the acoustic model output information with the general language model to obtain second text content, and determine a second score for the second text content.
As shown in fig. 5, a decoder may be used to perform speech recognition on speech information, that is, the decoder may respectively use an acoustic model, a regional language model, and a general language model to perform corresponding processing.
406. Select the first text content or the second text content based on the first score and the second score, and determine the selected text content as the speech recognition result.
The first score may be determined from the weight of the region language model and the probability value of the corresponding text content; the second score may be determined from the probability value of the corresponding text content alone; the text content with the highest score is then selected as the speech recognition result. See the example above for details.
In addition, after the voice recognition result is obtained, the operation corresponding to the voice recognition result may be performed.
For example, if the user inputs the voice information corresponding to "I want to go to Lingjing Hutong" into the map application and the current region information is Beijing, a navigation route to the Beijing address "Lingjing Hutong" can be displayed to the user.
For another example, for common utterances such as the voice corresponding to "a nearby library", the text content obtained from the general language model, that is, "a nearby library", scores higher; the operation of querying nearby libraries may then be performed and the queried library information displayed.
In this embodiment, by including region information in the input information, processing it with the region language model corresponding to the current region information, and assigning that model a higher weight, an address appropriate to the current region can be obtained when the voice information concerns an address, improving the accuracy of the speech recognition result and the user experience.
It should be noted that, in the embodiments of the present disclosure, the entity executing the speech recognition method may obtain the user's speech information and geographic information through public and legally compliant means, for example from a public data set, or from the user after the user's authorization. The speech recognition in the embodiments of the present disclosure is performed after user authorization, and the process complies with relevant laws and regulations. It is not targeted at any specific user and does not reflect any specific user's personal information.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure, which provides a speech recognition apparatus. As shown in fig. 6, the speech recognition apparatus 600 includes an acquisition module 601, a recognition module 602, and a first processing module 603.
The obtaining module 601 is configured to obtain input information, where the input information includes: voice information and current region information; the recognition module 602 is configured to obtain recognition content to be recognized based on the voice information; and the first processing module 603 is configured to process the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
In some embodiments, the text content obtained with the region language model is a first text content, and the apparatus may further include: a second processing module configured to process the recognition content with a general language model to obtain second text content corresponding to the voice information.
In some embodiments, the apparatus may further include: a determining module configured to determine a first score corresponding to the first text content and a second score corresponding to the second text content; and a selection module configured to select the first text content or the second text content based on the first score and the second score, and to determine the selected text content as the speech recognition result.
In some embodiments, the determining module is specifically configured to: and determining a first score corresponding to the first text content based on the weight of the region language model corresponding to the current region information and the probability value corresponding to the first text content determined by the region language model corresponding to the current region information.
In some embodiments, the region language models further include region language models corresponding to region information other than the current region information, and the apparatus is further configured to assign, based on the current region information, a first weight to the region language model corresponding to the current region information, and a second weight to the region language models corresponding to the other region information, where the first weight is greater than the second weight.
In some embodiments, the recognition module 602 is specifically configured to: process the voice information with an acoustic model to obtain acoustic model output information, and determine the acoustic model output information as the recognition content to be recognized.
In this embodiment, the accuracy of the speech recognition result can be improved by including the region information in the input information and processing the input information by using the region language model corresponding to the current region information.
It is to be understood that in the disclosed embodiments, the same or similar elements in different embodiments may be referenced.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the electronic device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the speech recognition method by any other suitable means (e.g., by means of firmware).
The systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that, when executed by the processor or controller, it causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that remedies the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A speech recognition method comprising:
acquiring input information, wherein the input information comprises: voice information and current region information;
obtaining recognition content to be recognized based on the voice information;
and processing the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
2. The method of claim 1, wherein the text content obtained by using the region language model is a first text content, and the method further comprises:
and processing the identified content by adopting a general language model to obtain second text content corresponding to the voice information.
3. The method of claim 2, further comprising:
determining a first score corresponding to the first text content and a second score corresponding to the second text content;
and selecting the first text content or the second text content based on the first score and the second score, and determining the selected text content as the speech recognition result.
4. The method of claim 3, wherein the determining a first score corresponding to the first text content comprises:
and determining a first score corresponding to the first text content based on the weight of the region language model corresponding to the current region information and the probability value corresponding to the first text content determined by the region language model corresponding to the current region information.
5. The method of claim 4, wherein the region language models further include region language models corresponding to region information other than the current region information, and the method further comprises:
assigning, based on the current region information, a first weight to the region language model corresponding to the current region information, and a second weight to the region language models corresponding to the other region information, wherein the first weight is greater than the second weight.
6. The method according to any one of claims 1-5, wherein the obtaining of the recognition content to be recognized based on the speech information comprises:
and processing the voice information by adopting an acoustic model to obtain acoustic model output information, and determining the acoustic model output information as the identification content to be identified.
7. A speech recognition apparatus comprising:
an obtaining module, configured to obtain input information, where the input information includes: voice information and current region information;
a recognition module configured to obtain recognition content to be recognized based on the voice information;
and a first processing module configured to process the recognition content with a region language model corresponding to the current region information to obtain text content corresponding to the voice information.
8. The apparatus as claimed in claim 7, wherein the text content obtained by using the region language model is a first text content, the apparatus further comprising:
and the second processing module is used for processing the identified content by adopting a general language model so as to obtain a second text content corresponding to the voice information.
9. The apparatus of claim 8, further comprising:
a determining module configured to determine a first score corresponding to the first text content and a second score corresponding to the second text content;
and a selection module configured to select the first text content or the second text content based on the first score and the second score, and to determine the selected text content as the speech recognition result.
10. The apparatus of claim 9, wherein the determining module is specifically configured to:
and determining a first score corresponding to the first text content based on the weight of the region language model corresponding to the current region information and the probability value corresponding to the first text content determined by the region language model corresponding to the current region information.
11. The apparatus of claim 10, wherein the region language models further include region language models corresponding to region information other than the current region information, and the apparatus is further configured to:
assign, based on the current region information, a first weight to the region language model corresponding to the current region information, and a second weight to the region language models corresponding to the other region information, wherein the first weight is greater than the second weight.
12. The apparatus according to any one of claims 7-11, wherein the identification module is specifically configured to:
and processing the voice information by adopting an acoustic model to obtain acoustic model output information, and determining the acoustic model output information as the identification content to be identified.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202110455740.4A 2021-04-26 2021-04-26 Speech recognition method, apparatus, device and storage medium Active CN113223522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110455740.4A CN113223522B (en) 2021-04-26 2021-04-26 Speech recognition method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110455740.4A CN113223522B (en) 2021-04-26 2021-04-26 Speech recognition method, apparatus, device and storage medium

Publications (2)

Publication Number Publication Date
CN113223522A (en) 2021-08-06
CN113223522B (en) 2022-05-03

Family

ID=77089135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110455740.4A Active CN113223522B (en) 2021-04-26 2021-04-26 Speech recognition method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN113223522B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200007983A (en) * 2020-01-03 2020-01-22 엘지전자 주식회사 Regional features based speech recognition method and system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000007983U (en) * 1998-10-13 2000-05-06 박철웅 Connecting plate for floor hinge
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
US20190011278A1 (en) * 2017-07-06 2019-01-10 Here Global B.V. Method and apparatus for providing mobility-based language model adaptation for navigational speech interfaces
US20190348023A1 (en) * 2018-05-11 2019-11-14 Samsung Electronics Co., Ltd. Device and method to personalize speech recognition model
CN109243461A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium
CN109243468A (en) * 2018-11-14 2019-01-18 北京羽扇智信息科技有限公司 Audio recognition method, device, electronic equipment and storage medium
CN111326147A (en) * 2018-12-12 2020-06-23 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111862944A (en) * 2019-04-30 2020-10-30 北京嘀嘀无限科技发展有限公司 Speech recognition apparatus, method, electronic device, and computer-readable storage medium
CN110427459A (en) * 2019-08-05 2019-11-08 苏州思必驰信息科技有限公司 Visualized generation method, system and the platform of speech recognition network
CN110473524A (en) * 2019-08-30 2019-11-19 苏州思必驰信息科技有限公司 The construction method and device of speech recognition system
CN110473523A (en) * 2019-08-30 2019-11-19 北京大米科技有限公司 A kind of audio recognition method, device, storage medium and terminal
CN111984759A (en) * 2020-06-29 2020-11-24 百度在线网络技术(北京)有限公司 Question and answer processing method and device, electronic equipment and storage medium
CN112102832A (en) * 2020-09-18 2020-12-18 广州小鹏汽车科技有限公司 Speech recognition method, speech recognition device, server and computer-readable storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690581A (en) * 2023-12-13 2024-03-12 江苏济远医疗科技有限公司 Disease inquiry process auxiliary information generation method based on large language model

Also Published As

Publication number Publication date
CN113223522B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN112597754A (en) Text error correction method and device, electronic equipment and readable storage medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
EP3732629A1 (en) Training sequence generation neural networks using quality scores
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112307188B (en) Dialog generation method, system, electronic device and readable storage medium
CN113836278A (en) Training and dialogue generating method and device for general dialogue model
CN112632987B (en) Word slot recognition method and device and electronic equipment
CN113282736B (en) Dialogue understanding and model training method, device, equipment and storage medium
CN117312140A (en) Method and device for generating test case, electronic equipment and storage medium
CN113223522B (en) Speech recognition method, apparatus, device and storage medium
CN112767924A (en) Voice recognition method and device, electronic equipment and storage medium
CN113743127B (en) Task type dialogue method, device, electronic equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium
CN113204613B (en) Address generation method, device, equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN112507712B (en) Method and device for establishing slot identification model and slot identification
CN113129894A (en) Speech recognition method, speech recognition device, electronic device and storage medium
CN114492793A (en) Model training and sample generating method, device, equipment and storage medium
CN109036379B (en) Speech recognition method, apparatus and storage medium
CN113361575A (en) Model training method and device and electronic equipment
CN113051926A (en) Text extraction method, equipment and storage medium
US20230085458A1 (en) Dialog data generating
CN113114851B (en) Incoming call intelligent voice reply method and device, electronic equipment and storage medium
CN115965018B (en) Training method of information generation model, information generation method and device
CN113223500B (en) Speech recognition method, method for training speech recognition model and corresponding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant