CN113763925B - Speech recognition method, device, computer equipment and storage medium - Google Patents

Speech recognition method, device, computer equipment and storage medium

Info

Publication number
CN113763925B (application number CN202110578432.0A)
Authority
CN
China
Prior art keywords
hotword
voice
auxiliary content
model
target
Prior art date
Legal status
Active
Application number
CN202110578432.0A
Other languages
Chinese (zh)
Other versions
CN113763925A (en)
Inventor
曹立新
苏丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110578432.0A
Publication of CN113763925A
Application granted
Publication of CN113763925B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
        • G10L15/01 Assessment or evaluation of speech recognition systems
        • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
            • G10L15/063 Training
        • G10L15/08 Speech classification or search
            • G10L15/16 Speech classification or search using artificial neural networks
        • G10L15/26 Speech to text systems

Abstract

The application relates to a speech recognition method, apparatus, computer device, and storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target speech, the target speech being real-time speech collected in a specified environment; determining a first speech recognition model based on auxiliary content, the auxiliary content comprising at least one of a speech recognition result of historical speech collected in the specified environment and file content of a target file displayed in the specified environment; decoding the target speech based on the first speech recognition model to obtain candidate recognition results output by the first speech recognition model; and performing probability prediction processing on the candidate recognition results to obtain a speech recognition result of the target speech. With this scheme, the first speech recognition model used to recognize the target speech can be adaptively adjusted within the specified environment, improving the accuracy of speech recognition.

Description

Speech recognition method, device, computer equipment and storage medium
Technical Field
Embodiments of the application relate to the technical field of artificial intelligence, and in particular to a speech recognition method, apparatus, computer device, and storage medium.
Background
With the continuing development of artificial intelligence, AI techniques, including speech recognition, are finding ever wider use in daily life.
In the related art, the conventional approach to speech recognition is to train a general speech recognition model in advance, input the speech into that model, process it through a fixed acoustic model and a fixed language model, and output the corresponding recognition result.
However, because such a speech recognition model is fixed once trained for a specified subject, when it is used to recognize speech content, recognition accuracy may degrade whenever the subject of the speech content changes.
Disclosure of Invention
Embodiments of the application provide a speech recognition method, apparatus, computer device, and storage medium that can improve the accuracy of speech recognition. The technical scheme is as follows:
In one aspect, a speech recognition method is provided, the method comprising:
acquiring a target speech, the target speech being real-time speech collected in a specified environment;
determining a first speech recognition model based on auxiliary content, the auxiliary content comprising at least one of a speech recognition result of historical speech collected in the specified environment and file content of a target file displayed in the specified environment;
decoding the target speech based on the first speech recognition model to obtain candidate recognition results output by the first speech recognition model; and
performing probability prediction processing on the candidate recognition results to obtain a speech recognition result of the target speech.
In yet another aspect, a speech recognition apparatus is provided, the apparatus comprising:
a speech acquisition module, configured to acquire a target speech, the target speech being real-time speech collected in a specified environment;
a model determining module, configured to determine a first speech recognition model based on auxiliary content, the auxiliary content comprising at least one of a speech recognition result of historical speech collected in the specified environment and file content of a target file displayed in the specified environment;
a candidate acquisition module, configured to decode the target speech based on the first speech recognition model to obtain candidate recognition results output by the first speech recognition model; and
a result acquisition module, configured to process the candidate recognition results to obtain a speech recognition result of the target speech.
In one possible implementation, the language model used in the first speech recognition model is determined from at least two candidate language models, each candidate language model corresponding to its own domain category;
the model determining module includes:
a domain determining submodule, configured to determine a target domain category based on the auxiliary content before the target speech is processed by the first speech recognition model to obtain the candidate recognition results; and
a language model determining submodule, configured to determine, among the at least two candidate language models, the candidate language model corresponding to the target domain category as the language model used in the first speech recognition model.
In one possible implementation, the domain determining submodule includes:
a domain probability acquisition unit, configured to input the auxiliary content into a domain detection model to obtain a domain probability distribution, the domain probability distribution indicating the probability that the auxiliary content corresponds to each domain category, and the domain detection model being trained on auxiliary content samples and the domain categories corresponding to those samples; and
a domain determining unit, configured to determine the target domain category based on the domain probability distribution.
In one possible implementation, the apparatus further includes:
a prediction probability acquisition module, configured to input an auxiliary content sample into the domain detection model before the target speech is acquired, obtaining a predicted domain probability distribution; and
a first model updating module, configured to update the parameters of the domain detection model based on the predicted domain probability distribution and the domain category corresponding to the auxiliary content sample.
In one possible implementation, the apparatus further includes:
a hotword information acquisition module, configured to acquire hotword information of at least one hotword corresponding to the auxiliary content before the target speech is processed by the first speech recognition model to obtain the candidate recognition results, the hotword information comprising the probability distribution of the hotword;
the candidate acquisition module includes:
a candidate acquisition submodule, configured to input the target speech and the hotword information of the at least one hotword corresponding to the auxiliary content into the first speech recognition model for decoding, obtaining the candidate recognition results output by the first speech recognition model.
In one possible implementation, the hotword information acquisition module includes:
a first information extraction submodule, configured to extract first hotword information corresponding to a first hotword contained in the auxiliary content;
a second information extraction submodule, configured to determine, from a word network based on the first hotword, second hotword information corresponding to related words of the first hotword, the word network being a graph data structure with words as vertices and the relations between words as edges; and
an information merging submodule, configured to merge the first hotword information and the second hotword information into the hotword information of the at least one hotword corresponding to the auxiliary content.
In one possible implementation, the first information extraction submodule includes:
a first information acquisition unit, configured to input the auxiliary content into a hotword detection model and obtain the first hotword information, corresponding to the first hotword, output by the hotword detection model, the hotword detection model being trained on auxiliary content samples and the hotwords contained in those samples.
In one possible implementation, the apparatus further includes:
a prediction information acquisition module, configured to input an auxiliary content sample into the hotword detection model before the target speech is acquired, obtaining the hotword information of a predicted hotword output by the hotword detection model; and
a second model updating module, configured to update the parameters of the hotword detection model based on the hotword information of the predicted hotword and the hotwords in the auxiliary content sample.
In one possible implementation, the result acquisition module includes:
a score acquisition submodule, configured to input the candidate recognition results and probability distribution information corresponding to the auxiliary content into a second language model, obtaining the prediction score for each candidate recognition result output by the second language model; and
a content determining submodule, configured to determine the target recognition content based on the prediction scores.
In one possible implementation, the probability distribution information corresponding to the auxiliary content includes:
the domain probability distribution, and the hotword information of the at least one hotword corresponding to the auxiliary content;
the domain probability distribution indicating the probability that the auxiliary content corresponds to each domain category, and the hotword information comprising the probability distribution of the corresponding hotword.
In one possible implementation, the apparatus further includes:
a probability acquisition module, configured to acquire probability distribution information corresponding to an auxiliary content sample before the target speech is acquired;
the model determining module, configured to determine the language model used in the first speech recognition model based on the domain probability distribution corresponding to the auxiliary content sample;
a candidate sample acquisition module, configured to input the hotword information of at least one hotword corresponding to the auxiliary content sample, together with the speech sample corresponding to the auxiliary content sample, into the first speech recognition model, obtaining candidate recognition result samples output by the first speech recognition model;
a prediction result acquisition module, configured to input the probability distribution information corresponding to the auxiliary content sample and the candidate recognition result samples into the second speech recognition model, obtaining a predicted speech recognition result output by the second speech recognition model; and
an updating module, configured to update the parameters of the first speech recognition model and the second speech recognition model based on the predicted speech recognition result and the ground-truth text corresponding to the speech sample.
In one possible implementation, the specified environment is at least one of a conference environment, a video playing environment, and a smart home environment, and the target file is a file used in the specified environment.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, code set, or instruction set being loaded and executed by the processor to implement a speech recognition method as described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement a speech recognition method as described above is provided.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech recognition method provided in the various alternative implementations of the above aspects.
The technical scheme provided by the application can include the following beneficial effects:
In the scheme shown in the embodiments of the application, the language model in the first speech recognition model is determined in real time from at least one of the speech recognition results of historical speech in the specified environment and the file content of the displayed target file. The first speech recognition model performing speech recognition on the target speech can therefore adapt itself within the specified environment, avoiding the situation where a fixed, pre-trained speech recognition model fails to recognize part of the speech clearly, and thereby improving the accuracy of speech recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart illustrating a method of speech recognition according to an exemplary embodiment;
FIG. 2 is a flow chart of a speech recognition process according to the embodiment of FIG. 1;
FIG. 3 is a schematic diagram of a speech recognition system, shown according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a method of speech recognition, according to an example embodiment;
FIG. 5 is a schematic illustration of speech recognition in a conference system according to the embodiment of FIG. 4;
FIG. 6 is a schematic diagram of a voice recognition system in a conference scenario, according to an example embodiment;
FIG. 7 is a block diagram of a speech recognition device, according to an example embodiment;
FIG. 8 is a schematic diagram of a computer device, according to an example embodiment;
FIG. 9 is a block diagram of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the application as detailed in the appended claims.
It should be understood that "a number of" herein means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the surrounding objects.
The scheme shown in the subsequent embodiments of the application can be implemented by means of artificial intelligence (Artificial Intelligence, AI): auxiliary content collected in a specified environment is used to determine the first speech recognition model, the target speech is decoded by that model to obtain candidate recognition results, and probability prediction processing over the candidates yields the final speech recognition result, thereby improving the accuracy of speech recognition. For ease of understanding, terms involved in the embodiments of the present disclosure are described below.
1) Artificial intelligence (AI);
AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly covers computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and smart video services. As the technology develops, artificial intelligence will be applied in still more fields and take on ever greater value.
2) Natural language processing (Natural Language Processing, NLP);
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. NLP is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
3) Machine Learning (ML);
Machine learning is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications run through every area of AI. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
4) Speech technology (Speech Technology);
The key technologies of speech technology are automatic speech recognition (Automatic Speech Recognition, ASR) and speech synthesis (Text-To-Speech, TTS). Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the most convenient modes of human-computer interaction.
5) Speech recognition technology;
Speech recognition technology converts speech into text. A speech recognition system mainly consists of an acoustic model, a language model, and a decoder. The acoustic model is obtained by statistically modeling the pronunciation distribution in sound data; taking syllable modeling as an example, it describes the probability that a segment of speech corresponds to each syllable. The language model is a statistical model of the language obtained by counting the grammar distribution in a text corpus; it describes the probability that a text string constitutes natural language. The decoder is the speech recognition engine: using the acoustic model and the language model, it takes the user's speech as input, searches a decoding network, and finally obtains the recognized text result. The decoding performed by this decoder is also called one-pass decoding. After one-pass recognition, a second-pass recognition may follow; the decoder used for it is called the two-pass decoder, whose input is the output of the one-pass decoder, typically several candidate texts for a segment of speech. Building on the one-pass result, two-pass decoding uses a more accurately constructed neural network language model to re-score and re-rank the outputs of the one-pass decoder, yielding the best recognition result; this process is called two-pass decoding.
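To make this one-pass/two-pass flow concrete, the following is a minimal Python sketch of the interfaces involved. All names here (the Hypothesis type, the acoustic_model.search and neural_lm.score methods, and the interpolation weight) are illustrative assumptions, not the patent's implementation or a real library API.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str         # candidate transcript
    am_score: float   # acoustic model log-score from one-pass decoding
    lm_score: float   # N-gram language model log-score from one-pass decoding

def one_pass_decode(audio, acoustic_model, ngram_lm, n_best=10):
    """One-pass decoding: search the decoding network with the acoustic
    model and N-gram language model, keeping the top-N hypotheses."""
    # A real engine runs a beam search over a decoding graph; only the
    # interface is sketched here (assumed method, not a real API).
    return acoustic_model.search(audio, ngram_lm, n_best)  # -> list[Hypothesis]

def two_pass_decode(hypotheses, neural_lm, nn_weight=0.5):
    """Two-pass decoding: re-score the N-best list with a neural network
    language model and return the best re-ranked transcript."""
    def combined_score(h: Hypothesis) -> float:
        return h.am_score + h.lm_score + nn_weight * neural_lm.score(h.text)
    return max(hypotheses, key=combined_score).text
```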
6) Cloud computing;
cloud Computing (Cloud Computing) refers to the delivery and usage model of an IT infrastructure, meaning that required resources are obtained in an on-demand, easily scalable manner over a network; generalized cloud computing refers to the delivery and usage patterns of services, meaning that the required services are obtained in an on-demand, easily scalable manner over a network. Such services may be IT, software, internet related, or other services. Cloud Computing is a product of fusion of traditional computer and network technology developments such as Grid Computing (Grid Computing), distributed Computing (Distributed Computing), parallel Computing (Parallel Computing), utility Computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization), load balancing (Load balancing), and the like.
With the development of the internet, real-time data streams, the diversification of connected devices, and the growing demand for search services, social networks, mobile commerce, and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will, in concept, drive a revolutionary change in the whole internet model and in enterprise management models.
7) Cloud conferencing;
Cloud conferencing is an efficient, convenient, low-cost form of conferencing based on cloud computing technology. Through a simple internet interface, users can quickly and efficiently share voice, data files, and video with teams and customers around the world, while the cloud conference service provider handles complex technologies such as data transmission and processing within the conference.
At present, domestic cloud conferencing mainly focuses on service content centered on the SaaS (Software as a Service) mode, including telephone, network, and video service forms; a video conference based on cloud computing is called a cloud conference.
In the cloud conference era, the transmission, processing, and storage of data are all handled by the computing resources of the video conference provider, so users can hold efficient remote conferences without purchasing expensive hardware or installing cumbersome software.
A cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security, and availability. In recent years, video conferencing has become popular with many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management; it is now widely used in transportation, finance, telecom operators, education, enterprises, and other fields. With cloud computing applied, video conferencing is even more attractive in its convenience, speed, and ease of use, which will surely stimulate wider adoption.
The scheme provided by the embodiment of the application relates to artificial intelligence natural language processing, machine learning and other technologies, and is specifically described through the following embodiments.
Fig. 1 is a flowchart illustrating a speech recognition method according to an exemplary embodiment. The speech recognition method may be performed by a computer device; for example, the computer device may include at least one of a terminal or a server. As shown in fig. 1, the speech recognition method includes the following steps:
step 101, acquiring target voice, wherein the target voice is real-time voice acquired in a specified environment.
In the embodiment of the application, the user sends out real-time voice in a specified environment, and the computer equipment can collect the real-time voice and acquire the real-time voice as target voice, wherein the target voice needs to be subjected to subsequent voice recognition processing.
Wherein the computer device may collect real-time speech of the user through a single microphone or microphone array.
In one possible implementation, real-time speech of at least one user is collected in the specified environment at the same time, and the real-time speech corresponding to each user is obtained as a separate target speech.
For example, if within the same time period the collected real-time speech of user A is speech a and the collected real-time speech of user B is speech b, the computer device obtains speech a as target speech A and speech b as target speech B, respectively.
The specified environment is an environment that supports speech recognition. The real-time speech collected in the specified environment is collected either when a user speaks in that environment or when speech is played back by a device with an audio playback function.
For example, the specified environment may be an online or offline conference environment in which the participants' real-time speeches need to be recorded, a video playing environment in which synchronized subtitles need to be added, or a smart home environment in which a user's voice command needs to be recognized in order to control the smart device the command addresses.
Step 102, determining a first speech recognition model based on auxiliary content; the auxiliary content comprises at least one of a speech recognition result of historical speech collected in the specified environment and file content of a target file displayed in the specified environment.
The first speech recognition model comprises an acoustic model, a language model, and a one-pass decoder that uses them. The language model is determined by the computer device, according to the received auxiliary content, from at least two candidate language models pre-stored on the device. The acoustic model, built by counting the pronunciation distribution in sound data, describes the probability that a segment of speech corresponds to each syllable; the language model, built by counting the grammar distribution in a text corpus, describes the probability that a text string constitutes natural language. The one-pass decoder is the engine of the first speech recognition model: it takes the speech to be recognized as input and, using the acoustic model and the language model, searches a decoding network to obtain the recognized text result; decoding speech with the one-pass decoder is one-pass decoding.
In one possible implementation, before the computer device acquires the target speech, previously received speech recognition results are kept as the speech recognition results of historical speech; at least one of those results and the file content of the target file displayed in the specified environment is taken as the auxiliary content, and the language model in the first speech recognition model is determined based on that auxiliary content.
Step 103, decoding the target speech based on the first speech recognition model to obtain candidate recognition results output by the first speech recognition model.
In the embodiment of the application, the computer device decodes the acquired target speech with the determined first speech recognition model, obtaining the candidate recognition results output by that model.
Step 104, performing probability prediction processing on the candidate recognition results to obtain the speech recognition result of the target speech.
In the embodiment of the application, probability prediction processing is performed on the candidate recognition results output by the first speech recognition model, and the candidate with the highest predicted probability is taken as the speech recognition result corresponding to the target speech.
In one possible implementation, the probability prediction processing is performed on the candidate recognition results by a second speech recognition model, which outputs the speech recognition result of the target speech.
For example, if the candidate recognition results output by the first speech recognition model are text A, text B, and text C, processing them with the second speech recognition model may yield text A as the speech recognition result corresponding to the target speech.
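As a toy illustration of this selection step (the probability values below are made up for the example, not taken from the patent):

```python
# Hypothetical probabilities predicted by the second model for the three
# candidates above; the highest-scoring candidate becomes the final result.
predicted = {"text A": 0.85, "text B": 0.10, "text C": 0.05}
speech_recognition_result = max(predicted, key=predicted.get)  # "text A"
```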
For example, fig. 2 is a flowchart of a speech recognition flow according to an embodiment of the present application. As shown in fig. 2, taking a conference scenario as the specified environment, the speech of a participant 21 is sent over the network to the one-pass decoder 22, which decodes it with an acoustic model and a language model to obtain several candidate recognition results, i.e. the N-best list (the top-N candidate hypotheses). Each N-best entry may carry the N-Gram language model score and the acoustic model score of the one-pass decoder. The N-best results are then sent to the two-pass decoder 23, which re-ranks them using the one-pass N-Gram language model score, the acoustic model score, and its own neural network language model score to obtain the best recognition result, and returns that result to the participant's terminal.
The two-pass decoder 23 is the decoder used for second-pass speech recognition: taking the one-pass decoding results as input, it performs two-pass decoding, re-scoring and re-ranking those results with the neural network language model to obtain the final, best recognition result.
In summary, in the scheme shown in the embodiment of the present application, the language model in the first speech recognition model is determined in real time from at least one of the speech recognition results of historical speech in the specified environment and the file content of the displayed target file. The first speech recognition model performing recognition on the target speech can therefore adapt itself within the specified environment, avoiding the situation where a fixed, pre-trained speech recognition model fails to recognize part of the speech clearly, thereby improving the accuracy of speech recognition.
The scheme shown in the embodiment of the application can be applied to any scene needing speech recognition.
For example, for online or offline conferences, the scheme shown in the embodiments of the application can combine the files related to the conference to recognize the participants' speeches during the conference and generate text records in real time, improving the efficiency of conference minutes and, while keeping the recorded content accurate, reducing the burden of writing down the participants' speeches.
For another example, when adding subtitles to a video work, the scheme can combine the picture being played in real time to perform speech recognition and add the recognized text below the corresponding picture, thus subtitling the video work while improving subtitle recognition accuracy.
For another example, for controlling a specified smart device in a smart home environment, the current speech can be recognized in combination with the user's recent historical speech recognition content and the files the user has uploaded to the smart home system, and the specified smart device is controlled based on the recognized content, improving its response speed.
In one exemplary aspect, the present application relates to a speech recognition system that includes a speech recognition processing portion and a model training and updating portion. FIG. 3 is a schematic diagram of a speech recognition system, according to an exemplary embodiment. As shown in fig. 3, in the speech recognition processing portion, when the target speech is acquired, a first speech recognition model is determined based on the acquired auxiliary content; the target speech is then input into the determined first speech recognition model, which processes it and outputs the corresponding candidate recognition results; the candidate recognition results are then input into a second speech recognition model, which outputs the speech recognition result corresponding to the target speech. In the model training and updating portion, the model training device 310 updates the first speech recognition model and the second speech recognition model with each set of speech samples, and the updated second speech recognition model may be uploaded to the cloud or the database for use by the speech recognition processing portion.
The model training device 310 may be a computer device with machine learning capability; for example, it may be a stationary computer device such as a personal computer, a server, or fixed scientific research equipment, or a mobile computer device such as a tablet computer or an e-book reader. The embodiments of the present application do not limit the specific type of the model training device 310.
The terminal 340 may be a computer device with a screen display function. The server 330 may be a background server of the terminal 340; an independent physical server; a server cluster or distributed system composed of multiple physical servers; or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
In one possible implementation, taking the application of the speech recognition system in a conference scenario as an example: when the terminal 340 collects a user's real-time speech, the auxiliary content in the conference scenario may be the text and picture content of a presentation document provided by the user, or the speech recognition results of historical speech. The real-time speech is sent to the server 330 as the target speech; the corresponding first speech recognition model and second speech recognition model are obtained from the server 330, and recognition of the target speech is completed on the computer device by running the target speech and the auxiliary content through the two models, where the language model in the first speech recognition model may be selected based on the auxiliary content.
The terminal 340 and the server 330 may be connected through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a local area network (Local Area Network, LAN), a metropolitan area network (Metropolitan Area Network, MAN), a wide area network (Wide Area Network, WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination thereof. In some embodiments, data exchanged over the network is represented using techniques and/or formats such as HyperText Markup Language (HTML) and Extensible Markup Language (XML). All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may be used in place of, or in addition to, the techniques described above. The application is not limited in this respect.
Fig. 4 is a flowchart illustrating a speech recognition method according to an exemplary embodiment. The method can be applied in a speech recognition system; for example, the system may be as shown in FIG. 3, and the method may be performed by the model training device 310, the server 330, and the terminal 340. As shown in fig. 4, the speech recognition method includes the following steps:
In step 401, a target speech is acquired.
In the embodiment of the application, the terminal collects the target speech through a speech collection component and sends it to the server, which obtains the target speech.
The target speech is real-time speech collected in a specified environment.
Optionally, the specified environment is at least one of a conference environment, a video playing environment, and a smart home environment.
For example, in a conference environment, when participants speak, the terminal can collect their speech through a collection assembly comprising a single microphone or a microphone array and upload it as the target speech to a cloud or background server for processing. In a video playing environment, while adding subtitles, the terminal directly takes the audio of the video being played on the current terminal or another terminal as the target speech, takes the picture being played in real time as auxiliary content, and uploads both to a cloud or background server for processing. In a smart home environment, the terminal collects the voice commands issued by the user in the environment and uploads them as the target speech to a cloud or background server for processing.
In step 402, a target domain category is determined based on the auxiliary content.
In the embodiment of the application, while collecting real-time speech in the specified environment, the terminal can simultaneously acquire the auxiliary content, and the server determines the corresponding target domain category based on it.
The auxiliary content comprises at least one of a speech recognition result of historical speech collected in the specified environment and file content of a target file displayed in the specified environment.
In one possible implementation, when the specified environment is a meeting environment, the target file is a document file used in the meeting; when the designated environment is a video playing environment, the target file is a video file being played in the video playing environment; when the specified environment is an intelligent home environment, the target file is a file uploaded to the intelligent home system by a user in the intelligent home environment.
That is, the target file may be a presentation document shown by a participant during the conference, and the auxiliary content may include the text and picture content of that document. The speech recognition results of historical speech that the auxiliary content may include can be stored on the server: when recognition of a target speech is completed, the resulting recognition result is stored on the background or cloud server and called up as the historical speech recognition result in the next round of recognition.
In one possible implementation, the auxiliary content is input into a domain detection model to obtain a domain probability distribution, and the target domain category is determined based on that distribution.
The domain probability distribution indicates the probability that the auxiliary content corresponds to each domain category. The domain detection model is trained on auxiliary content samples and the domain categories corresponding to those samples.
The server stores the trained domain detection model; the obtained auxiliary content is input into it, and it outputs the domain probability distribution corresponding to the auxiliary content.
In one possible implementation, the domain with the highest probability in the domain probability distribution is determined as the target domain category.
For example, if auxiliary content A is input into the domain detection model, after processing the model outputs a probability distribution over the domain categories, e.g. "automotive domain: 0.8, fashion domain: 0.1, pet domain: 0.02, ...". Since the automotive domain has the largest probability in the distribution, it can be determined as the target domain category.
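A minimal sketch of this inference step, assuming the domain detection model is a text classifier that maps tokenized auxiliary content to one logit per domain category; the tokenizer and model interfaces are placeholders, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def detect_target_domain(model, tokenizer, auxiliary_text, domain_names):
    """Run the domain detection model on the auxiliary content and pick
    the domain category with the highest probability."""
    token_ids = torch.tensor([tokenizer(auxiliary_text)])  # shape (1, seq_len)
    logits = model(token_ids)                              # shape (1, num_domains)
    probs = F.softmax(logits, dim=-1).squeeze(0)           # domain probability distribution
    distribution = dict(zip(domain_names, probs.tolist()))
    target = max(distribution, key=distribution.get)       # e.g. "automotive"
    return distribution, target
```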
In addition, the server may receive the trained domain detection model in advance, before determining the target domain category based on the auxiliary content.
In one possible implementation, the model training process of the domain detection model is: input an auxiliary content sample into the domain detection model to obtain a predicted domain probability distribution, then update the parameters of the domain detection model based on the predicted domain probability distribution and the domain category corresponding to the auxiliary content sample.
By way of example, the auxiliary content sample is converted into matrix or vector form and input into the domain detection model, with the probability of the sample's true domain set to 1 and all other domain probabilities set to 0. The matrix or vector output by the domain detection model contains the probability distribution over the predicted domains; a loss is computed between the model's output and the label matrix or vector generated from the auxiliary content sample, and the model parameters of the domain detection model are then updated.
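A sketch of one such parameter update under the same assumptions, using cross-entropy against the one-hot domain label (true domain 1, others 0); the loss choice and interfaces are illustrative, not specified by the patent:

```python
import torch
import torch.nn.functional as F

def domain_train_step(model, optimizer, sample_ids, true_domain_index):
    """One update of the domain detection model: cross-entropy between the
    predicted domain distribution and the one-hot domain label."""
    optimizer.zero_grad()
    logits = model(sample_ids)                  # shape (1, num_domains)
    target = torch.tensor([true_domain_index])  # class-index form of the one-hot label
    loss = F.cross_entropy(logits, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```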
In step 403, the candidate language model corresponding to the target domain category, among the at least two candidate language models, is determined as the language model used in the first speech recognition model.
In the embodiment of the application, the server stores at least two candidate language models, each corresponding to one domain category. Based on the target domain category determined from the domain probability distribution output by the domain detection model, the candidate language model corresponding to that category is selected as the language model used in the first speech recognition model.
Each of the at least two candidate language models corresponds to its own domain category; they are statistical language models for multiple domains obtained by offline training on large amounts of text data. The statistical language model may be an N-Gram model.
N-Gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text, byte by byte, forming a sequence of byte fragments of length N. Each fragment is called a gram; the occurrence frequency of all grams is counted and filtered by a preset threshold to form a key gram list, i.e. the vector feature space of the text, in which each gram is one feature dimension. The model assumes that the occurrence of the Nth word depends only on the preceding N-1 words and on nothing else, so the probability of a whole sentence is the product of the occurrence probabilities of its words; these probabilities can be obtained by directly counting how often N words occur together in the corpus.
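The following sketch illustrates the two ideas in this description, the size-N sliding window and the sentence probability as a product of conditional word probabilities, using plain unsmoothed counts (real models add smoothing; the tiny corpus is made up for the example):

```python
import math
from collections import Counter

def gram_counts(tokens, n):
    """Slide a window of size n over the token sequence and count each gram."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_log_prob(sentence, n, counts, context_counts):
    """log P(sentence) = sum of log P(w_i | previous n-1 words), each
    estimated as count(context + word) / count(context)."""
    logp = 0.0
    for i in range(n - 1, len(sentence)):
        gram = tuple(sentence[i - n + 1:i + 1])
        context = gram[:-1]
        # Unsmoothed maximum-likelihood estimate; unseen grams would fail here.
        logp += math.log(counts[gram] / context_counts[context])
    return logp

corpus = "the car drives fast and the car stops".split()
bigrams, unigrams = gram_counts(corpus, 2), gram_counts(corpus, 1)
print(sentence_log_prob("the car stops".split(), 2, bigrams, unigrams))
```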
Illustratively, the statistical language models for multiple domains obtained through offline training may include an automotive-domain language model, a real-estate-domain language model, a game-domain language model, a fashion-domain language model, and so on.
Statistical language models dedicated to their respective domains can greatly improve recognition accuracy for speech in those domains. Compared with a traditional, general-purpose statistical language model used across all domains, they improve the specificity of recognition and the overall recognition effect.
In step 404, hotword information of at least one hotword corresponding to the auxiliary content is obtained.
In this embodiment of the present application, after performing domain category identification, the server may further obtain at least one hotword corresponding to the auxiliary content together with its hotword information.
The hotword information includes the probability distribution of the corresponding hotword. The at least one hotword corresponding to the auxiliary content may include hotwords appearing in the auxiliary content itself and related words of those hotwords found in the word network.
In one possible implementation, first hotword information corresponding to a first hotword contained in the auxiliary content is extracted; second hotword information corresponding to related words of the first hotword is determined from the word network based on the first hotword; and the first hotword information and the second hotword information are merged into the hotword information of the at least one hotword corresponding to the auxiliary content.
The word network is a graph data structure with words as vertices and the relations between words as edges.
Illustratively, in the word network, words with close semantic relations are connected by edges, e.g. between "car" and "train" or "plane", between "car brand A" and "car brand B", or between "car brand C" and "car series C". The word network can be a graph data structure built offline: it can be constructed from the relation features between words in a knowledge graph, or mined from large amounts of text data, for example by counting which words frequently appear together, the "co-occurrence" phenomenon. Words with high co-occurrence can be judged to have a tighter relationship, i.e. they are connected by an edge, and the probability associated with that edge is higher.
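A minimal sketch of building such a word network offline from co-occurrence counts; the threshold and the edge-weight scheme are illustrative choices, not the patent's construction:

```python
from collections import Counter
from itertools import combinations

def build_word_network(sentences, min_cooccurrence=2):
    """Words are vertices; an edge links two words that co-occur in at
    least min_cooccurrence sentences, weighted by the co-occurrence count."""
    pair_counts = Counter()
    for words in sentences:                          # each sentence: list of words
        for a, b in combinations(sorted(set(words)), 2):
            pair_counts[(a, b)] += 1
    graph = {}
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            graph.setdefault(a, {})[b] = count
            graph.setdefault(b, {})[a] = count
    return graph

def related_words(graph, word, top_k=5):
    """Neighbors of a word in the network, strongest edges first."""
    neighbors = graph.get(word, {})
    return sorted(neighbors, key=neighbors.get, reverse=True)[:top_k]
```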
When extracting the first hotword information corresponding to the first hotword contained in the auxiliary content, the auxiliary content can be input into a hotword detection model, which outputs the first hotword information corresponding to the first hotword.
The hotword detection model is trained on auxiliary content samples and the hotwords contained in those samples.
In addition, before acquiring the hotword information of the at least one hotword corresponding to the auxiliary content, the server may receive the trained hotword detection model in advance.
In one possible implementation, the model training process of the hotword detection model is: input an auxiliary content sample into the hotword detection model to obtain the hotword information of the predicted hotwords it outputs, then update the parameters of the hotword detection model based on that predicted hotword information and the hotwords contained in the auxiliary content sample.
By way of example, the auxiliary content sample is converted into matrix or vector form and input into the hotword detection model, with the probability of hotword content in the sample set to 1 and non-hotword content set to 0. The matrix or vector output by the hotword detection model contains the probability distribution over the predicted hotwords; a loss is computed between the model's output and the label matrix or vector generated from the auxiliary content sample, and the model parameters of the hotword detection model are then updated.
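Putting the extraction and merging steps together, the following sketch combines hotwords detected in the auxiliary content with related words from the word network. The detect interface and the fixed related-word score are assumed placeholders, and related_words is the helper sketched above:

```python
def collect_hotword_info(auxiliary_text, hotword_model, word_network, top_k=3):
    """Merge (1) first hotword information detected in the auxiliary content
    and (2) second hotword information for related words from the word network."""
    # Assumed interface: returns e.g. {"car brand A": 0.9, "engine": 0.7}
    first_info = hotword_model.detect(auxiliary_text)
    second_info = {}
    for hotword in first_info:
        for related in related_words(word_network, hotword, top_k):
            # Placeholder score for related words; not the patent's formula.
            second_info.setdefault(related, 0.5)
    # Merge, letting directly detected hotwords override related-word scores.
    return {**second_info, **first_info}
```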
In step 405, the target speech and the hotword information of the at least one hotword corresponding to the auxiliary content are input into the first speech recognition model, and the candidate recognition results output by the first speech recognition model are obtained.
In the embodiment of the application, the obtained target speech and the hotword probability distribution of the at least one hotword corresponding to the auxiliary content are input into the first speech recognition model, whose language model is the candidate language model corresponding to the target domain category determined from the auxiliary content; the first speech recognition model outputs the processed candidate recognition results.
The candidate recognition results comprise several speech-recognized text contents corresponding to the target speech, together with the probability distribution information corresponding to each of them.
In one possible implementation, the first speech recognition model uses a one-pass decoder as its speech recognition engine, performing speech recognition on the input target speech with an acoustic model and a language model.
Illustratively, the one-pass decoder receives the target speech, determines and selects the candidate language model to serve as its language model using the domain probability distribution corresponding to the auxiliary content, and combines the acoustic model score output by the acoustic model with the language model score output by the language model to obtain the N-best recognition results, i.e. the candidate recognition results.
For example, when the target speech input into the first speech recognition model is "hello," the one-pass decoder processes the target speech using the acoustic model and the language model, and the generated candidate recognition result may contain N pieces of text content, output in descending order of their recognition scores. If the candidate recognition result contains 3 pieces of recognition content, it is a 3-best result, for example the near-homophone candidates "hello," "ni hao," and "your number," whose candidate recognition probability scores are 0.8, 0.08, and 0.02 respectively.
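The selection and ranking logic of this step can be sketched as follows; the score combination, weights, and model interfaces are illustrative assumptions, not the decoder actually used.

```python
# Sketch of one-pass decoding logic: choose the candidate language model
# from the domain probability distribution, then rank hypotheses by a
# combined acoustic + language model score. Weights are assumptions.
import math

def pick_language_model(domain_probs, candidate_lms):
    """domain_probs: {domain: probability}; candidate_lms: {domain: model}."""
    target_domain = max(domain_probs, key=domain_probs.get)
    return candidate_lms[target_domain]

def n_best(hypotheses, lm, lm_weight=0.5, n=3):
    """hypotheses: [(text, acoustic_log_score)]; lm(text) -> log probability."""
    scored = [(text, am + lm_weight * lm(text)) for text, am in hypotheses]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

def toy_lm(text):
    # Toy language model mirroring the "hello" example above.
    return math.log({"hello": 0.8, "ni hao": 0.08, "your number": 0.02}.get(text, 1e-6))

candidate_lms = {"automotive": "lm_auto", "general": "lm_general"}
print(pick_language_model({"automotive": 0.7, "general": 0.3}, candidate_lms))
# lm_auto

hypotheses = [("hello", -1.0), ("ni hao", -1.2), ("your number", -1.3)]
print(n_best(hypotheses, toy_lm))  # "hello" ranks first
```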
In step 406, the candidate recognition result and the probability distribution information corresponding to the auxiliary content are input into the second language model, and the prediction score corresponding to the candidate recognition result output by the second language model is obtained.
In the embodiment of the application, the candidate recognition results output by the first voice recognition model, together with the probability distribution information output by the domain detection model and the hotword detection model, are input into the second language model, and the prediction scores respectively corresponding to the candidate recognition results are obtained from the second language model.
The probability distribution information corresponding to the auxiliary content comprises domain probability distribution and hotword information of at least one hotword corresponding to the auxiliary content. The domain probability distribution is used for indicating the probability that the auxiliary content corresponds to each domain category; the hotword information includes a probability distribution of the corresponding hotword. The second language model is a neural network model trained based on the auxiliary content samples.
Illustratively, the second language model is at least one of a recurrent neural network (Recurrent Neural Network, RNN) model, a long short-term memory (Long Short-Term Memory, LSTM) recurrent neural network, a gated recurrent unit (Gated Recurrent Unit, GRU) neural network model, and a convolutional neural network model.
In one possible implementation, the second speech recognition model uses a two-pass decoder as the speech recognition engine and performs re-scoring and re-ranking processing on the input candidate recognition results using the second language model.
Illustratively, the two-pass decoder receives the candidate recognition results generated by the one-pass decoder and scores each candidate recognition result through the second language model in combination with the domain probability distribution and the hotword information of at least one hotword corresponding to the auxiliary content. The probability score output for each candidate recognition result combines the acoustic model score of the one-pass decoding, the language model score of the one-pass decoding, and the score of the neural network language model of the two-pass decoding, namely the score of the second language model.
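A hedged sketch of this two-pass rescoring follows; the interpolation weights and the neural language model interface are assumptions for illustration.

```python
# Sketch of two-pass rescoring: each first-pass candidate is rescored by
# a neural language model conditioned on the domain probability
# distribution and the hotword information; the final score combines the
# one-pass acoustic score, the one-pass language model score, and the
# second language model score. Weights and interfaces are assumptions.
def rescore(candidates, neural_lm, domain_probs, hotword_info,
            am_weight=1.0, lm_weight=0.5, nn_weight=0.5):
    """candidates: [(text, acoustic_score, first_pass_lm_score)]."""
    scored = [
        (text,
         am_weight * am
         + lm_weight * lm
         + nn_weight * neural_lm(text, domain_probs, hotword_info))
        for text, am, lm in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in for the second language model: boost candidates that
# contain a detected hotword.
def toy_neural_lm(text, domain_probs, hotword_info):
    return sum(p for word, p in hotword_info.items() if word in text)

candidates = [("hello", -1.0, -0.5), ("ni hao", -1.2, -0.9)]
print(rescore(candidates, toy_neural_lm,
              {"general": 0.9}, {"hello": 0.8}))
# "hello" is ranked first after rescoring
```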
In addition, before inputting the candidate recognition result into the second language model, the server may receive the trained second language model in advance.
In one possible implementation manner, the model training process of the second language model is as follows: acquiring the probability distribution information corresponding to the auxiliary content sample; determining the language model used in the first speech recognition model based on the domain probability distribution corresponding to the auxiliary content sample; inputting the hotword information of at least one hotword corresponding to the auxiliary content sample and the voice sample corresponding to the auxiliary content sample into the first voice recognition model to obtain a candidate recognition result sample output by the first voice recognition model; inputting the probability distribution information corresponding to the auxiliary content sample and the candidate recognition result sample into the second language model to obtain a predicted voice recognition result output by the second language model; and updating the parameters of the first voice recognition model and the second language model based on the predicted voice recognition result and the real text corresponding to the voice sample.
The training of the second language model is carried out offline, in which it is updated together with the first voice recognition model, the hotword detection model, and the domain detection model.
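A minimal sketch of such a joint offline update, with tiny stand-in modules in place of the four real components, might look as follows; the forward pass and loss are placeholders rather than the actual training pipeline.

```python
# Joint offline update: one optimizer over the parameters of all four
# components so that a single loss updates them together. The modules,
# forward pass, and loss below are illustrative stand-ins.
import itertools
import torch
import torch.nn as nn

first_recognition = nn.Linear(8, 8)  # stand-in: first speech recognition model
second_lm = nn.Linear(8, 8)          # stand-in: second language model
hotword_detector = nn.Linear(8, 8)   # stand-in: hotword detection model
domain_detector = nn.Linear(8, 8)    # stand-in: domain detection model

optimizer = torch.optim.Adam(
    itertools.chain(first_recognition.parameters(), second_lm.parameters(),
                    hotword_detector.parameters(), domain_detector.parameters()),
    lr=1e-4)

x = torch.randn(2, 8)       # stand-in for a voice / auxiliary-content sample
target = torch.randn(2, 8)  # stand-in for the real text of the voice sample
out = second_lm(first_recognition(x) + hotword_detector(x) + domain_detector(x))
loss = nn.functional.mse_loss(out, target)

optimizer.zero_grad()
loss.backward()   # gradients reach all four stand-in components
optimizer.step()  # so their parameters are updated together, offline
```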
Because the neural network language model of the two-pass decoding, namely the second language model, introduces the domain and hotword characteristics during offline training, the probability it assigns can be dynamically adjusted according to the domain probability distribution and the hotword information obtained from the auxiliary content, so that the language model adapts itself to the specified environment.
In step 407, the target recognition content is determined based on the prediction score.
In the embodiment of the application, the target recognition content is selected from the candidate recognition results based on the prediction scores corresponding to the candidate recognition results output by the second language model.
In one possible implementation, the candidate recognition content with the highest prediction score is determined as the target recognition content.
For example, if the prediction scores corresponding to "hello," "ni hao," and "your number" in the candidate recognition result are 0.9, 0.02, and 0.01 respectively, then "hello" may be determined as the target recognition content.
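In code, this selection is a simple argmax over the prediction scores; the values below mirror the example above.

```python
# Choose the candidate with the highest prediction score as the target
# recognition content (scores mirror the example above).
prediction_scores = {"hello": 0.9, "ni hao": 0.02, "your number": 0.01}
target_recognition_content = max(prediction_scores, key=prediction_scores.get)
print(target_recognition_content)  # hello
```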
In one possible implementation, when the specified environment is a conference environment, the server transmits the target recognition content to the terminal to be presented by the terminal, and stores the target recognition content as the speech recognition result corresponding to the historical speech. When the specified environment is a video playing environment, the server sends the target recognition content to the terminal playing the video, where it is displayed in a certain area of the video picture as the real-time subtitle corresponding to the video, and stores the target recognition content as the speech recognition result corresponding to the historical speech. When the specified environment is a smart home environment, the server controls the smart devices in the smart home system to execute the instruction corresponding to the target recognition content, and stores the target recognition content as the speech recognition result corresponding to the historical speech.
By way of example, the embodiment of the present application may be applied to a conference scenario. Fig. 5 is a schematic diagram of speech recognition in a conference system according to the embodiment of the present application. As shown in fig. 5, in step 51, real-time speech of a person speaking in the conference is collected as the target speech. In step 52, the text content and picture content of presentation documents such as PowerPoint (PPT) files uploaded during the conference, together with the text content obtained by previously recognized speech, are input as the auxiliary content of the conference into the domain detection model and the hotword detection model; the domain detection model outputs the probability distribution corresponding to each domain, and the hotword detection model outputs the detected hotword list and the corresponding hotword probability distribution. In step 53, the related words of each hotword in the hotword list are determined from the word network. In step 54, the language model used in the speech recognition model corresponding to the one-pass decoder is selected from the multi-domain statistical language models based on the probability distribution corresponding to each domain. In step 55, the collected real-time voice of the user is input as the target voice into the speech recognition model corresponding to the one-pass decoder, which outputs the N-best candidate text contents. In step 56, each candidate text content, the hotword information, and the domain probability distribution are input into the speech recognition model corresponding to the two-pass decoder, which is a neural network model trained and adjusted based on the domain probability distribution samples and hotword probability distribution samples corresponding to the voice samples; the speech recognition model corresponding to the two-pass decoder outputs the optimal speech recognition result, which may be sent to the terminal on the conference participants' side for display, and may also serve, as text content obtained by historical speech recognition, as part of the next auxiliary content.
In summary, in the scheme shown in the embodiment of the present application, the language model in the first speech recognition model is determined in real time based on at least one of the speech recognition result of the historical speech in the specified environment and the file content of the displayed target file, so that the first speech recognition model for performing speech recognition on the target speech can be adaptively adjusted in the specified environment. This avoids the situation in which part of the speech cannot be clearly recognized when speech recognition is performed with a fixed speech recognition model trained in advance, thereby improving the accuracy of speech recognition.
Fig. 6 is a schematic diagram of a voice recognition system in a conference scenario, according to an exemplary embodiment. As shown in fig. 6, a conference environment 60 includes conference participants and a conference presentation document 61 prepared by the conference participants, and further includes a terminal 62 for collecting the real-time voices of the conference participants and receiving the presentation document uploaded by them. The collected real-time voice of the conference participants serves as the target voice, and the conference presentation document serves as one part of the auxiliary content. The terminal 62 uploads the target voice and the presentation document to a server 63. Because the auxiliary content further includes the speech recognition results of historical voices, these are obtained from a storage device in the server 63 as the other part of the auxiliary content. The two parts of auxiliary content are input in turn into the domain detection model and the hotword detection model. The storage device stores a pre-trained language model corresponding to each domain; based on the domain probability distribution output by the domain detection model, the target language model is determined from these and used as the language model in the first speech recognition model. The hotword detection model outputs the detected hotword list and the corresponding hotword probability distribution, and the related words of each hotword and their probability distributions are obtained from the word network. The target voice and the hotword information are then input into the first speech recognition model, which outputs the N-best recognition texts, i.e. the candidate recognition results. The candidate recognition results, the domain probability distribution output by the domain detection model, and the hotword information from the hotword detection model are then input into the second language model, which performs neural network language model scoring and outputs the probability score corresponding to each candidate recognition result. The candidate recognition results are re-ordered based on these probability scores in combination with the acoustic model score corresponding to the one-pass decoder and the language model score corresponding to the target domain language model, yielding the final voice recognition result. The finally obtained voice recognition result is stored in the storage device as the speech recognition result of a historical voice, and is simultaneously transmitted to the terminal 62 for display. The conference participants may obtain the voice recognition result via the terminal 62.
In summary, in the scheme shown in the embodiment of the present application, the language model in the first speech recognition model is determined in real time based on at least one of the speech recognition result of the historical speech in the specified environment and the file content of the displayed target file, so that the first speech recognition model for performing speech recognition on the target speech can be adaptively adjusted in the specified environment. This avoids the situation in which part of the speech cannot be clearly recognized when speech recognition is performed with a fixed speech recognition model trained in advance, thereby improving the accuracy of speech recognition.
Fig. 7 is a block diagram of a voice recognition apparatus according to an exemplary embodiment. As shown in fig. 7, the voice recognition apparatus may be implemented, by hardware or by a combination of hardware and software, as all or part of a computer device, so as to perform all or part of the steps of the method shown in the embodiment corresponding to fig. 1 or fig. 4. The voice recognition apparatus may include:
a voice acquisition module 710 for acquiring a target voice, which is a real-time voice collected in a specified environment;
a model determination module 720 for determining a first speech recognition model based on auxiliary content; the auxiliary content comprises at least one of a voice recognition result of historical voice collected in the specified environment and the file content of a target file displayed in the specified environment;
A candidate obtaining module 730, configured to perform decoding processing on the target speech based on the first speech recognition model, to obtain a candidate recognition result output by the first speech recognition model;
and a result obtaining module 740, configured to perform probability prediction processing on the candidate recognition result, and obtain a speech recognition result of the target speech.
In one possible implementation, the language model used in the first speech recognition model is determined from at least two candidate language models, the at least two candidate language models respectively corresponding to respective domain categories;
the model determination module 720 includes:
the domain determining submodule is used for determining the target domain type based on the auxiliary content before the target voice is processed based on the first voice recognition model to obtain the candidate recognition result output by the first voice recognition model;
and the language model determining submodule is used for determining, among the at least two candidate language models, the candidate language model corresponding to the target domain type as the language model used in the first voice recognition model.
In one possible implementation, the domain determining submodule includes:
The domain probability acquisition unit is used for inputting the auxiliary content into a domain detection model to acquire the domain probability distribution; the domain probability distribution is used for indicating the probability that the auxiliary content corresponds to each domain category; the domain detection model is obtained by training based on an auxiliary content sample and a domain type corresponding to the auxiliary content sample;
and the domain determining unit is used for determining the target domain type based on the domain probability distribution.
In one possible implementation, the apparatus further includes:
the prediction probability acquisition module is used for inputting the auxiliary content sample into the domain detection model before acquiring the target voice to acquire a prediction domain probability distribution;
and the first model updating module is used for updating parameters of the domain detection model based on the prediction domain probability distribution and the domain type corresponding to the auxiliary content sample.
In one possible implementation, the apparatus further includes:
the hotword information acquisition module is used for acquiring the hotword information of at least one hotword corresponding to the auxiliary content before the target voice is processed based on the first voice recognition model to obtain the candidate recognition result output by the first voice recognition model; the hotword information comprises the probability distribution of the hotword;
The candidate acquisition module 730 includes:
and the candidate acquisition sub-module is used for inputting the target voice and the hot word information of at least one hot word corresponding to the auxiliary content into the first voice recognition model for decoding processing to obtain the candidate recognition result output by the first voice recognition model.
In one possible implementation manner, the hotword information acquisition module includes:
the first information extraction sub-module is used for extracting first hotword information corresponding to the first hotword contained in the auxiliary content;
the second information extraction sub-module is used for determining, based on the first hotword, second hotword information corresponding to the related words of the first hotword from a word network; the word network is a graph data structure which takes words as vertices and the relations among words as edges;
and the information merging sub-module is used for merging the first hotword information and the second hotword information into hotword information of at least one hotword corresponding to the auxiliary content.
In one possible implementation manner, the first information extraction sub-module includes:
the first information acquisition unit is used for inputting the auxiliary content into a hotword detection model to acquire first hotword information corresponding to the first hotword and output by the hotword detection model; the hotword detection model is obtained based on auxiliary content samples and hotwords contained in the auxiliary content samples.
In one possible implementation, the apparatus further includes:
the prediction information acquisition module is used for inputting the auxiliary content sample into the hotword detection model before acquiring the target voice to obtain hotword information of a predicted hotword output by the hotword detection model;
and the second model updating module is used for updating parameters of the hotword detection model based on the hotword information of the predicted hotword and the hotword in the auxiliary content sample.
In one possible implementation, the result obtaining module 740 includes:
the score acquisition sub-module is used for inputting the candidate recognition result and the probability distribution information corresponding to the auxiliary content into a second language model to obtain the prediction score corresponding to the candidate recognition result output by the second language model;
and the content determination submodule is used for determining the target recognition content based on the prediction score.
In one possible implementation manner, the probability distribution information corresponding to the auxiliary content includes:
domain probability distribution, and hotword information of at least one hotword corresponding to the auxiliary content;
the domain probability distribution is used for indicating the probability that the auxiliary content corresponds to each domain category; the hotword information includes a probability distribution of the corresponding hotword.
In one possible implementation, the apparatus further includes:
the probability acquisition module is used for acquiring probability distribution information corresponding to the auxiliary content sample before acquiring the target voice;
the model determining module is used for determining a language model used in the first voice recognition model based on the domain probability distribution corresponding to the auxiliary content sample;
the candidate sample acquisition module is used for inputting the hot word information of at least one hot word corresponding to the auxiliary content sample and the voice sample corresponding to the auxiliary content sample into the first voice recognition model to obtain a candidate recognition result sample output by the first voice recognition model;
the prediction result acquisition module is used for inputting the probability distribution information corresponding to the auxiliary content sample and the candidate recognition result sample into the second language model to obtain the predicted voice recognition result output by the second language model;
and the updating module is used for updating the parameters of the first voice recognition model and the second language model based on the predicted voice recognition result and the real text corresponding to the voice sample.
In one possible implementation, the specified environment is at least one of a conference environment, a video playing environment, and a smart home environment; the target file is a file used in the specified environment.
In summary, in the scheme shown in the embodiment of the present application, the language model in the first speech recognition model is determined in real time based on at least one of the speech recognition result of the historical speech in the specified environment and the file content of the displayed target file, so that the first speech recognition model for performing speech recognition on the target speech can be adaptively adjusted in the specified environment. This avoids the situation in which part of the speech cannot be clearly recognized when speech recognition is performed with a fixed speech recognition model trained in advance, thereby improving the accuracy of speech recognition.
Fig. 8 is a schematic diagram of a computer device, according to an example embodiment. The computer apparatus 800 includes a central processing unit (Central Processing Unit, CPU) 801, a system Memory 804 including a random access Memory (Random Access Memory, RAM) 802 and a Read-Only Memory (ROM) 803, and a system bus 805 connecting the system Memory 804 and the central processing unit 801. The computer device 800 also includes a basic Input/Output system (I/O) 806 for facilitating the transfer of information between the various devices within the computer device, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse, keyboard, or the like, for user input of information. Wherein the display 808 and the input device 809 are connected to the central processing unit 801 via an input output controller 810 connected to the system bus 805. The basic input/output system 806 can also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer device readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer device readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
The computer device readable medium may include computer device storage media and communication media without loss of generality. Computer device storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer device readable instructions, data structures, program modules or other data. Computer device storage media includes RAM, ROM, erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), CD-ROM, digital video disk (Digital Video Disc, DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer device storage medium is not limited to the ones described above. The system memory 804 and mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the present disclosure, the computer device 800 may also be run by connecting to remote computer devices on a network, such as the Internet. That is, the computer device 800 may be connected to a network 812 through a network interface unit 811 connected to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer device systems (not shown).
The memory further comprises one or more programs stored in the memory, by which the central processing unit 801 implements all or part of the steps of the method shown in fig. 1 or 4.
Fig. 9 is a block diagram of a computer device 900, shown in accordance with an exemplary embodiment. The computer device 900 may be a terminal in the speech recognition system shown in fig. 1.
In general, the computer device 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 901 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the methods provided by the method embodiments herein.
In some embodiments, the computer device 900 may also optionally include: a peripheral interface 903, and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 903 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 904, a display 905, a camera assembly 906, audio circuitry 907, and a power source 909.
The peripheral interface 903 may be used to connect at least one peripheral device associated with an I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The radio frequency circuit 904 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 904 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display 905 is used to display a UI (User Interface). The camera assembly 906 is used to capture images or video.
The audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be multiple, each disposed at a different location of the computer device 900. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 907 may also include a headphone jack.
In some embodiments, computer device 900 also includes one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, optical sensor 915, and proximity sensor 916.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is not limiting of the computer device 900, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory including at least one instruction, at least one program, code set, or instruction set executable by a processor to perform all or part of the steps of the methods illustrated in any of the embodiments of fig. 1 or 4 described above. For example, the non-transitory computer readable storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described by the embodiments of the present disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer device-readable medium. Computer device readable media includes both computer device storage media and communication media including any medium that facilitates transfer of a computer device program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer device.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech recognition method provided in the various alternative implementations of the above aspects.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A method of speech recognition, the method comprising:
acquiring target voice, wherein the target voice is real-time voice acquired in a specified environment;
determining a target domain category based on the auxiliary content;
determining, among at least two candidate language models, the candidate language model corresponding to the target domain category as the language model used in a first voice recognition model; the at least two candidate language models respectively correspond to respective domain categories; the auxiliary content comprises at least one of a voice recognition result of historical voice collected in the specified environment and the file content of a target file displayed in the specified environment;
acquiring hot word information of at least one hot word corresponding to the auxiliary content; the hotword information comprises probability distribution of the hotword;
inputting the target voice and the hot word information of at least one hot word corresponding to the auxiliary content into the first voice recognition model for decoding processing to obtain a candidate recognition result output by the first voice recognition model;
and carrying out probability prediction processing on the candidate recognition result to obtain a voice recognition result of the target voice.
2. The method of claim 1, wherein the determining a target domain category based on the auxiliary content comprises:
inputting the auxiliary content into a domain detection model to acquire the domain probability distribution; the domain probability distribution is used for indicating the probability that the auxiliary content corresponds to each domain category; the domain detection model is obtained by training based on an auxiliary content sample and a domain type corresponding to the auxiliary content sample;
the target domain category is determined based on the domain probability distribution.
3. The method of claim 2, wherein prior to the obtaining the target speech, further comprising:
inputting the auxiliary content sample into the domain detection model to obtain predicted domain probability distribution;
and updating parameters of the domain detection model based on the prediction domain probability distribution and the domain type corresponding to the auxiliary content sample.
4. The method according to claim 1, wherein the obtaining hotword information of at least one hotword corresponding to the auxiliary content includes:
extracting first hotword information corresponding to the first hotword contained in the auxiliary content;
based on the first hotword, determining second hotword information corresponding to the related word of the first hotword from a word network; the word network is a graph data structure which takes words as vertices and the relations among words as edges;
and merging the first hotword information and the second hotword information into hotword information of at least one hotword corresponding to the auxiliary content.
5. The method according to claim 4, wherein the extracting first hotword information corresponding to a first hotword included in the auxiliary content includes:
inputting the auxiliary content into a hotword detection model to obtain first hotword information corresponding to the first hotword and output by the hotword detection model; the hotword detection model is obtained based on auxiliary content samples and hotwords contained in the auxiliary content samples.
6. The method of claim 5, further comprising, prior to the obtaining the target speech:
inputting the auxiliary content sample into the hotword detection model to obtain hotword information of predicted hotwords output by the hotword detection model;
and updating parameters of the hotword detection model based on the hotword information of the predicted hotword and the hotword in the auxiliary content sample.
7. The method according to claim 1, wherein the performing probability prediction processing on the candidate recognition result to obtain a speech recognition result of the target speech includes:
inputting the candidate recognition result and probability distribution information corresponding to the auxiliary content into a second language model to obtain a prediction score corresponding to the candidate recognition result output by the second language model;
the target identification content is determined based on the predictive score.
8. The method of claim 7, wherein the probability distribution information corresponding to the auxiliary content comprises:
domain probability distribution, and hotword information of at least one hotword corresponding to the auxiliary content;
the domain probability distribution is used for indicating the probability that the auxiliary content corresponds to each domain category; the hotword information includes a probability distribution of the corresponding hotword.
9. The method of claim 8, further comprising, prior to the obtaining the target speech:
acquiring probability distribution information corresponding to the auxiliary content sample;
determining a language model used in the first voice recognition model based on the domain probability distribution corresponding to the auxiliary content sample;
Inputting the hot word information of at least one hot word corresponding to the auxiliary content sample and the voice sample corresponding to the auxiliary content sample into the first voice recognition model to obtain a candidate recognition result sample output by the first voice recognition model;
inputting probability distribution information corresponding to the auxiliary content sample and the candidate recognition result sample into the second language model to obtain a predicted voice recognition result output by the second language model;
and updating parameters of the first voice recognition model and the second language model based on the predicted voice recognition result and the real text corresponding to the voice sample.
10. The method of any one of claims 1 to 9, wherein the specified environment is at least one of a conference environment, a video playback environment, and a smart home environment; the target file is a file used in the specified environment.
11. A speech recognition device, the device comprising:
the voice acquisition module is used for acquiring target voice, wherein the target voice is real-time voice acquired in a specified environment;
a domain determining sub-module for determining a target domain category based on the auxiliary content;
A language model determining sub-module, configured to determine, from at least two candidate language models, the candidate language model corresponding to the target domain category as the language model used in a first speech recognition model; the at least two candidate language models respectively correspond to respective domain categories; the auxiliary content comprises at least one of a voice recognition result of historical voice collected in the specified environment and the file content of a target file displayed in the specified environment;
the hotword information acquisition module is used for acquiring hotword information of at least one hotword corresponding to the auxiliary content; the hotword information comprises probability distribution of the hotword;
the candidate acquisition sub-module is used for inputting the target voice and the hot word information of at least one hot word corresponding to the auxiliary content into the first voice recognition model for decoding processing to obtain a candidate recognition result output by the first voice recognition model;
and the result acquisition module is used for carrying out probability prediction processing on the candidate recognition result to acquire a voice recognition result of the target voice.
12. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by the processor to implement the speech recognition method of any one of claims 1 to 10.
13. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the speech recognition method of any one of claims 1 to 10.
14. A computer program product, the computer program product comprising computer instructions stored in a computer readable storage medium; the computer instructions are read and executed by a processor of a computer device to cause the computer device to implement the speech recognition method according to any one of claims 1 to 10.
CN202110578432.0A 2021-05-26 2021-05-26 Speech recognition method, device, computer equipment and storage medium Active CN113763925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110578432.0A CN113763925B (en) 2021-05-26 2021-05-26 Speech recognition method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113763925A CN113763925A (en) 2021-12-07
CN113763925B true CN113763925B (en) 2024-03-12

Family

ID=78787225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110578432.0A Active CN113763925B (en) 2021-05-26 2021-05-26 Speech recognition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113763925B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188381B (en) * 2022-05-17 2023-10-24 贝壳找房(北京)科技有限公司 Voice recognition result optimization method and device based on click ordering

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN106486126A (en) * 2016-12-19 2017-03-08 北京云知声信息技术有限公司 Speech recognition error correction method and device
CN107578771A (en) * 2017-07-25 2018-01-12 科大讯飞股份有限公司 Audio recognition method and device, storage medium, electronic equipment
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment
CN110797014A (en) * 2018-07-17 2020-02-14 中兴通讯股份有限公司 Voice recognition method and device and computer storage medium
CN112017645A (en) * 2020-08-31 2020-12-01 广州市百果园信息技术有限公司 Voice recognition method and device
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8433576B2 (en) * 2007-01-19 2013-04-30 Microsoft Corporation Automatic reading tutoring with parallel polarized language modeling


Also Published As

Publication number Publication date
CN113763925A (en) 2021-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant