WO2021047103A1 - Voice recognition method and device - Google Patents


Info

Publication number: WO2021047103A1
Authority: WIPO (PCT)
Prior art keywords: audio data, sub, recognition result, speech recognition, time
Application number: PCT/CN2019/127672
Other languages: French (fr), Chinese (zh)
Inventors: 汪俊, 闫博群, 李索恒, 张志齐, 郑达
Original assignee: 上海依图信息技术有限公司
Application filed by 上海依图信息技术有限公司
Publication of WO2021047103A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/26 — Speech to text systems
    • G10L15/28 — Constructional details of speech recognition systems

Definitions

  • the embodiments of the present invention relate to the field of information technology, and in particular, to a voice recognition method and device.
  • Voice recognition technology is a technology that enables machines to convert voice information into corresponding text or commands through the process of recognition and understanding.
  • In speech recognition, the result at the current moment must be determined from both the speech information at the current moment and the context information at the current moment.
  • In the prior art, the calculation time of the speech information at the current moment does not match the calculation time of the context information; as a result, the output of the speech recognition result lags, which cannot meet real-time requirements.
  • the embodiment of the present invention provides a voice recognition method and device, which can match the calculation time of the voice information at the current moment with the calculation time of the context information, and meet the real-time requirements.
  • an embodiment of the present invention provides a speech recognition method, the method is applied to a speech recognition system, the speech recognition system includes at least a first speech recognition model and a second speech recognition model, the first speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the method includes:
  • Acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • For the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, and the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model; i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, and i belongs to n;
  • the text recognition result of the sub-audio data at the i-th moment is determined according to the first recognition result and the second recognition result.
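As a rough sketch of the steps above, the following Python fragment shows how per-moment results from a forward model and a backward model could be merged. All function names and the toy stand-in models are hypothetical illustrations, not the patent's actual models:

```python
# Hypothetical sketch of the claimed per-moment combination. The first
# model's module i sees moments 1..i; the second model's module i sees
# moments i..n; the text result at moment i merges the two.
def recognize(sub_audio, first_model, second_model, combine):
    first_results = first_model(sub_audio)    # one result per moment, forward
    second_results = second_model(sub_audio)  # one result per moment, backward
    return [combine(f, s) for f, s in zip(first_results, second_results)]

# Toy stand-in models so the sketch runs end to end.
fwd = lambda xs: [sum(xs[:i + 1]) for i in range(len(xs))]   # uses moments 1..i
bwd = lambda xs: [sum(xs[i:]) for i in range(len(xs))]       # uses moments i..n
merge = lambda f, s: 0.5 * f + 0.5 * s

print(recognize([1.0, 2.0, 3.0], fwd, bwd, merge))  # [3.5, 4.0, 4.5]
```

The point of the structure is that both models can be run over the same chunk list, so no per-moment result has to wait on the other direction finishing first.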
  • In a possible implementation, for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain the first recognition result and the second recognition result respectively includes:
  • matching the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
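This matching condition can be read as a simple predicate on the two models' calculation times. The helper below is an illustrative sketch only; the timings and threshold are made-up values:

```python
def calculation_times_match(t_first, t_second, threshold):
    """True when the difference between the two models' calculation
    times is smaller than the preset threshold."""
    return abs(t_first - t_second) < threshold

# Hypothetical timings: 0.98 s vs 1.00 s with a 0.05 s threshold match.
print(calculation_times_match(0.98, 1.00, 0.05))  # True
```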
  • the method further includes:
  • for the sub-audio data at the (i+1)-th moment, inputting the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtaining a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the text recognition result of the sub-audio data at the (i+1)-th moment is determined according to the first recognition result and the second recognition result.
  • determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result includes:
  • determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
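One possible reading of this weighted combination, with hypothetical scores and weight values, is:

```python
def weighted_result(first_result, second_result, w_first, w_second):
    """Combine the two models' per-moment scores by their weights.
    The weights (and scores) here are illustrative values only."""
    return (w_first * first_result + w_second * second_result) / (w_first + w_second)

print(weighted_result(0.8, 0.6, w_first=0.7, w_second=0.3))  # ~0.74
```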
  • the embodiment of the present invention also provides a voice recognition device, the device is applied to a voice recognition system, the voice recognition system at least includes a first voice recognition model and a second voice recognition model, the first voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the device includes:
  • the acquiring unit is configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • the calculation unit is configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively,
  • where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, where i belongs to n;
  • the result determining unit is configured to determine the text recognition result of the sub audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the calculation unit is specifically configured to:
  • match the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
  • the calculation unit is further configured to:
  • for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtain a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the result determining unit is further configured to:
  • determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
  • the result determining unit is specifically configured to:
  • determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  • an embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor implements the steps of any one of the above speech recognition methods when the program is executed.
  • an embodiment of the present invention also provides a computer-readable storage medium that stores a computer program executable by a computer device.
  • when the program runs on the computer device, the computer device executes the steps of any one of the above speech recognition methods.
  • An embodiment of the present invention provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above speech recognition methods.
  • the first recognition result of the first speech recognition model is determined based on the sub-audio data from the 1st moment to the i-th moment, so the first speech recognition model can be considered to process the output result at the current moment.
  • the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized, so the second speech recognition model can be considered to process context information. Because in the embodiment of the present invention the calculation dimension of the first speech recognition model is greater than that of the second speech recognition model, by the time the first speech recognition model has calculated the sub-audio data at the i-th moment, the second speech recognition model has also calculated the sub-audio data at the i-th moment. In this way, the calculation times of the first recognition result and the second recognition result are matched, there is no need to wait for one calculation result after the other is available, and the real-time performance of speech recognition is improved.
  • FIG. 1 is an application scenario architecture diagram provided by an embodiment of the present invention.
  • FIG. 2 is an architecture diagram of a speech recognition system provided by an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a voice recognition method provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an application scenario of a voice recognition method provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • Speech recognition technology allows machines to convert speech signals into corresponding text or commands through the process of recognition and understanding. Through speech signal processing and pattern recognition, machines can automatically recognize and understand human spoken language.
  • Speech recognition technology is an interdisciplinary subject that involves a wide range of subjects. It is closely related to subjects such as acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Speech recognition technology usually uses three methods: template matching method, random model method and probabilistic syntax analysis method. Deep learning methods and machine learning methods are also commonly used.
  • Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • HMM: Hidden Markov Model
  • Deep learning learns the internal laws and representation levels of sample data. The information obtained in the learning process is of great help to the interpretation of data such as text, images, and sounds. Its ultimate goal is to enable machines to analyze and learn like humans, and to recognize data such as text, images, and sounds.
  • Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition far beyond previous related technologies.
  • Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technology, and other related fields. Deep learning enables machines to imitate human activities such as audiovisual and thinking, and solves many complex pattern recognition problems, which has made great progress in artificial intelligence-related technologies.
  • a neural network model in a deep learning method can be used for speech recognition; one example is the bidirectional recurrent neural network (BRNN).
  • In a BRNN, each training sequence is presented forward and backward to two separate recurrent neural networks (RNNs), and both are connected to the same output layer.
  • RNN: Recurrent Neural Network
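A minimal pure-Python sketch of this bidirectional structure might look like the following. The fixed weights are toy values, not a trained model:

```python
import math

# Toy bidirectional recurrent pass: one scan runs forward over the
# frames, the other runs backward, and the per-moment states of both
# are paired as input to the output layer. Weights are illustrative.
W, U = 0.5, 0.25  # input weight and recurrent weight (toy values)

def scan(frames):
    h, states = 0.0, []
    for x in frames:
        h = math.tanh(W * x + U * h)
        states.append(h)
    return states

def brnn(frames):
    fwd = scan(frames)              # state at moment i reflects frames 1..i
    bwd = scan(frames[::-1])[::-1]  # state at moment i reflects frames i..n
    return list(zip(fwd, bwd))      # joint input to the output layer

states = brnn([1.0, -2.0, 0.5])
print(len(states))  # 3, one (forward, backward) pair per moment
```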
  • Take the case where the audio data includes sub-audio data at n moments as an example.
  • the sub-audio data is input to the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model, which obtain the first recognition result and the second recognition result respectively.
  • the first recognition result is determined based on the sub-audio data from the first moment to the i-th moment in the audio to be recognized, and the second recognition result is based on The sub-audio data from the i-th moment to the n-th moment in the audio to be recognized is determined.
  • the applicant of this application conceived a voice recognition method in which the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, so that the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, which can effectively improve the real-time performance of speech recognition.
  • the voice recognition method in the embodiment of the present application can be applied to the application scenario shown in FIG. 1, and the application scenario includes the terminal device 101 and the voice server 102.
  • the terminal device 101 and the voice server 102 are connected through a wireless or wired network.
  • the terminal device 101 includes but is not limited to smart devices such as smart speakers, smart watches, and smart home appliances; smart robots; AI customer service and bank credit card reminder phone systems; and electronic devices with voice interaction functions such as smart phones, mobile computers, and tablet computers.
  • the voice server 102 may provide related voice services, such as voice recognition, voice synthesis, and other services.
  • the voice server 102 may be a server, a server cluster composed of several servers, or a cloud computing center.
  • the user 10 interacts with the terminal device 101, and the terminal device 101 sends the voice data input by the user 10 to the voice server 102.
  • the voice server 102 performs voice recognition processing and semantic analysis processing on the voice data sent by the terminal device 101, determines the corresponding voice recognition text according to the semantic analysis result, and sends the voice recognition text to the terminal device 101; the terminal device 101 displays the text or executes the instructions corresponding to it.
  • an embodiment of the present application provides a voice recognition method.
  • the process of the method can be executed by a voice recognition device.
  • the method is applied to a voice recognition system, and the voice recognition system at least includes a first voice recognition model and a second voice recognition model; the first voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the second voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal.
  • a voice recognition system is first introduced by way of example.
  • the voice recognition system includes a first voice recognition model and a second voice recognition model.
  • the first speech recognition model and the second speech recognition model each have n processing modules, and each processing module has an input terminal and an output terminal.
  • the sub-audio data is input into the corresponding processing module of the first speech recognition model, and the sub-audio data is also input into the corresponding processing module of the second speech recognition model for processing.
  • the voice recognition method in the embodiment of the present invention includes:
  • Step S301: Acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1.
  • the audio data to be recognized is composed of sub-audio data at n moments.
  • For example, the audio data to be recognized is a 20-second piece of audio data, and the 20-second audio data can be divided into 20 moments; that is, the audio data in every 1-second interval is regarded as one piece of sub-audio data. Each piece of sub-audio data has a time sequence, so the audio data to be recognized corresponds to 20 pieces of sub-audio data in order.
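The chunking in this example could be sketched as follows; the sample rate and chunk length are illustrative assumptions, not values from the patent:

```python
def split_into_sub_audio(samples, sample_rate, seconds_per_chunk=1):
    """Split raw samples into ordered per-moment sub-audio chunks."""
    step = sample_rate * seconds_per_chunk
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 20 s of silent 16 kHz audio -> 20 ordered 1-second sub-audio chunks.
samples = [0.0] * (20 * 16000)
chunks = split_into_sub_audio(samples, sample_rate=16000)
print(len(chunks), len(chunks[0]))  # 20 16000
```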
  • Step S302: For the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model.
  • the sub-audio data at each moment is input into the corresponding processing module of the first voice recognition model and the corresponding processing module of the second voice recognition model, and the corresponding results are obtained respectively.
  • the processing direction of the first speech recognition model is opposite to that of the second speech recognition model.
  • For example, the first speech recognition model and the second speech recognition model each have 10 processing modules, and the audio data to be recognized is 10 seconds of voice data, so every second of voice data is input to the corresponding processing module of each model; the input of each model's second processing module is the voice data of the 2nd second.
  • the second processing module of the first speech recognition model determines the processing result for the 2nd second of voice data according to the processing result for the 1st second of voice data and the 2nd second of voice data itself.
  • the second processing module of the second voice recognition model determines its processing result for the 2nd second of voice data according to the third processing module's result for the 3rd second of voice data and the 2nd second of voice data itself; the third processing module of the second voice recognition model likewise depends on the fourth processing module's result for the 4th second and the 3rd second of voice data, and so on. The processing result of the second processing module of the second speech recognition model is therefore determined based on the processing results of the tenth through third processing modules of the second speech recognition model and the 2nd second of voice data.
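The backward dependency chain described above can be sketched as a fold from the last module to the first. The `step` function here is a toy stand-in, not the patent's actual processing module:

```python
# Fold from the last module to the first: module i's result depends on
# module (i+1)'s result and moment i's own data, so module 1's result
# reflects every moment from 1 to n.
def second_model_results(chunks, step):
    results, nxt = [None] * len(chunks), None
    for i in reversed(range(len(chunks))):   # module n, n-1, ..., 1
        nxt = step(chunks[i], nxt)
        results[i] = nxt
    return results

# Toy step: count how many moments of context each module has seen.
step = lambda data, nxt: 1 + (nxt or 0)
print(second_model_results(list(range(10)), step))
# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

With 10 one-second chunks, module 1 ends up with context from all 10 seconds, while module 10 has seen only the last second, mirroring the text's example.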
  • the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model; that is, the calculation time of the first speech recognition model is longer, and the calculation time of the second speech recognition model is shorter.
  • when the i-th processing module of the first speech recognition model has calculated its output result, the n-th through i-th processing modules of the second speech recognition model have also calculated their output results, so the recognition result of the audio data to be recognized can be determined in real time.
  • if the difference between the time when the first speech recognition model calculates the first recognition result and the time when the second speech recognition model calculates the second recognition result is less than the preset threshold, it can be considered that the recognition result of the audio data to be recognized can be determined in real time.
  • For example, the text information corresponding to the audio data to be recognized is "I and you are good friends"; each character of the original phrase corresponds to the sub-audio data of one moment ("I", "and", "you", "are", "good", and the two characters of "friend").
  • the sub-audio data at the (i+1)-th moment is input to the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and the second recognition result is obtained; the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result was determined in the process of determining the text recognition result of the sub-audio data at the i-th moment; the text recognition result of the sub-audio data at the (i+1)-th moment is determined according to the first recognition result and the second recognition result.
  • the second speech recognition model has already obtained the recognition results from the n-th moment to the i-th moment, so the overall recognition result can be determined as soon as the recognition result of the first speech recognition model is available.
  • In some embodiments, the calculation dimensions of the processing modules in the first speech recognition model differ: the calculation dimensions of the (i+1)-th through n-th processing modules are smaller than those of the first through i-th processing modules. Reducing the calculation dimension of these later processing modules can speed up the calculation of the first speech recognition model and improve real-time performance.
  • the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model can be understood as the parameter quantity of each model, or as the size of the calculation matrix with which each model participates in the calculation.
  • If the calculation dimension refers to the parameter quantity, the parameter quantity of the first speech recognition model is greater than the parameter quantity of the second speech recognition model; for example, the parameter quantity of the first speech recognition model is 1000 and the parameter quantity of the second speech recognition model is 500.
  • If the calculation dimension refers to the size of the calculation matrix, for example, the calculation matrix of the first speech recognition model is 1000*1000 and the calculation matrix of the second speech recognition model is 500*500, so the calculation dimension of the first speech recognition model is greater than that of the second speech recognition model.
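Under the matrix-size interpretation, the gap between the two example dimensions can be made concrete with a rough multiply count. This is schematic only, counting one multiplication per matrix entry in a matrix-vector product:

```python
# One matrix-vector product costs roughly dim*dim multiplications, so a
# 1000x1000 calculation matrix does 4x the work of a 500x500 one.
def matvec_mults(dim):
    return dim * dim

first_model = matvec_mults(1000)    # 1,000,000 multiplications
second_model = matvec_mults(500)    # 250,000 multiplications
print(first_model // second_model)  # 4
```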
  • Step S303: Determine the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the text recognition result of the sub-audio data is determined according to the weight of the first recognition result and the weight of the second recognition result; the two weights can be the same or different, and can be set according to recognition accuracy requirements or scenario requirements.
  • For example, the voice recognition method is applied to a conference scenario; in the conference scenario, the speech of the participants needs to be recorded and displayed on a screen.
  • the BRNN model is used for speech recognition.
  • the BRNN model includes two recognition models, namely a first recognition model and a second recognition model.
  • the first recognition model includes N processing modules, and the second recognition model includes N processing modules; the speech content of the participants is determined by the processing results of each processing module of the first recognition model and the processing results of each processing module of the second recognition model.
  • the first recognition model in the BRNN processes in the order: first processing module, second processing module, third processing module, ..., N-th processing module; the second recognition model in the BRNN processes in the order: N-th processing module, (N-1)-th processing module, ..., first processing module.
  • the calculation dimension of the first recognition model is greater than the calculation dimension of the second recognition model.
  • the speech content of each participant is collected through the microphone of the audio collection device, and then the speech content is input into the BRNN model to obtain the recognition result, and the recognition result is displayed on the display screen.
  • an embodiment of the present invention provides a voice recognition device 500; the device 500 is applied to a voice recognition system, and the voice recognition system includes at least a first voice recognition model and a second voice recognition model; the first speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal. The device 500 includes:
  • the acquiring unit 501 is configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • the calculation unit 502 is configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively,
  • where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, where i belongs to n;
  • the result determining unit 503 is configured to determine the text recognition result of the sub audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the calculation unit 502 is specifically configured to:
  • match the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
  • the calculation unit 502 is further configured to:
  • for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtain a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the result determining unit is further configured to:
  • determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
  • the result determining unit 503 is specifically configured to:
  • determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  • an embodiment of the present application provides a computer device. As shown in FIG. 6, it includes at least one processor 601 and a memory 602 connected to the at least one processor.
  • the embodiment of the present application does not limit the specific connection medium between the processor 601 and the memory 602; the bus connection between the processor 601 and the memory 602 in FIG. 6 is taken as an example.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the memory 602 stores instructions that can be executed by at least one processor 601, and the at least one processor 601 can execute the steps included in the aforementioned voice recognition method by executing the instructions stored in the memory 602.
  • the processor 601 is the control center of the computer device; it can use various interfaces and lines to connect the various parts of the terminal device, and obtain the client address by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602.
  • the processor 601 may include one or more processing units, and the processor 601 may integrate an application processor and a modem processor.
  • the application processor mainly handles the operating system, user interface, and application programs.
  • the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 601.
  • in some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
  • the processor 601 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 602, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the memory 602 may include at least one type of storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory, a random access memory (RAM), a static random access memory (SRAM), a programmable read-only memory (PROM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disc.
  • the memory 602 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 602 in the embodiments of the present application may also be a circuit or any other device capable of realizing a storage function, for storing program instructions and/or data.
  • the embodiments of the present application provide a computer-readable storage medium that stores a computer program executable by a computer device.
  • when the program runs on the computer device, the computer device executes the steps of the aforementioned voice recognition method.
  • the embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the methods described above.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device.
  • the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A voice recognition method and device, relating to the field of information technology. Said method comprises: acquiring audio data to be recognized, the audio data to be recognized consisting of sub-audio data at n moments, n being greater than or equal to 1 (301); for the sub-audio data at an ith moment, inputting the sub-audio data into an ith processing module in a first voice recognition model and an ith processing module in a second voice recognition model, so as to obtain a first recognition result and a second recognition result respectively, a computing time of the first voice recognition model matching a computing time of the second voice recognition model, and the computing dimension of the first voice recognition model being greater than the computing dimension of the second voice recognition model (302); and determining a text recognition result of the sub-audio data at the ith moment according to the first recognition result and the second recognition result (303). The present invention improves the real-time performance of voice recognition.

Description

Voice recognition method and device
Cross-reference to related applications
This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on September 12, 2019, with application number 201910865885.4 and the application name "A voice recognition method and device", the entire content of which is incorporated into this application by reference.
Technical field
The embodiments of the present invention relate to the field of information technology, and in particular, to a voice recognition method and device.
Background
With the development of communication technology and the popularization of smart terminals, various network communication tools have become one of the main tools of public communication. Because voice information is convenient to produce and transmit, voice has become the main form of information carried by these tools. Using such tools also involves converting voice information into text; this process is voice recognition technology.
Voice recognition technology enables machines to convert voice information into corresponding text or commands through processes of recognition and understanding. When deep learning methods are used for voice recognition, the recognition result must be determined from both the voice information at the current moment and the context information of the current moment. However, because the computation time for the voice information at the current moment does not match the computation time for the context information, the output of voice recognition results in the prior art lags and cannot meet real-time requirements.
Summary of the invention
The embodiments of the present invention provide a voice recognition method and device, which can match the computation time of the voice information at the current moment with the computation time of the context information, meeting real-time requirements.
In one aspect, an embodiment of the present invention provides a voice recognition method. The method is applied to a voice recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal. The method includes:
acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, n being greater than or equal to 1;
for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined from the sub-audio data from the 1st moment to the i-th moment of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized; each processing module of the first speech recognition model corresponds to the sub-audio data of one moment, as does each processing module of the second speech recognition model; the computation time of the first speech recognition model matches the computation time of the second speech recognition model, and the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
determining the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
Optionally, for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain the first recognition result and the second recognition result respectively includes:
inputting the sub-audio data at the 1st moment into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st moment; using the first recognition result of the sub-audio data at the 1st moment and the sub-audio data at the 2nd moment as the input of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd moment; using the first recognition result of the sub-audio data at the 2nd moment and the sub-audio data at the 3rd moment as the input of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd moment; and so on, until the first recognition result of the sub-audio data at the i-th moment is obtained;
inputting the sub-audio data at the n-th moment into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th moment; using the second recognition result of the sub-audio data at the n-th moment and the sub-audio data at the (n-1)-th moment as the input of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th moment; using the second recognition result of the sub-audio data at the (n-1)-th moment and the sub-audio data at the (n-2)-th moment as the input of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th moment; and so on, until the second recognition result of the sub-audio data at the i-th moment is obtained.
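The chained forward and backward passes just described can be sketched as follows. This is a minimal illustration only: the function names are assumptions, and each module's processing is stubbed out as string concatenation so the data flow is visible; it is not the patent's actual implementation.

```python
# Minimal sketch of the two chained passes (illustrative only).

def forward_pass(frames, i):
    """First model: module t feeds module t+1, so the result at step i
    depends on the sub-audio frames from moment 1 to moment i."""
    state = ""
    for t in range(i):                # frames[0] .. frames[i-1]
        state = state + frames[t]
    return state

def backward_pass(frames, i):
    """Second model: module t feeds module t-1, so the result at step i
    depends on the sub-audio frames from moment i to moment n."""
    state = ""
    for t in range(len(frames) - 1, i - 2, -1):  # frames[n-1] .. frames[i-1]
        state = frames[t] + state
    return state

frames = ["a", "b", "c", "d", "e"]    # n = 5 sub-audio frames
print(forward_pass(frames, 3))        # → abc  (moments 1..3)
print(backward_pass(frames, 3))       # → cde  (moments 3..5)
```

Together, the two results give step i both its past context (the first recognition result) and its future context (the second recognition result), mirroring the two chains above.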
Optionally, the computation time of the first speech recognition model matching the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result being smaller than a preset threshold.
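The threshold condition can be expressed directly. The function name, the default threshold, and the timing values below are invented for illustration; the patent only specifies that the difference must be below a preset threshold.

```python
# Hedged sketch of the "matched computation time" test: the two models'
# times to produce their results count as matched when the difference is
# below a preset threshold (all values here are illustrative).

def times_match(t_first, t_second, threshold=0.05):
    """True when |t_first - t_second| is smaller than the preset threshold."""
    return abs(t_first - t_second) < threshold

print(times_match(0.120, 0.145))  # difference 0.025 s < 0.05 s → True
```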
Optionally, the method further includes:
for the sub-audio data at the (i+1)-th moment, inputting the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and acquiring the second recognition result, where the first recognition result is determined from the sub-audio data from the 1st moment to the (i+1)-th moment of the audio to be recognized, and the second recognition result was determined during the process of determining the text recognition result of the sub-audio data at the i-th moment;
determining the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
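One way to read this step is that the backward pass has already produced the second recognition result for moment i+1 by the time moment i is finished, so only the forward module runs anew. The cache-based sketch below is an assumption about how that reuse could look, not the patent's stated implementation; all names are hypothetical.

```python
# Sketch (assumed) of the streaming step: the backward pass over the whole
# utterance fills a cache, so at step i+1 only the forward model runs a new
# module and the second recognition result is a simple cache lookup.

backward_cache = {}   # moment -> second recognition result

def backward_pass_all(frames):
    """Run the second model once, caching its result for every moment."""
    state = ""
    for t in range(len(frames) - 1, -1, -1):
        state = frames[t] + state      # module t+1 feeds module t
        backward_cache[t + 1] = state  # second result for moment t+1

def step(frames, i, fwd_state):
    """Forward module i plus a cache hit for the backward result."""
    fwd_state = fwd_state + frames[i - 1]   # first model, module i
    return fwd_state, backward_cache[i]     # no recomputation needed

frames = ["a", "b", "c"]
backward_pass_all(frames)
state = ""
for i in range(1, len(frames) + 1):
    state, second = step(frames, i, state)
print(state, second)  # → abc c
```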
Optionally, determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result includes:
determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
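The weighted determination above can be sketched as a per-token blend of the two models' scores. The weights, the token vocabulary, and the score values here are invented for illustration; the patent does not specify how the weights are chosen.

```python
# Illustrative sketch of the weighted combination: scores from the first
# (past-context) and second (future-context) models are blended by fixed
# weights and the best-scoring token becomes the text result for the frame.

def fuse_results(first_probs, second_probs, w_first=0.6, w_second=0.4):
    """Weight and sum the two models' per-token scores, then pick the best."""
    fused = {tok: w_first * p + w_second * second_probs[tok]
             for tok, p in first_probs.items()}
    return max(fused, key=fused.get)

first = {"cat": 0.7, "cap": 0.2, "car": 0.1}   # first recognition result
second = {"cat": 0.4, "cap": 0.5, "car": 0.1}  # second recognition result
print(fuse_results(first, second))  # → cat
```

With different weights the outcome can flip toward the future-context model, which is the point of keeping both weights tunable.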
In one aspect, an embodiment of the present invention further provides a voice recognition device. The device is applied to a voice recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal. The device includes:
an acquiring unit, configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, n being greater than or equal to 1;
a calculation unit, configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined from the sub-audio data from the 1st moment to the i-th moment of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized; each processing module of the first speech recognition model corresponds to the sub-audio data of one moment, as does each processing module of the second speech recognition model; the computation time of the first speech recognition model matches the computation time of the second speech recognition model, and the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
a result determining unit, configured to determine the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
Optionally, the calculation unit is specifically configured to:
input the sub-audio data at the 1st moment into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st moment; use the first recognition result of the sub-audio data at the 1st moment and the sub-audio data at the 2nd moment as the input of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd moment; use the first recognition result of the sub-audio data at the 2nd moment and the sub-audio data at the 3rd moment as the input of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd moment; and so on, until the first recognition result of the sub-audio data at the i-th moment is obtained;
input the sub-audio data at the n-th moment into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th moment; use the second recognition result of the sub-audio data at the n-th moment and the sub-audio data at the (n-1)-th moment as the input of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th moment; use the second recognition result of the sub-audio data at the (n-1)-th moment and the sub-audio data at the (n-2)-th moment as the input of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th moment; and so on, until the second recognition result of the sub-audio data at the i-th moment is obtained.
Optionally, the computation time of the first speech recognition model matching the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result being smaller than a preset threshold.
Optionally, the calculation unit is further configured to:
for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and acquire the second recognition result, where the first recognition result is determined from the sub-audio data from the 1st moment to the (i+1)-th moment of the audio to be recognized, and the second recognition result was determined during the process of determining the text recognition result of the sub-audio data at the i-th moment;
The result determining unit is further configured to:
determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
Optionally, the result determining unit is specifically configured to:
determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
In one aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the above voice recognition methods when executing the program.
In one aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, the computer device is caused to execute the steps of any of the above voice recognition methods.
An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the above voice recognition methods.
In the embodiments of the present invention, the first recognition result of the first speech recognition model is determined from the sub-audio data from the 1st moment to the i-th moment, so the first speech recognition model can be regarded as processing the output of the current moment; the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized, so the second speech recognition model can be regarded as processing context information. Because the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, by the time the first speech recognition model has computed up to the sub-audio data at the i-th moment, the second speech recognition model has also computed down to the sub-audio data at the i-th moment. In this way, the computation times of the first recognition result and the second recognition result are matched, there is no need to wait for one result after the other has been computed, and the real-time performance of voice recognition is improved.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Figure 1 is an application scenario architecture diagram provided by an embodiment of the present invention;
Figure 2 is an architecture diagram of a speech recognition system provided by an embodiment of the present invention;
Figure 3 is a schematic flowchart of a voice recognition method provided by an embodiment of the present invention;
Figure 4 is a schematic diagram of an application scenario of a voice recognition method provided by an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a voice recognition device provided by an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description
In order to make the objectives, technical solutions, and beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it.
To facilitate the understanding of the embodiments of the present invention, a few concepts are briefly introduced below:
Voice recognition technology lets machines convert voice signals into corresponding text or commands through processes of recognition and understanding, using voice signal processing and pattern recognition so that machines automatically recognize and understand human spoken language. Voice recognition is a broad interdisciplinary field, closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Three methods are commonly used in voice recognition: template matching, stochastic models, and probabilistic grammar analysis; deep learning and machine learning methods are also commonly used.
Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in how computers simulate or realize human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. As an example, an HMM (Hidden Markov Model) may be used for voice recognition.
Deep learning learns the internal laws and representation levels of sample data; the information obtained in this learning process greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines human-like analytical learning ability, able to recognize data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed previous related technologies.
Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technology, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing, and thinking, solving many complex pattern recognition problems and bringing great progress to artificial-intelligence-related technologies. As an example, a neural network model from deep learning can be used for voice recognition.
BRNN (bidirectional recurrent neural network) is a deep learning method in which each training sequence is processed both forward and backward by two separate recurrent neural networks (RNNs), both of which are connected to the same output layer. This structure provides the output layer with complete past and future context information for every point in the input sequence.
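The bidirectional recurrence described above can be sketched as follows. This is a minimal illustrative toy, not the model of the embodiments: it uses one-dimensional tanh cells with fixed weights (`w_x`, `w_h`) and a simple sum as the "output layer", all of which are assumptions chosen only to show how each output combines full past context (forward pass) with full future context (backward pass).

```python
import math

def rnn_pass(xs, w_x, w_h, reverse=False):
    """A toy one-dimensional tanh RNN, run forward or backward over the sequence."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, hs = 0.0, [0.0] * len(xs)
    for t in order:
        h = math.tanh(w_x * xs[t] + w_h * h)
        hs[t] = h
    return hs

def brnn_outputs(xs):
    """Each output sees the full past (forward pass) and full future (backward pass)."""
    fwd = rnn_pass(xs, w_x=0.5, w_h=0.8, reverse=False)
    bwd = rnn_pass(xs, w_x=0.5, w_h=0.8, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]  # "output layer": here a simple sum

outs = brnn_outputs([1.0, -1.0, 0.5, 0.2])
print(len(outs))  # one output per input time step
```

Note that the output at the first time step already depends on the last input, which is exactly why a naive BRNN cannot emit results until the backward pass has consumed the whole sequence.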
In practice, the applicant of this application found that in speech recognition there is usually context information, but the data processed when handling context differs from the data processed in real time. Take audio data to be recognized that consists of sub-audio data at n times as an example. For the sub-audio data at the i-th time, the sub-audio data is input into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model, yielding a first recognition result and a second recognition result respectively. The first recognition result is determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th time to the n-th time. When i is among the first few times of n, the computation time of the first speech recognition model is short while that of the second speech recognition model is long, so the first model has already determined its result while the second model has not, which cannot meet the real-time requirement. Likewise, when i is among the last few times of n, the computation time of the second model is short while that of the first model is long, so the second model has already determined its result while the first model has not, which again cannot meet the real-time requirement.
Based on the above shortcomings of the prior art, the applicant of this application conceived a speech recognition method in which the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, so that the computation time of the first speech recognition model matches the computation time of the second speech recognition model, effectively improving the real-time performance of speech recognition.
The speech recognition method in the embodiments of the present application can be applied to the application scenario shown in FIG. 1, which includes a terminal device 101 and a voice server 102. The terminal device 101 and the voice server 102 are connected through a wireless or wired network. The terminal device 101 includes, but is not limited to, smart devices such as smart speakers, smart watches, and smart home appliances; smart robots, AI customer service, and bank credit-card reminder telephone systems; and electronic devices with voice interaction functions such as smartphones, mobile computers, and tablet computers. The voice server 102 may provide related voice services, such as speech recognition and speech synthesis; it may be a single server, a server cluster composed of several servers, or a cloud computing center.
In one possible application scenario, the user 10 interacts with the terminal device 101, and the terminal device 101 sends the voice data input by the user 10 to the voice server 102. The voice server 102 performs speech recognition and semantic parsing on the voice data sent by the terminal device 101, determines the corresponding recognized text according to the semantic parsing result, and sends the recognized text to the terminal device 101, which displays it or executes the instruction corresponding to the recognized text.
It should be noted that the architecture diagram in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not constitute a limitation on them; for other application-scenario architectures and business applications, the technical solutions provided in the embodiments of the present application are equally applicable to similar problems.
Based on the application scenario shown in FIG. 1, an embodiment of the present application provides a speech recognition method whose flow may be executed by a speech recognition apparatus. The method is applied to a speech recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with an input and a corresponding output, and the second speech recognition model likewise has n processing modules, each with an input and a corresponding output. To explain the speech recognition method in the embodiments of the present invention, a speech recognition system is first introduced by way of example. As shown in FIG. 2, the system includes the first speech recognition model and the second speech recognition model, each with n processing modules, each module having an input and an output. For each piece of sub-audio data, the sub-audio data is input into the corresponding processing module of the first speech recognition model and into the corresponding processing module of the second speech recognition model for processing.
The speech recognition method in the embodiment of the present invention, as shown in FIG. 3, includes:
Step S301: Acquire audio data to be recognized, the audio data to be recognized being composed of sub-audio data at n times, where n is greater than or equal to 1.
Specifically, in the embodiment of the present invention, the audio data to be recognized is composed of sub-audio data at n times. For example, if the audio data to be recognized is a 20-second segment of audio, it can be divided into 20 times, with the audio data of each 1-second interval serving as one piece of sub-audio data. The pieces of sub-audio data are in temporal order, so the audio data to be recognized corresponds to 20 sequentially ordered pieces of sub-audio data.
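The splitting described above can be sketched as follows. The function name, the 16 kHz sample rate, and the one-second chunk length are illustrative assumptions; the embodiment only requires that the audio be divided into sub-audio data in time order.

```python
def split_into_sub_audio(samples, sample_rate, chunk_seconds=1.0):
    """Split raw samples into fixed-length sub-audio chunks, preserving time order."""
    chunk = int(sample_rate * chunk_seconds)
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]

# 20 seconds of (dummy) 16 kHz audio -> n = 20 one-second sub-audio chunks.
audio = [0.0] * (16000 * 20)
chunks = split_into_sub_audio(audio, 16000)
print(len(chunks), len(chunks[0]))  # 20 16000
```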
Step S302: For the sub-audio data at the i-th time, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively. The first recognition result is determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th time to the n-th time. Each processing module of the first speech recognition model corresponds to the sub-audio data of one time, and each processing module of the second speech recognition model corresponds to the sub-audio data of one time. The computation time of the first speech recognition model matches the computation time of the second speech recognition model, the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, and i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, i belonging to n.
Specifically, in the embodiment of the present invention, when recognizing the audio data to be recognized, the sub-audio data of each time is input into the corresponding processing module of the first speech recognition model and the corresponding processing module of the second speech recognition model, and the corresponding results are obtained respectively.
The processing direction of the first speech recognition model is opposite to that of the second speech recognition model. Take i as the 2nd time as an example: the first speech recognition model and the second speech recognition model each have 10 processing modules, and the audio data to be recognized is 10 s of voice data, so the voice data of each second is input into the corresponding processing module.
For the second processing module of the first speech recognition model and the second processing module of the second speech recognition model, the input data of both modules is the voice data of the 2nd second. The second processing module of the first speech recognition model determines the processing result for the 2nd second from the processing result for the 1st second together with the voice data of the 2nd second. The second processing module of the second speech recognition model determines the processing result for the 2nd second from the third processing module's result for the 3rd second together with the voice data of the 2nd second; the third processing module in turn obtains its result from the voice data of the 3rd second and the fourth processing module's result for the 4th second, and so on. Thus the processing result of the second processing module of the second speech recognition model is determined from the results of the tenth through third processing modules of the second speech recognition model together with the voice data of the 2nd second.
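The two opposite module chains described above can be sketched as follows. The `step` function is a deliberately trivial placeholder (summation) standing in for a processing module; only the direction of propagation, where module i of the forward chain depends on module i-1 and module i of the backward chain depends on module i+1, reflects the text.

```python
def forward_results(chunks, step):
    """Module i of the first model uses module i-1's result plus chunk i."""
    results, prev = [], None
    for x in chunks:
        prev = step(prev, x)
        results.append(prev)
    return results

def backward_results(chunks, step):
    """Module i of the second model uses module i+1's result plus chunk i."""
    results, prev = [None] * len(chunks), None
    for i in range(len(chunks) - 1, -1, -1):
        prev = step(prev, chunks[i])
        results[i] = prev
    return results

step = lambda prev, x: (prev or 0) + x  # placeholder "processing"
chunks = [1, 2, 3, 4]
print(forward_results(chunks, step))   # [1, 3, 6, 10]
print(backward_results(chunks, step))  # [10, 9, 7, 4]
```

At position i = 1 (the 2nd time), the forward result 3 depends only on chunks 1-2, while the backward result 9 depends on chunks 2-4, mirroring how the two models partition past and future context.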
In the embodiment of the present invention, in order for the output of the first speech recognition model and the output of the second speech recognition model to be determined at the same time, so that the overall output can be determined in real time, the computation dimension of the first speech recognition model is greater than that of the second speech recognition model. That is, the computation time of the first speech recognition model is longer and that of the second speech recognition model is shorter; at the i-th time, when the i-th processing module of the first speech recognition model has computed its output, the n-th through i-th processing modules of the second speech recognition model have also computed their outputs, so the recognition result of the audio data to be recognized can be determined in real time.
In an optional embodiment, if the difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result is less than a preset threshold, the recognition result of the audio data to be recognized can be considered to be determined in real time.
That is, in the embodiment of the present invention, there may be a small time difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result, without affecting the real-time performance of the recognition result.
In an optional embodiment, in order to output the recognition result as soon as possible, the earlier the time i, the better; for example, i may be the first time or the second time, so that after the audio data to be recognized is input, part of its recognition result can be output quickly.
Exemplarily, in the embodiment of the present invention, the text information corresponding to the audio data to be recognized is "我和你是好朋友" ("you and I are good friends"), where each of the characters "我", "和", "你", "是", "好", "朋", and "友" corresponds to the sub-audio data of one time.
The sub-audio data of each time is input into the respective processing modules of the first speech recognition model and the respective processing modules of the second speech recognition model. By the time the first processing module of the first speech recognition model has parsed "我", the other processing modules of the second speech recognition model have already processed "友", "朋", "好", "是", "你", "和", and "我", so the recognition result "我" can be displayed directly; then, after the second processing module of the first speech recognition model parses "和", the recognition result "和" can also be displayed quickly, thereby achieving real-time display of the recognition results.
In an optional embodiment, for the sub-audio data at the (i+1)-th time, the sub-audio data is input into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and a second recognition result is acquired; the first recognition result is determined from the sub-audio data from the 1st time to the (i+1)-th time of the audio to be recognized, and the second recognition result was already determined during the determination of the text recognition result of the sub-audio data at the i-th time. The text recognition result of the sub-audio data at the (i+1)-th time is then determined from the first recognition result and the second recognition result.
That is, once the first speech recognition model and the second speech recognition model are matched at the i-th time, the second speech recognition model has already obtained all the recognition results from the n-th time to the i-th time, so the overall recognition result can be determined by simply waiting for the remaining recognition results of the first speech recognition model.
In an optional embodiment, the computation dimensions of the processing modules in the first speech recognition model differ: the computation dimension of the (i+1)-th through n-th processing modules is smaller than that of the 1st through i-th processing modules, which speeds up the computation of the first speech recognition model and improves real-time performance.
In the embodiment of the present invention, the computation dimensions of the first and second speech recognition models may be understood as each model's parameter count, or as the size of the computation matrices each model uses. Exemplarily, if the computation dimension refers to the parameter count, then the parameter count of the first speech recognition model is greater than that of the second speech recognition model; for example, the first speech recognition model has 1000 parameters and the second speech recognition model has 500.
In another optional embodiment, the computation dimension of the first speech recognition model is a 1000×1000 matrix and that of the second speech recognition model is a 500×500 matrix, so the computation dimension of the first speech recognition model is greater than that of the second speech recognition model.
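A rough illustration of why the larger computation dimension implies longer per-module computation time, assuming (as a simplification not stated in the source) that each module's cost is dominated by one square matrix-matrix product:

```python
def matmul_flops(dim):
    """Multiply-accumulate operation count for a dim x dim by dim x dim product."""
    return 2 * dim ** 3

# Per-module cost ratio between the two models' computation dimensions.
ratio = matmul_flops(1000) / matmul_flops(500)
print(ratio)  # 8.0 -> the 1000x1000 module costs about 8x the 500x500 module
```

Under this simplification, choosing i so that i large-dimension modules of the first model finish in about the same time as n-i+1 small-dimension modules of the second model is what makes the two models' computation times match.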
Step S303: Determine the text recognition result of the sub-audio data at the i-th time according to the first recognition result and the second recognition result.
In the embodiment of the present invention, after the first speech recognition model has determined the first recognition result and the second speech recognition model has determined the second recognition result, the recognition result of the sub-audio data is determined according to the weight of the first recognition result and the weight of the second recognition result. The weights may be the same or different, and may be set according to the required recognition accuracy or the scenario.
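The weighted combination described above can be sketched as follows. Representing each recognition result as per-label scores is an assumption for illustration (the source does not specify the result format), and the example labels and scores are invented:

```python
def combine_results(first_scores, second_scores, w_first=0.5, w_second=0.5):
    """Pick the label whose weighted combination of the two models' scores is highest."""
    combined = {
        label: w_first * first_scores[label] + w_second * second_scores[label]
        for label in first_scores
    }
    return max(combined, key=combined.get)

first = {"和": 0.7, "河": 0.3}   # hypothetical forward-context model scores
second = {"和": 0.4, "河": 0.6}  # hypothetical backward-context model scores
print(combine_results(first, second))            # equal weights -> "和"
print(combine_results(first, second, 0.2, 0.8))  # weights favor second model -> "河"
```

Shifting the weights toward one model changes which result dominates, which is why the text says the weights may be tuned to the accuracy requirements or the scenario.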
To better explain the embodiments of the present application, a speech recognition method provided by the embodiments is described below in conjunction with a specific implementation scenario, as shown in FIG. 4. In the embodiment of the present invention, the speech recognition method is applied to a conference scenario, in which the speech of the participants needs to be recorded and displayed on a screen.
In the embodiment of the invention, a BRNN model is used for speech recognition. The BRNN model includes two recognition models, namely a first recognition model and a second recognition model, each including N processing modules; the speech content of the participants is determined from the processing results of the processing modules of the first recognition model together with those of the second recognition model. In the embodiment of the present invention, the first recognition model in the BRNN processes in the order 1st processing module, 2nd processing module, 3rd processing module, ..., N-th processing module, while the second recognition model processes in the order N-th processing module, (N-1)-th processing module, ..., 1st processing module. The computation dimension of the first recognition model is greater than that of the second recognition model.
In the embodiment of the present invention, the speech of each participant is collected through a microphone serving as the audio capture device, the speech content is then input into the BRNN model to obtain the recognition result, and the recognition result is displayed on the display screen.
Based on the above embodiments, and referring to FIG. 5, an embodiment of the present invention provides a speech recognition apparatus 500. The apparatus 500 is applied to a speech recognition system that includes at least a first speech recognition model and a second speech recognition model; the first speech recognition model has n processing modules, each with an input and a corresponding output, and the second speech recognition model has n processing modules, each with an input and a corresponding output. The apparatus 500 includes:
an acquisition unit 501, configured to acquire audio data to be recognized, the audio data to be recognized being composed of sub-audio data at n times, where n is greater than or equal to 1;
a computation unit 502, configured to, for the sub-audio data at the i-th time, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, the first recognition result being determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result being determined from the sub-audio data from the i-th time to the n-th time; each processing module of the first speech recognition model corresponding to the sub-audio data of one time, and each processing module of the second speech recognition model corresponding to the sub-audio data of one time; the computation time of the first speech recognition model matching the computation time of the second speech recognition model, the computation dimension of the first speech recognition model being greater than that of the second speech recognition model, and i being determined according to the computation dimensions of the first and second speech recognition models, i belonging to n; and
a result determination unit 503, configured to determine the text recognition result of the sub-audio data at the i-th time according to the first recognition result and the second recognition result.
Optionally, the computation unit 502 is specifically configured to:
input the sub-audio data of the 1st time into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 1st time; use the first recognition result of the sub-audio data of the 1st time and the sub-audio data of the 2nd time as input data of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 2nd time; use the first recognition result of the sub-audio data of the 2nd time and the sub-audio data of the 3rd time as input data of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 3rd time; and so on, to obtain the first recognition result of the sub-audio data of the i-th time; and
input the sub-audio data of the n-th time into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the n-th time; use the second recognition result of the sub-audio data of the n-th time and the sub-audio data of the (n-1)-th time as input data of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the (n-1)-th time; use the second recognition result of the sub-audio data of the (n-1)-th time and the sub-audio data of the (n-2)-th time as input data of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the (n-2)-th time; and so on, to obtain the second recognition result of the sub-audio data of the i-th time.
Optionally, the matching of the computation time of the first speech recognition model with the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result being less than a preset threshold.
Optionally, the computation unit 502 is further configured to:
for the sub-audio data at the (i+1)-th time, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and acquire a second recognition result, the first recognition result being determined from the sub-audio data from the 1st time to the (i+1)-th time of the audio to be recognized, and the second recognition result having been determined during the determination of the text recognition result of the sub-audio data at the i-th time.
The result determination unit is further configured to:
determine the text recognition result of the sub-audio data at the (i+1)-th time according to the first recognition result and the second recognition result.
Optionally, the result determination unit 503 is specifically configured to:
determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
Based on the same technical concept, an embodiment of the present application provides a computer device, as shown in FIG. 6, including at least one processor 601 and a memory 602 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 601 and the memory 602; in FIG. 6, a bus connection between the processor 601 and the memory 602 is taken as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In the embodiment of the present application, the memory 602 stores instructions executable by the at least one processor 601, and by executing the instructions stored in the memory 602, the at least one processor 601 can perform the steps included in the aforementioned speech recognition method.
The processor 601 is the control center of the computer device and can use various interfaces and lines to connect the various parts of the terminal device; by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602, it obtains the client address. Optionally, the processor 601 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 601. In some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
处理器601可以是通用处理器，例如中央处理器（CPU）、数字信号处理器、专用集成电路（Application Specific Integrated Circuit，ASIC）、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件，可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成，或者用处理器中的硬件及软件模块组合执行完成。The processor 601 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
存储器602作为一种非易失性计算机可读存储介质，可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块。存储器602可以包括至少一种类型的存储介质，例如可以包括闪存、硬盘、多媒体卡、卡型存储器、随机访问存储器（Random Access Memory，RAM）、静态随机访问存储器（Static Random Access Memory，SRAM）、可编程只读存储器（Programmable Read Only Memory，PROM）、只读存储器（Read Only Memory，ROM）、带电可擦除可编程只读存储器（Electrically Erasable Programmable Read-Only Memory，EEPROM）、磁性存储器、磁盘、光盘等等。存储器602是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质，但不限于此。本申请实施例中的存储器602还可以是电路或者其它任意能够实现存储功能的装置，用于存储程序指令和/或数据。As a non-volatile computer-readable storage medium, the memory 602 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 602 may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, a card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 602 is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 602 in the embodiments of the present application may also be a circuit or any other device capable of realizing a storage function, for storing program instructions and/or data.
基于相同的技术构思，本申请实施例提供了一种计算机可读存储介质，其存储有可由计算机设备执行的计算机程序，当所述程序在计算机设备上运行时，使得所述计算机设备执行语音识别方法的步骤。Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium that stores a computer program executable by a computer device. When the program runs on the computer device, the computer device is caused to execute the steps of the voice recognition method.
本发明实施例还提供一种计算机程序产品，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行上述任一所述方法。An embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the foregoing methods.
本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
本申请是参照根据本申请的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
显然，本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样，倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内，则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the spirit and scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to encompass these modifications and variations.

Claims (13)

  1. 一种语音识别方法，其特征在于，所述方法应用于语音识别系统，所述语音识别系统至少包括第一语音识别模型以及第二语音识别模型，所述第一语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述第二语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述方法包括：A speech recognition method, characterized in that the method is applied to a speech recognition system, the speech recognition system comprising at least a first speech recognition model and a second speech recognition model, wherein the first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, the method comprising:
    获取待识别音频数据，所述待识别音频数据由n个时刻的子音频数据构成，其中n大于等于1；acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n time instants, where n is greater than or equal to 1;
    针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i个时刻的子音频数据确定的，所述第二识别结果是根据所述待识别音频中第i个时刻到第n个时刻的子音频数据确定的，所述第一语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第二语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，所述第一语音识别模型的计算维度大于所述第二语音识别模型的计算维度，i是根据所述第一语音识别模型的计算维度与所述第二语音识别模型的计算维度确定的，i属于n；for the sub-audio data at the i-th time instant, inputting the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the i-th time instant in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th time instant to the n-th time instant in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data of one time instant, and each processing module in the second speech recognition model corresponds to the sub-audio data of one time instant; the computation time of the first speech recognition model matches the computation time of the second speech recognition model; the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
    根据所述第一识别结果以及所述第二识别结果确定所述第i个时刻的子音频数据的文本识别结果。determining the text recognition result of the sub-audio data at the i-th time instant according to the first recognition result and the second recognition result.
  2. 根据权利要求1所述的方法，其特征在于，针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，包括：The method according to claim 1, characterized in that, for the sub-audio data at the i-th time instant, inputting the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain the first recognition result and the second recognition result respectively comprises:
    将第1时刻的子音频数据输入至所述第一语音识别模型中的第1个处理模块，得到第1时刻的子音频数据的第一识别结果，将所述第1时刻的子音频数据的第一识别结果以及第2时刻的子音频数据作为所述第一语音识别模型中的第2个处理模块的输入数据，得到第2时刻的子音频数据的第一识别结果，将所述第2时刻的子音频数据的第一识别结果以及第3时刻的子音频数据作为所述第一语音识别模型中的第3个处理模块的输入数据，得到第3时刻的子音频数据的第一识别结果，以此类推得到第i时刻的子音频数据的第一识别结果；inputting the sub-audio data at the 1st time instant into the 1st processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st time instant; using the first recognition result of the sub-audio data at the 1st time instant and the sub-audio data at the 2nd time instant as the input data of the 2nd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd time instant; using the first recognition result of the sub-audio data at the 2nd time instant and the sub-audio data at the 3rd time instant as the input data of the 3rd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd time instant; and so on, to obtain the first recognition result of the sub-audio data at the i-th time instant;
    将第n时刻的子音频数据输入至所述第二语音识别模型中的第n个处理模块，得到第n时刻的子音频数据的第二识别结果，将所述第n时刻的子音频数据的第二识别结果以及第n-1时刻的子音频数据作为所述第二语音识别模型中的第n-1个处理模块的输入数据，得到第n-1时刻的子音频数据的第二识别结果，将所述第n-1时刻的子音频数据的第二识别结果以及第n-2时刻的子音频数据作为所述第二语音识别模型中的第n-2个处理模块的输入数据，得到第n-2时刻的子音频数据的第二识别结果，以此类推得到第i时刻的子音频数据的第二识别结果。inputting the sub-audio data at the n-th time instant into the n-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th time instant; using the second recognition result of the sub-audio data at the n-th time instant and the sub-audio data at the (n-1)-th time instant as the input data of the (n-1)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th time instant; using the second recognition result of the sub-audio data at the (n-1)-th time instant and the sub-audio data at the (n-2)-th time instant as the input data of the (n-2)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th time instant; and so on, to obtain the second recognition result of the sub-audio data at the i-th time instant.
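The forward and backward recurrences recited above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: `forward_step`, `backward_step`, and `fuse` are hypothetical placeholders for the processing modules of the first and second speech recognition models and for the result combination, whose internals the claims do not specify.

```python
# Hypothetical sketch of the claimed scheme: the first model's i-th module
# sees sub-audio 1..i (left-to-right), the second model's i-th module sees
# sub-audio i..n (right-to-left), and the two results are fused per time instant.

def recognize(sub_audio, forward_step, backward_step, fuse):
    n = len(sub_audio)
    # First model: a left-to-right recurrence over the sub-audio chunks.
    fwd, state = [], None
    for chunk in sub_audio:
        state = forward_step(state, chunk)
        fwd.append(state)
    # Second model: a right-to-left recurrence over the same chunks.
    bwd, state = [None] * n, None
    for i in range(n - 1, -1, -1):
        state = backward_step(state, sub_audio[i])
        bwd[i] = state
    # Fuse the first and second recognition results at each time instant.
    return [fuse(f, b) for f, b in zip(fwd, bwd)]
```

With toy string-concatenating step functions, the i-th output pairs the prefix 1..i seen by the first model with the suffix i..n seen by the second model, matching the claim language.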
  3. 根据权利要求2所述的方法，其特征在于，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，包括：The method according to claim 2, characterized in that the computation time of the first speech recognition model matching the computation time of the second speech recognition model comprises:
    所述第一语音识别模型计算得到第一识别结果的时间与所述第二语音识别模型计算得到第二识别结果的时间之间的差值小于预设阈值。the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result is smaller than a preset threshold.
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:The method according to claim 1, wherein the method further comprises:
    在针对第i+1个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i+1个处理模块得到第一识别结果，并获取第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i+1个时刻的子音频数据确定的，所述第二识别结果是在确定第i个时刻的子音频数据的文本识别结果的过程中确定的；for the sub-audio data at the (i+1)-th time instant, inputting the sub-audio data into the (i+1)-th processing module in the first speech recognition model to obtain the first recognition result, and acquiring the second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the (i+1)-th time instant in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th time instant;
    根据所述第一识别结果以及所述第二识别结果确定所述第i+1个时刻的子音频数据的文本识别结果。determining the text recognition result of the sub-audio data at the (i+1)-th time instant according to the first recognition result and the second recognition result.
  5. 根据权利要求1或4所述的方法，其特征在于，所述根据所述第一识别结果以及所述第二识别结果确定子音频数据的识别结果，包括：The method according to claim 1 or 4, characterized in that determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result comprises:
    根据所述第一识别结果的权重以及所述第二识别结果的权重确定子音频数据的识别结果。determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
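The weighted combination recited in claim 5 can be sketched as follows, under stated assumptions: the claims specify neither the form of the recognition results nor how the weights are chosen, so here each result is assumed to be a mapping from candidate labels to scores, and the weights are assumed to be fixed scalars.

```python
# Hypothetical sketch of the weighted fusion in claim 5. Each recognition
# result is assumed to be a dict of label -> score, and w_first / w_second
# are assumed fixed scalar weights summing to 1.

def fuse_results(first_result, second_result, w_first=0.6, w_second=0.4):
    labels = set(first_result) | set(second_result)
    # Weighted sum of the two models' scores for every candidate label.
    combined = {
        label: w_first * first_result.get(label, 0.0)
               + w_second * second_result.get(label, 0.0)
        for label in labels
    }
    # The text recognition result for this time instant is the top-scoring label.
    return max(combined, key=combined.get)
```

In practice the weights might reflect the relative reliability of the larger forward model versus the lighter backward model, but that choice is not fixed by the claims.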
  6. 一种语音识别装置，其特征在于，所述装置应用于语音识别系统，所述语音识别系统至少包括第一语音识别模型以及第二语音识别模型，所述第一语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述第二语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述装置包括：A speech recognition device, characterized in that the device is applied to a speech recognition system, the speech recognition system comprising at least a first speech recognition model and a second speech recognition model, wherein the first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, the device comprising:
    获取单元，用于获取待识别音频数据，所述待识别音频数据由n个时刻的子音频数据构成，其中n大于等于1；an acquiring unit, configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n time instants, where n is greater than or equal to 1;
    计算单元，用于针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i个时刻的子音频数据确定的，所述第二识别结果是根据所述待识别音频中第i个时刻到第n个时刻的子音频数据确定的，所述第一语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第二语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，所述第一语音识别模型的计算维度大于所述第二语音识别模型的计算维度，i是根据所述第一语音识别模型的计算维度与所述第二语音识别模型的计算维度确定的，i属于n；a calculation unit, configured to: for the sub-audio data at the i-th time instant, input the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the i-th time instant in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th time instant to the n-th time instant in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data of one time instant, and each processing module in the second speech recognition model corresponds to the sub-audio data of one time instant; the computation time of the first speech recognition model matches the computation time of the second speech recognition model; the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
    结果确定单元，用于根据所述第一识别结果以及所述第二识别结果确定所述第i个时刻的子音频数据的文本识别结果。a result determining unit, configured to determine the text recognition result of the sub-audio data at the i-th time instant according to the first recognition result and the second recognition result.
  7. 根据权利要求6所述的装置,其特征在于,所述计算单元具体用于:The device according to claim 6, wherein the calculation unit is specifically configured to:
    将第1时刻的子音频数据输入至所述第一语音识别模型中的第1个处理模块，得到第1时刻的子音频数据的第一识别结果，将所述第1时刻的子音频数据的第一识别结果以及第2时刻的子音频数据作为所述第一语音识别模型中的第2个处理模块的输入数据，得到第2时刻的子音频数据的第一识别结果，将所述第2时刻的子音频数据的第一识别结果以及第3时刻的子音频数据作为所述第一语音识别模型中的第3个处理模块的输入数据，得到第3时刻的子音频数据的第一识别结果，以此类推得到第i时刻的子音频数据的第一识别结果；input the sub-audio data at the 1st time instant into the 1st processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st time instant; use the first recognition result of the sub-audio data at the 1st time instant and the sub-audio data at the 2nd time instant as the input data of the 2nd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd time instant; use the first recognition result of the sub-audio data at the 2nd time instant and the sub-audio data at the 3rd time instant as the input data of the 3rd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd time instant; and so on, to obtain the first recognition result of the sub-audio data at the i-th time instant;
    将第n时刻的子音频数据输入至所述第二语音识别模型中的第n个处理模块，得到第n时刻的子音频数据的第二识别结果，将所述第n时刻的子音频数据的第二识别结果以及第n-1时刻的子音频数据作为所述第二语音识别模型中的第n-1个处理模块的输入数据，得到第n-1时刻的子音频数据的第二识别结果，将所述第n-1时刻的子音频数据的第二识别结果以及第n-2时刻的子音频数据作为所述第二语音识别模型中的第n-2个处理模块的输入数据，得到第n-2时刻的子音频数据的第二识别结果，以此类推得到第i时刻的子音频数据的第二识别结果。input the sub-audio data at the n-th time instant into the n-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th time instant; use the second recognition result of the sub-audio data at the n-th time instant and the sub-audio data at the (n-1)-th time instant as the input data of the (n-1)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th time instant; use the second recognition result of the sub-audio data at the (n-1)-th time instant and the sub-audio data at the (n-2)-th time instant as the input data of the (n-2)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th time instant; and so on, to obtain the second recognition result of the sub-audio data at the i-th time instant.
  8. 根据权利要求7所述的装置，其特征在于，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，包括：The device according to claim 7, characterized in that the computation time of the first speech recognition model matching the computation time of the second speech recognition model comprises:
    所述第一语音识别模型计算得到第一识别结果的时间与所述第二语音识别模型计算得到第二识别结果的时间之间的差值小于预设阈值。the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result is smaller than a preset threshold.
  9. 根据权利要求6所述的装置,其特征在于,所述计算单元还用于:The device according to claim 6, wherein the calculation unit is further configured to:
    在针对第i+1个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i+1个处理模块得到第一识别结果，并获取第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i+1个时刻的子音频数据确定的，所述第二识别结果是在确定第i个时刻的子音频数据的文本识别结果的过程中确定的；for the sub-audio data at the (i+1)-th time instant, input the sub-audio data into the (i+1)-th processing module in the first speech recognition model to obtain the first recognition result, and acquire the second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the (i+1)-th time instant in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th time instant;
    所述结果确定单元还用于：the result determining unit is further configured to:
    根据所述第一识别结果以及所述第二识别结果确定所述第i+1个时刻的子音频数据的文本识别结果。determine the text recognition result of the sub-audio data at the (i+1)-th time instant according to the first recognition result and the second recognition result.
  10. 根据权利要求6或者9所述的装置，其特征在于，所述结果确定单元具体用于：The device according to claim 6 or 9, characterized in that the result determining unit is specifically configured to:
    根据所述第一识别结果的权重以及所述第二识别结果的权重确定子音频数据的识别结果。determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  11. 一种计算机设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，其特征在于，所述处理器执行所述程序时实现权利要求1～5任一权利要求所述方法的步骤。A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 5.
  12. 一种计算机可读存储介质，其特征在于，其存储有可由计算机设备执行的计算机程序，当所述程序在计算机设备上运行时，使得所述计算机设备执行权利要求1～5任一所述方法的步骤。A computer-readable storage medium, characterized in that it stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the method according to any one of claims 1 to 5.
  13. 一种计算机程序产品，其特征在于，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行权利要求1～5任一所述方法。A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 5.
PCT/CN2019/127672 2019-09-12 2019-12-23 Voice recognition method and device WO2021047103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910865885.4A CN110610697B (en) 2019-09-12 2019-09-12 Voice recognition method and device
CN201910865885.4 2019-09-12

Publications (1)

Publication Number Publication Date
WO2021047103A1 true WO2021047103A1 (en) 2021-03-18

Family

ID=68892748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127672 WO2021047103A1 (en) 2019-09-12 2019-12-23 Voice recognition method and device

Country Status (2)

Country Link
CN (1) CN110610697B (en)
WO (1) WO2021047103A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408539A (en) * 2020-11-26 2021-09-17 腾讯科技(深圳)有限公司 Data identification method and device, electronic equipment and storage medium
CN115512693A (en) * 2021-06-23 2022-12-23 中移(杭州)信息技术有限公司 Audio recognition method, acoustic model training method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
US20180025721A1 (en) * 2016-07-22 2018-01-25 Google Inc. Automatic speech recognition using multi-dimensional models
CN107871496A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108694940A (en) * 2017-04-10 2018-10-23 北京猎户星空科技有限公司 A kind of audio recognition method, device and electronic equipment
CN109243461A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110610697A (en) 2019-12-24
CN110610697B (en) 2020-07-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19945150; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19945150; Country of ref document: EP; Kind code of ref document: A1)