WO2023092399A1 - Speech recognition method, speech recognition apparatus and system - Google Patents

Speech recognition method, speech recognition apparatus and system

Info

Publication number
WO2023092399A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
audio
category
sound
audio data
Prior art date
Application number
PCT/CN2021/133207
Other languages
English (en)
French (fr)
Inventor
高益
聂为然
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180104424.0A (CN118235199A)
Priority to PCT/CN2021/133207
Publication of WO2023092399A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/87 — Detection of discrete points within a voice signal

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and more specifically, relate to a speech recognition method, speech recognition device and system.
  • voice interaction products have widely entered people's daily life, such as smart terminal equipment, smart home equipment, smart vehicle equipment, etc.
  • the voice interaction is initiated by the user, but the voice end state (that is, the speech end point of a round of dialogue) is generally determined automatically through automatic speech recognition.
  • For example, background noise may cause the speech end state to be detected too late.
  • As another example, a pause in the middle of speech may cause the speech end state to be detected too early.
  • Embodiments of the present application provide a voice recognition method, voice recognition device, and system, which can more accurately determine the end state of a voice, so as to more accurately respond to voice-based subsequent operations.
  • a speech recognition method is provided, comprising: acquiring audio data, the audio data including a plurality of audio frames; extracting sound categories and semantics of the plurality of audio frames; and obtaining the speech end point of the audio data according to the sound categories and semantics of the plurality of audio frames.
  • In the solution of the embodiments of the present application, the speech end point of the audio data is obtained by extracting and combining the sound categories and semantics in the audio data, which makes it possible to determine the speech end point more accurately, respond more accurately to subsequent voice-based operations, and improve user experience.
  • Moreover, in the solution of the embodiments of the present application, the waiting time before the speech end point is not fixed but changes with the actual speech recognition process. Compared with the existing method of responding only after a preset fixed waiting time has elapsed, the speech end point can be obtained more accurately, which improves the timeliness and accuracy of the response to the user's voice command and reduces the user's waiting time.
  • the sound category of the plurality of audio frames may include a sound category of each audio frame in the plurality of audio frames.
  • the sound categories of the plurality of audio frames may include sound categories of some audio frames in the plurality of audio frames.
  • A sound category can be extracted for every audio frame, but semantics cannot be extracted for every audio frame. Specifically, an audio frame that contains no human voice carries no semantics; therefore, semantics cannot be extracted from such a frame, and its semantics can be regarded as empty.
  • The instruction can be responded to immediately, or after a period of time. That is, after the speech end point is obtained, the operation corresponding to the audio data before the speech end point may be executed immediately; alternatively, it may be executed after a period of time, which may be a redundant or error-margin time.
  • Therefore, the solution of the present application can determine the speech end point of the audio data more accurately and thus respond more accurately to the instruction corresponding to the audio data before the speech end point, which helps improve the timeliness and accuracy of the response to the user's voice command, reduce the user's waiting time, and improve user experience.
  • the sound categories of the above multiple audio frames may be obtained according to a relationship between energy of the multiple audio frames and a preset energy threshold.
  • the sound categories include "speech", "pending sound", and "silence", and the preset energy threshold includes a first energy threshold and a second energy threshold, the first energy threshold being greater than the second energy threshold. The sound category of an audio frame whose energy is greater than or equal to the first energy threshold may be determined as "speech"; the sound category of an audio frame whose energy is less than the first energy threshold and greater than the second energy threshold may be determined as "pending sound"; and the sound category of an audio frame whose energy is less than or equal to the second energy threshold may be determined as "silence".
  • the first energy threshold and the second energy threshold may be determined according to energy of background sound of the audio data.
  • In different environments, the silence energy curve differs. For example, in a relatively quiet environment the silence energy (that is, the energy of the background sound) is relatively low, while in a relatively noisy environment it is relatively high. Therefore, deriving the first energy threshold and the second energy threshold from the silence energy can meet the requirements of different environments, as illustrated in the sketch below.
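  • The following is a minimal sketch, not the patented implementation, of how the three sound categories could be assigned from per-frame energy, with both thresholds derived from an estimate of the background (silence) energy. The frame length, the multipliers used to derive the two thresholds, and the function names are illustrative assumptions.

```python
import numpy as np

SPEECH, PENDING, SILENCE = "SPE", "NEU", "SIL"

def frame_energy(frames: np.ndarray) -> np.ndarray:
    """Mean squared amplitude of each frame (frames: [num_frames, frame_len])."""
    return np.mean(frames.astype(np.float64) ** 2, axis=1)

def classify_frames(frames: np.ndarray,
                    background_energy: float,
                    high_factor: float = 8.0,   # assumed multiplier for the first threshold
                    low_factor: float = 2.0):   # assumed multiplier for the second threshold
    """Assign "speech"/"pending sound"/"silence" to each frame by comparing its
    energy against two thresholds derived from the background (silence) energy."""
    first_threshold = high_factor * background_energy    # lower bound of "speech"
    second_threshold = low_factor * background_energy    # lower bound of "pending sound"
    categories = []
    for e in frame_energy(frames):
        if e >= first_threshold:
            categories.append(SPEECH)
        elif e > second_threshold:
            categories.append(PENDING)
        else:
            categories.append(SILENCE)
    return categories

# Example: 1 s of 16 kHz audio split into 20 ms frames (320 samples per frame).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(0.0, 0.01, size=16000)             # faint background noise
    audio[4800:9600] += rng.normal(0.0, 0.2, size=4800)   # louder segment, roughly "speech"
    frames = audio[: len(audio) // 320 * 320].reshape(-1, 320)
    background = float(np.percentile(frame_energy(frames), 10))  # rough silence estimate
    print(classify_frames(frames, background))
```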
  • the plurality of audio frames include a first audio frame and a second audio frame, the first audio frame being an audio frame carrying semantics and the second audio frame being an audio frame, among the plurality of audio frames, that follows the first audio frame; obtaining the speech end point of the audio data according to the sound categories and semantics includes: obtaining the speech end point according to the semantics and the sound category of the second audio frame.
  • the first audio frame may consist of a plurality of audio frames carrying semantics.
  • the second audio frame is one or more audio frames after the first audio frame.
  • multiple audio frames carrying semantics is a different concept from “multiple audio frames” included in the audio data.
  • the number of audio frames included in the first audio frame is smaller than the number of audio frames included in the audio data.
  • the semantics can be fused with the sound category of the second audio frame to obtain fusion features of multiple audio frames; and then the speech end point can be obtained according to the fusion features.
  • Using fusion features for processing can improve processing efficiency on the one hand and improve accuracy on the other hand.
  • the speech endpoint category includes "speaking", “thinking” or "end”
  • the speech endpoint category of the audio data can be determined according to the semantics and the sound category of the second audio frame; in the case that the speech endpoint category of the audio data is "end", the speech end point is obtained.
  • the speech endpoint category of the audio data may be determined according to the fusion features of the multiple audio frames.
  • the semantics and the sound category of the second audio frame can be processed by using the speech endpoint classification model to obtain the speech endpoint category. The speech endpoint classification model is obtained by training with speech samples and the endpoint category labels of the speech samples.
  • the format of the speech samples corresponds to the format of the semantics and the sound category of the second audio frame.
  • the endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  • Alternatively, the fusion features can be processed by using the speech endpoint classification model to obtain the speech endpoint category; the speech endpoint classification model is obtained by training with speech samples and the endpoint category labels of the speech samples, the format of the speech samples corresponds to the format of the fusion features, and the endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  • a speech recognition method is provided, including: acquiring first audio data; determining a first speech end point of the first audio data; after the first speech end point is obtained, responding to the instruction corresponding to the audio data before the first speech end point in the first audio data; acquiring second audio data; determining a second speech end point of the second audio data; and after the second speech end point is obtained, responding to the instruction corresponding to the audio data between the first speech end point and the second speech end point in the second audio data.
  • Using the solution of the embodiments of the present application, the speech end point can be obtained more accurately, which avoids an excessively long response delay caused by detecting the speech end point too late; the speech end point can also be obtained faster, so that a follow-up response is made in time and the user's waiting time is reduced.
  • This improves user experience. Specifically, in the solution of the embodiments of the present application, the audio data is acquired in real time and the speech end point in the audio data is recognized, which facilitates real-time recognition of the speech end points of different instructions, so that each instruction can be responded to once its speech end point is obtained.
  • In other words, the solution of the present application makes it possible to identify the speech end point of each instruction after that instruction is issued, so that each instruction can be responded to in time, instead of responding to all instructions only after multiple instructions have been issued.
  • the first audio data includes a plurality of audio frames
  • determining the first speech end point of the first audio data includes: extracting the sound categories and semantics of the plurality of audio frames; and obtaining the first speech end point of the first audio data according to the sound categories and semantics of the plurality of audio frames.
  • extracting the sound categories and semantics of the plurality of audio frames includes: obtaining the sound categories of the plurality of audio frames according to the relationship between the energy of the plurality of audio frames and a preset energy threshold.
  • the sound categories include "speech", "pending sound", and "silence", and the preset energy threshold includes a first energy threshold and a second energy threshold, the first energy threshold being greater than the second energy threshold; the sound category of an audio frame whose energy is greater than or equal to the first energy threshold is "speech", the sound category of an audio frame whose energy is less than the first energy threshold and greater than the second energy threshold is "pending sound", and the sound category of an audio frame whose energy is less than or equal to the second energy threshold is "silence".
  • the first energy threshold and the second energy threshold are determined according to the energy of the background sound of the first audio data.
  • the plurality of audio frames include a first audio frame and a second audio frame, the first audio frame being an audio frame carrying semantics and the second audio frame being an audio frame, among the plurality of audio frames, that follows the first audio frame; obtaining the first speech end point of the first audio data according to the sound categories and semantics includes: obtaining the first speech end point according to the semantics and the sound category of the second audio frame.
  • the speech endpoint category includes "speaking", “thinking” or "end”
  • obtaining the first speech end point according to the semantics and the sound category of the second audio frame includes: determining the speech endpoint category of the first audio data according to the semantics and the sound category of the second audio frame, and obtaining the first speech end point when the speech endpoint category of the first audio data is "end".
  • determining the speech endpoint category of the first audio data according to the semantics and the sound category of the second audio frame includes: processing the semantics and the sound category of the second audio frame by using the speech endpoint classification model to obtain the speech endpoint category.
  • the speech endpoint classification model is obtained by training with speech samples and the endpoint category labels of the speech samples.
  • the format of the speech samples corresponds to the format of the semantics and the sound category of the second audio frame.
  • the endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  • a method for training a speech endpoint classification model is provided, comprising: obtaining training data, the training data including speech samples and endpoint category labels of the speech samples, wherein the format of the speech samples corresponds to the format of the semantics of a plurality of audio frames of audio data and the sound category of a second audio frame
  • the plurality of audio frames include a first audio frame and the second audio frame
  • the first audio frame is an audio frame carrying the semantics
  • the second audio frame is an audio frame after the first audio frame
  • the endpoint category included in the endpoint category label corresponds to the speech endpoint category
  • the speech endpoint classification model is trained using the training data to obtain the target speech endpoint classification model.
  • the target speech endpoint classification model obtained by using the method described in the third aspect can be used to perform the operation of the first aspect "using the speech endpoint classification model to process the semantics and the sound category of the second audio frame to obtain the speech endpoint category".
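  • As a rough illustration of the training described above, the sketch below fine-tunes an off-the-shelf BERT-style sequence classifier on (speech sample, endpoint category label) pairs with the three labels "speaking", "thinking", and "end". The use of the Hugging Face transformers library, the pretrained checkpoint name, and the toy samples are assumptions made for illustration only, not the patent's implementation; the sample format and its trie-based generation are detailed further below.

```python
# Assumed dependencies: pip install torch transformers
import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification

LABELS = {"speaking": 0, "thinking": 1, "end": 2}

# Toy training data in the "semantics + sound category" part of the sample format;
# the tokenizer adds the initial symbol [CLS] and the end symbol [SEP] itself.
samples = [
    ("open the window [SPE] [SIL]", "speaking"),
    ("set the temperature to twenty [SIL] [NEU] [SIL] [SIL]", "thinking"),
    ("set the temperature to twenty-six degrees [SIL] [SIL] [SIL]", "end"),
]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Register the sound-category markers as ordinary tokens (an assumption of this sketch).
tokenizer.add_tokens(["[SPE]", "[NEU]", "[SIL]"])

model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(LABELS))
model.resize_token_embeddings(len(tokenizer))
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                              # a few passes over the toy data
    for text, label in samples:
        inputs = tokenizer(text, return_tensors="pt")
        target = torch.tensor([LABELS[label]])
        outputs = model(**inputs, labels=target)    # cross-entropy loss over 3 classes
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```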
  • In combination with the third aspect, in some implementations of the third aspect, the speech samples may be in the format "initial symbol + semantics + sound category + end symbol", or in the format "initial symbol + sound category + semantics + end symbol".
  • some text corpus can be obtained, and a trie of these corpora can be established.
  • Each node in the trie (each node corresponds to a word) includes the following information: whether it is an end point and the frequency of the prefix.
  • the above speech samples can be generated according to the node information.
  • the end point is the end point of a sentence.
  • the prefix frequency indicates how many characters still follow the word before the end point is reached; the higher the prefix frequency, the less likely the word is an end point.
  • For example, verbs such as "give" and "take" and prepositions are relatively unlikely to appear at the end of a sentence and tend to have a high prefix frequency in the dictionary tree, whereas words that are likely to end a sentence tend to have a low prefix frequency in the trie. A minimal sketch of this idea follows below.
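  • The following is a minimal sketch, under stated assumptions, of the dictionary-tree idea: each node records whether a sentence can end there and a prefix frequency (how often further words follow it in the corpus), and labeled samples are generated from that information. The threshold mapping prefix frequency to the "thinking"/"end" labels, the appended sound-category tokens, and the toy corpus are illustrative assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False        # can a sentence end at this node?
        self.prefix_count = 0      # how many corpus sentences continue past this node

def build_trie(corpus):
    root = TrieNode()
    for sentence in corpus:
        node = root
        words = sentence.split()
        for i, word in enumerate(words):
            node = node.children.setdefault(word, TrieNode())
            if i < len(words) - 1:
                node.prefix_count += 1   # something still follows this word
        node.is_end = True
    return root

def generate_samples(root, end_threshold=1):
    """Walk the trie and emit (text, label) pairs in an assumed
    "semantics + sound category" format with "end"/"thinking" labels."""
    samples = []

    def walk(node, prefix):
        for word, child in node.children.items():
            text = (prefix + " " + word).strip()
            if child.is_end and child.prefix_count <= end_threshold:
                samples.append((text + " [SIL] [SIL]", "end"))       # likely a sentence end
            else:
                samples.append((text + " [SIL] [NEU]", "thinking"))  # more words usually follow
            walk(child, text)

    walk(root, "")
    return samples

corpus = ["open the window", "open the window please", "set the temperature to twenty six"]
trie = build_trie(corpus)
for text, label in generate_samples(trie):
    print(label, "|", text)
```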
  • In a fourth aspect, a speech recognition device is provided, which includes a unit for performing the method in any one of the implementation manners of the above-mentioned first aspect.
  • a voice recognition device includes a unit for performing the method in any one implementation manner of the second aspect above.
  • a training device for a voice endpoint classification model includes a unit for performing the method in any one of the implementation manners of the above-mentioned third aspect.
  • a speech recognition device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to execute the method in any one of the implementation manners of the first aspect or the second aspect.
  • the device may be deployed in various devices or systems that need to determine the speech end point, such as speech recognition devices, voice assistants, or smart speakers; it may also be a computer, host, server, or another device with computing capability.
  • the device can also be a chip.
  • a training device for a speech endpoint classification model comprising: a memory for storing a program; a processor for executing the program stored in the memory, when the program stored in the memory is executed , the processor is configured to execute the method in any one implementation manner in the third aspect.
  • the training device can be a computer, a mainframe or a server and other devices with computing capabilities.
  • the training device can also be a chip.
  • a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for executing the method in any one of the implementation manners of the first aspect, the second aspect, or the third aspect.
  • a computer program product containing instructions is provided, and when the computer program product is run on a computer, it causes the computer to execute the method in any one of the above-mentioned first aspect, second aspect, or third aspect.
  • a vehicle-mounted system includes the device in any one of the implementation manners of the fourth aspect, the fifth aspect, or the sixth aspect.
  • the vehicle system may include a cloud service device and a terminal device.
  • the terminal device may be any one of a vehicle, a vehicle-mounted chip, or a vehicle-mounted device (such as a vehicle machine, a vehicle-mounted computer), and the like.
  • In a twelfth aspect, an electronic device is provided, which includes the apparatus in any one of the implementation manners of the fourth aspect, the fifth aspect, or the sixth aspect.
  • the electronic device may specifically include one or more of a computer, a smartphone, a tablet computer, a personal digital assistant (PDA), a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), or a robot.
  • With the solution of the embodiments of the present application, the speech end point of the audio data can be determined more accurately, so that subsequent voice-based operations can be responded to more accurately and user experience is improved.
  • the solution of the present application can avoid an excessively long response delay caused by detecting the speech end point too late, and can obtain the speech end point faster, so that a follow-up response is made in time, the user's waiting time is reduced, and the user experience is improved.
  • At the same time, the solution can obtain an accurate speech end point, avoid premature truncation of the user's voice command caused by detecting the speech end point too early, and obtain audio data with complete semantics, which facilitates accurately identifying the user's intention, responding accurately to the voice command, and improving user experience.
  • Obtaining the energy threshold according to the energy of the background sound can adapt to the needs of different environments, and further improve the accuracy of speech end point discrimination. Firstly, the sound category and semantics are fused, and then the speech end point is obtained according to the fusion feature. On the one hand, the processing efficiency can be improved, and on the other hand, the accuracy of the speech end point can be further improved.
  • FIG. 1 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
  • Fig. 2 is a schematic diagram of classification of sound categories in the embodiment of the present application.
  • FIG. 3 is a schematic diagram of processing fusion features by the speech endpoint classification model of the embodiment of the present application.
  • Fig. 4 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
  • FIG. 5 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a training method for a speech endpoint classification model according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a dictionary tree according to an embodiment of the present application.
  • Fig. 8 is a schematic block diagram of a speech recognition device according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of a voice recognition device according to an embodiment of the present application.
  • the solution of the present application can be applied in various voice interaction scenarios.
  • the solution of the present application may be applied in a voice interaction scenario of an electronic device and a voice interaction scenario of an electronic system.
  • the electronic device may specifically include one or more of a computer, a smartphone, a tablet computer, a personal digital assistant (PDA), a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), or a robot.
  • the electronic system may include cloud service equipment and terminal equipment.
  • the electronic system may be a vehicle system or a smart home system.
  • the end-side device of the on-board system may include any one of a vehicle, a vehicle-mounted chip, or a vehicle-mounted device (e.g., a head unit or an on-board computer).
  • the cloud service device includes physical servers and virtual servers. The server receives data uploaded by the end side (such as a head unit), processes the data, and sends the processed data back to the end side.
  • voice interaction can be realized through voice assistants.
  • a smart phone can be operated through voice interaction with a voice assistant, or a conversation can be conducted with a voice assistant.
  • the voice assistant can obtain audio data through a microphone, then use the processing unit to determine the speech end point of the audio data, and trigger a follow-up response after the speech end point is obtained, for example, reporting the user's intention in the audio data to the operating system for response.
  • the voice end point of the audio data can be accurately identified, thereby improving the accuracy and timeliness of subsequent responses and improving user experience.
  • the control of the vehicle can be realized through voice interaction.
  • the audio data can be obtained through the microphone, and the processing unit can then determine the speech end point of the audio data and trigger a follow-up response after the speech end point is obtained, for example, reporting the user's intention in the audio data to the in-vehicle system for response.
  • functions such as obtaining routes, playing music, and controlling in-vehicle hardware (such as windows, air conditioners, etc.) can be realized to improve the interactive experience of in-vehicle systems.
  • the voice end point of the audio data can be accurately identified, thereby improving the accuracy and timeliness of subsequent responses and improving user experience.
  • FIG. 1 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
  • the speech recognition device 100 is used to process audio data to obtain the end point of speech in the audio data, that is, to obtain the stop point of speech in the audio data.
  • For example, the speech end point of a segment of audio data may be the last audio frame corresponding to the last word in the speech; that is, the speech end point of that segment of audio data is the last audio frame corresponding to its final word.
  • the end point of audio data means that the audio has stopped.
  • For example, the end point of a piece of audio data with a length of 5 seconds is the point at which the 5th second ends.
  • the speech end point of the audio data refers to the stop point of the speech in the segment of audio data. Again taking audio data with a length of 5 seconds as an example, assuming that the first 4 seconds contain speech and the 5th second contains no speech, the speech end point of the audio data is the audio frame corresponding to the end of the 4th second. Assuming instead that none of the 5 seconds contains speech, and that the preset time interval for speech recognition is 3 seconds (that is, speech recognition is terminated if no speech is recognized within 3 consecutive seconds), the speech end point of the 5-second audio data is the audio frame corresponding to the end of the 3rd second.
  • Audio data may include multiple audio frames. It should be understood that the input audio data may or may not include speech. For example, assuming that a person wakes up the voice collection function but does not speak within a few seconds, the audio data collected at this time is audio data that does not include voice.
  • the multiple audio frames in the audio data may be continuous audio frames or discontinuous audio frames.
  • the speech recognition device 100 includes an acquisition module 110 , a processing module 120 and a decision module 130 .
  • the decision module 130 may also be integrated in the processing module 120 .
  • the obtaining module 110 is used for obtaining audio data, and the audio data may include multiple audio frames.
  • the acquisition module 110 may include a voice acquisition device such as a microphone for acquiring voice audio in real time.
  • the obtaining module 110 may also include a communication interface, and the communication interface may use a transceiving device such as a transceiver to realize communication with other devices or communication networks, so as to obtain audio data from other devices or communication networks.
  • the processing module 120 is configured to process multiple audio frames in the audio data to obtain sound categories and semantics of the multiple audio frames. It can be understood that the processing module 120 is used to extract sound categories and semantics of multiple audio frames in the audio data.
  • Semantics is used to represent the language included in the audio data, and may also be referred to as textual meaning, literal meaning, linguistic meaning, and the like. Semantics can also be carried by audio streams.
  • the sound category of the plurality of audio frames may include a sound category of each audio frame in the plurality of audio frames.
  • the sound categories of the plurality of audio frames may include sound categories of some audio frames in the plurality of audio frames.
  • the processing module 120 may extract the sound categories of all the audio frames in the multiple audio frames, or may only extract the sound categories of some audio frames in the multiple audio frames.
  • a device such as automatic speech recognition (auto speech recognition, ASR) may be used to obtain the audio stream carrying semantics.
  • Each audio stream can be represented by corresponding text.
  • Each audio stream may contain one or more audio frames.
  • the sound category may include: "speech (speech, SPE)", “pending sound (neutral, NEU)” and “silence (silence, SIL)".
  • "Speech" refers to the part of the audio that can definitely be determined to be spoken by a person (that is, the human voice in the audio); "pending sound" refers to the part of the audio that is relatively ambiguous and cannot be definitely determined to be speech; and "silence" refers to the part of the audio that does not include a human voice. It should be understood that in this embodiment of the present application, "silence" may mean no speech, no sound, or only background sound; it does not mean that the decibel level is 0 or that there is no sound at all in the physical sense.
  • the sound category may only include “speaking voice” and “mute”, or may include “mute” and “non-mute”, or may include “speaking voice” and “non-speaking voice” .
  • “Speaking voice” and “non-speech voice” can also be called “human voice” and “non-human voice” respectively.
  • the above-mentioned "speech", "pending sound", and "silence" may also be called "human voice", "possibly human voice", and "not human voice", respectively. It should be understood that this is only an example, and this embodiment of the present application does not limit the classification of sound categories.
  • The sound category amounts to judging and classifying the audio data from an acoustic point of view and is used to distinguish the category of each audio frame. The semantics extracts the language components in the audio data and is used to infer, from a linguistic perspective, whether the utterance has been completed. It should be understood that a sound category can be extracted for every audio frame; however, since an audio frame that contains no human voice carries no semantics, the semantics of such a frame can be regarded as empty.
  • Generally, the energy of a "speech" audio frame is relatively high, the energy of a "silence" audio frame is relatively low, and the energy of a "pending sound" audio frame is lower than that of "speech" but higher than that of "silence".
  • the energy of an audio frame may also be referred to as the intensity of the audio frame.
  • the audio frame may be classified according to the energy of the audio frame, so as to obtain the sound category of the audio frame.
  • the sound category of the corresponding audio frame is obtained according to the energy of the audio frame in the audio data.
  • the sound category of the corresponding audio frame may be obtained according to the relationship between the energy of the audio frame and a preset energy threshold.
  • the preset energy threshold may include a first energy threshold and a second energy threshold, the first energy threshold being greater than the second energy threshold; an audio frame whose energy is greater than or equal to the first energy threshold is determined to be speech, an audio frame whose energy is less than the first energy threshold and greater than the second energy threshold is determined to be a pending sound, and an audio frame whose energy is less than or equal to the second energy threshold is determined to be silence.
  • In other words, the sound category of an audio frame whose energy is greater than or equal to the first energy threshold is determined as "speech", the sound category of an audio frame whose energy is less than the first energy threshold and greater than the second energy threshold is determined as "pending sound", and the sound category of an audio frame whose energy is less than or equal to the second energy threshold is determined as "silence".
  • Alternatively, an audio frame whose energy is exactly equal to the first energy threshold or the second energy threshold may also be determined as a pending sound; the above description is merely an example of this embodiment of the present application.
  • Fig. 2 is a schematic diagram of classification of sound categories in the embodiment of the present application.
  • the abscissa represents the audio frame sequence
  • the ordinate represents the energy value corresponding to the audio frame sequence.
  • the energy curve represents the energy change curve of multiple audio frames in the audio data.
  • the first energy threshold curve represents the lower energy limit of "speech" and the upper energy limit of "pending sound"
  • the second energy threshold curve represents the lower energy limit of "pending sound" and the upper energy limit of "silence"
  • the silence energy curve represents the energy of the background sound of the audio segment. Both the first energy threshold curve and the second energy threshold curve can be obtained from the silence energy curve; that is, both the first energy threshold and the second energy threshold can be obtained from the energy of the background sound.
  • In different environments, the silence energy curve differs. For example, in a relatively quiet environment the silence energy (that is, the energy of the background sound) is relatively low, while in a relatively noisy environment it is relatively high. Therefore, obtaining the first energy threshold curve and the second energy threshold curve from the silence energy curve can meet the requirements of different environments.
  • In Fig. 2, the sound categories of the audio frames are divided into three categories: "speech" (shown as SPE), "pending sound" (shown as NEU), and "silence" (shown as SIL).
  • the sound category sequence of the audio frame sequence is "SPE NEU SIL SIL NEU SPE NEU SIL NEU SPE NEU NEU SPE NEU SIL" from left to right.
  • the processing module 120 may be a processor capable of data processing, such as a central processing unit or a microprocessor, or another device, chip, or integrated circuit with computing capability.
  • the decision module 130 is used for obtaining the voice end point of the audio data according to the sound category and semantics from the processing module 120 .
  • the speech end point is the detection result of the speech end state.
  • the acquisition module 110 acquires the audio data in real time
  • the acquisition of the audio data may end after the detection result is obtained.
  • the decision-making module 130 may obtain the above-mentioned speech end point according to the sound category and semantics of all the audio frames in the plurality of audio frames.
  • the text endpoints that may be the end points of speech can be preliminarily determined according to the semantics, and then the end point of speech can be found out from the text endpoints that may be the end points of speech according to the sound category of each text endpoint.
  • the candidate audio frames that may be the end points of speech may be preliminarily determined according to the sound category, and then the end point of speech may be determined according to the semantics before these candidate audio frames.
  • the multiple audio frames include a first audio frame and a second audio frame
  • the first audio frame is an audio frame carrying semantics
  • the second audio frame is an audio frame, among the multiple audio frames, that follows the first audio frame.
  • the decision module 130 may obtain the above-mentioned speech end point according to the semantics and the sound category of the second audio frame.
  • the first audio frame may consist of a plurality of audio frames carrying semantics.
  • the second audio frame is one or more audio frames after the first audio frame.
  • multiple audio frames carrying semantics is a different concept from “multiple audio frames” included in the audio data.
  • the number of audio frames included in the first audio frame is smaller than the number of audio frames included in the audio data.
  • the decision module 130 may fuse the semantics and the sound category of the second audio frame to obtain a fusion feature, and obtain a speech end point according to the fusion feature.
  • the fused feature can be understood as being obtained by appending one or more audio frames with determined sound categories to the audio stream carrying the semantics. For example, assume that a piece of audio data includes "I want to watch TV" followed by 5 further audio frames. After the above extraction of sound category and semantics, the semantics "I want to watch TV" and the sound category of each of the following 5 audio frames can be obtained; the fusion feature is then formed by appending the sound categories of those 5 audio frames to the audio frames carrying the semantics "I want to watch TV". Compared with direct processing, using fusion features can improve processing efficiency on the one hand and accuracy on the other, as sketched below.
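  • The following is a minimal sketch of how such a fusion feature might be assembled: the recognized semantics (text) is concatenated with the sound-category tokens of the audio frames that follow the semantic content, framed by an initial symbol and an end symbol. The token spellings and the helper name are assumptions made for illustration.

```python
from typing import List

def build_fusion_feature(semantics: str, trailing_categories: List[str],
                         start_token: str = "[CLS]", end_token: str = "[SEP]") -> str:
    """Append the sound categories of the frames that follow the recognized text,
    e.g. "I want to watch TV" + ["[SIL]", "[SPE]", "[SIL]", "[SIL]", "[SIL]"]."""
    return " ".join([start_token, semantics] + trailing_categories + [end_token])

# Example matching the description above: the semantics "I want to watch TV"
# followed by the sound categories of the next 5 audio frames.
feature = build_fusion_feature("I want to watch TV",
                               ["[SIL]", "[SPE]", "[SIL]", "[SIL]", "[SIL]"])
print(feature)
# -> [CLS] I want to watch TV [SIL] [SPE] [SIL] [SIL] [SIL] [SEP]
```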
  • speech endpoint categories may include "speak", "think” and "end”.
  • the decision-making module 130 may determine the speech endpoint category of the audio data according to the semantics and the sound category of the second audio frame, and obtain the above-mentioned speech end point when the speech endpoint category of the audio data is "end”.
  • That the speech endpoint category of the audio data is "end" can be understood as meaning that the audio data includes a text endpoint whose speech endpoint category is "end".
  • the audio frame corresponding to the text endpoint whose speech endpoint category is "End” can be used as the speech end point.
  • the decision module 130 determines the speech end point category of the audio data according to the fusion feature, and obtains the above speech end point when the speech end point category of the audio data is "end".
  • the decision module 130 may also use the speech endpoint classification model to process the semantics and the sound category of the second audio frame to obtain the speech endpoint category, thereby obtaining the speech end point of the audio data.
  • the decision-making module 130 may use the speech endpoint classification model to process the above fused feature to obtain the speech endpoint category, thereby obtaining the speech end point of the audio data.
  • the fusion feature is input into the speech endpoint classification model as an input feature for processing, and the speech endpoint category is obtained, thereby obtaining the speech end point of the audio data.
  • Alternatively, the decision module 130 may not fuse the semantics and the sound category of the second audio frame, but instead directly input them into the speech endpoint classification model as two input features for processing.
  • the above speech recognition device 100 is implemented in the form of functional modules, and the term "module" here may be implemented in the form of software and/or hardware, which is not limited in this embodiment of the present application.
  • the division of the above modules is only a logical function division, and there may be another division method in actual implementation, for example, multiple modules may be integrated into one module. That is, the acquisition module 110, the processing module 120 and the decision module 130 may be integrated into one module. Alternatively, each of the multiple modules may exist independently. Alternatively, two of the multiple modules are integrated into one module, for example, the decision-making module 130 may be integrated into the processing module 120 .
  • the above multiple modules can be deployed on the same hardware or on different hardware.
  • the functions required by the above multiple modules may be performed by the same hardware, or may be performed by different hardware, which is not limited in this embodiment of the present application.
  • the aforementioned speech endpoint categories may include "speaking”, “thinking” and “ending”.
  • "speaking" can be understood as still talking; that is, the endpoint is neither a termination endpoint nor a pause endpoint;
  • "thinking" can be understood as thinking or a short pause; that is, the endpoint is merely a pause endpoint and there may be follow-up speech;
  • "end" can be understood as stop or finish; that is, this endpoint is the termination end point of the speech.
  • the speech endpoint classification model can be obtained by using a language model, for example, a bidirectional encoder representations from transformers (BERT) model.
  • The following uses the BERT model as an example for description, but it should be understood that any other language model capable of performing the above classification may also be used.
  • Fig. 3 is a schematic diagram of processing fusion features by the classification model of the embodiment of the present application.
  • the classification model may be a BERT model. The BERT model includes an embedding layer and a fully connected layer (indicated by the white box C in Fig. 3); the embedding layer includes a token embedding layer (token embeddings), a segment embedding layer (segment embeddings), and a position embedding layer (position embeddings). The embedding layer processes the input data (that is, the above-mentioned fusion features) and outputs the result to the fully connected layer, and the fully connected layer then obtains the above-mentioned speech endpoint category.
  • As an example, the input data includes 5 fusion features, represented by In1 to In5 in Fig. 3:
  • In1 is "[CLS] open the window [SPE][SIL][SEP]"
  • In2 is "[CLS] open the window [SIL][SPE][SIL][SIL][SEP]"
  • In3 is "[CLS] set the temperature to twenty [SIL][NEU][SIL][SIL][SEP]"
  • In4 is "[CLS] the temperature is adjusted to twenty-six degrees [SIL][SIL][SIL][SEP]"
  • In5 is "[CLS] the temperature is adjusted to twenty-six degrees [SIL][SIL][SEP]"
  • [CLS] is the starting data of the input data of the BERT model or can be called the initial symbol
  • [SEP] is the termination data of the input data of the BERT model or can be called the terminator. That is, in this example, the format of the fused feature is "[CLS]+semantics+sound category+[SEP]".
  • E_X represents the energy value of X
  • E_CLS represents the energy corresponding to [CLS].
  • Fig. 3 is only a specific example of the speech endpoint classification model and does not constitute a limitation.
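  • As a rough, simplified analogue of the structure described for Fig. 3 (an embedding layer with token, segment, and position embeddings feeding a fully connected classification layer), the toy PyTorch model below classifies a fused input such as In1 into one of the three endpoint categories. The hidden sizes, the hand-made vocabulary, the untrained weights, and the omission of the energy values E_X are assumptions of this sketch, not the patent's model.

```python
import torch
import torch.nn as nn

class EndpointClassifier(nn.Module):
    """Toy stand-in for the Fig. 3 structure: token + segment + position embeddings,
    a small Transformer encoder, and a fully connected layer over the [CLS] position."""
    def __init__(self, vocab_size: int, hidden: int = 64, num_classes: int = 3,
                 max_len: int = 64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.segment_emb = nn.Embedding(2, hidden)
        self.position_emb = nn.Embedding(max_len, hidden)
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_classes)  # "speaking"/"thinking"/"end"

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token_emb(token_ids)
             + self.segment_emb(segment_ids)
             + self.position_emb(positions))
        x = self.encoder(x)
        return self.classifier(x[:, 0, :])   # logits taken from the [CLS] position

# Tiny usage example with an illustrative vocabulary and an In1-style input.
vocab = {"[CLS]": 0, "[SEP]": 1, "[SPE]": 2, "[NEU]": 3, "[SIL]": 4,
         "open": 5, "the": 6, "window": 7}
tokens = [[vocab[t] for t in "[CLS] open the window [SPE] [SIL] [SEP]".split()]]
token_ids = torch.tensor(tokens)
segment_ids = torch.zeros_like(token_ids)
model = EndpointClassifier(vocab_size=len(vocab))
print(model(token_ids, segment_ids))   # [1, 3] logits over the endpoint categories
```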
  • the training method for the speech endpoint classification model will be introduced below, and will not be expanded here.
  • Fig. 4 is a schematic flow chart of a speech recognition method according to an embodiment of the present application. Each step in FIG. 4 will be introduced below.
  • the method shown in FIG. 4 may be performed by an electronic device, and the electronic device may specifically include one or more of a computer, a smartphone, a tablet computer, a personal digital assistant (PDA), a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), or a robot.
  • the method 400 shown in FIG. 4 may also be executed by a cloud service device.
  • the method 400 shown in FIG. 4 may also be executed by a system composed of a cloud service device and a terminal device, such as a vehicle system or a smart home system.
  • the method shown in FIG. 4 may be executed by the speech recognition apparatus 100 in FIG. 1 .
  • the audio data may be obtained by using a voice collection device such as a microphone, or may be obtained from a storage device or a network.
  • the audio data may be obtained in real time or stored.
  • Step 401 can be executed by using the above acquisition module 110 .
  • the relevant introduction of audio data and its acquisition method refers to the above, and will not be repeated here.
  • the audio frame may be obtained by performing a frame division operation on the audio data.
  • one audio frame may be on the order of ten to several tens of milliseconds.
  • Step 402 may be performed by using the above processing module 120 .
  • For the introduction of sound categories and semantics, refer to the description above; details are not repeated here. It should be understood that a sound category can be extracted for every audio frame, but semantics cannot be extracted for every audio frame. Specifically, an audio frame that contains no human voice carries no semantics; therefore, semantics cannot be extracted from such a frame, and its semantics can be regarded as empty.
  • the sound category may be obtained according to the relationship between the energy of the audio frame and a preset energy threshold.
  • the preset energy threshold may include a first energy threshold and a second energy threshold, the first energy threshold being greater than the second energy threshold. The sound category of an audio frame whose energy is greater than or equal to the first energy threshold may be determined as "speech"; the sound category of an audio frame whose energy is less than the first energy threshold and greater than the second energy threshold may be determined as "pending sound"; and the sound category of an audio frame whose energy is less than or equal to the second energy threshold may be determined as "silence".
  • Each audio stream can be represented by corresponding text.
  • Each audio stream may contain one or more audio frames.
  • the sound category of the plurality of audio frames may include a sound category of each audio frame in the plurality of audio frames.
  • the sound categories of the plurality of audio frames may include sound categories of some audio frames in the plurality of audio frames.
  • In step 402, the sound categories of all the audio frames in the plurality of audio frames may be extracted, or only the sound categories of some of the audio frames may be extracted.
  • A sound category can be extracted for each audio frame; that is, each audio frame can obtain its corresponding sound category.
  • In general, multiple audio frames correspond to one piece of text; in other words, one piece of text is carried by multiple audio frames.
  • each word in "open the car window" in Fig. 3 corresponds to multiple audio frames
  • each sound category corresponds to one audio frame.
  • the decision module 130 above can be used to execute step 403 .
  • the method 400 further includes step 404 (not shown in the figure).
  • the operation corresponding to the audio data before the end point of the speech in the audio data may be performed.
  • the operation corresponding to the audio signal before the end of the speech in the audio data may also be understood as the operation corresponding to the end of the speech.
  • the operation corresponding to the audio data before the end point of the speech in the audio data can be executed immediately.
  • the operation corresponding to the audio signal data before the end point of the speech in the audio data may also be performed after a period of time. This period of time may be redundant time or error time.
  • the operation corresponding to the voice end may be an operation in any service processing function.
  • the speech recognition can be stopped.
  • the speech recognition result may be returned to the user.
  • the speech recognition result may be sent to the subsequent module, so that the subsequent module performs a corresponding operation.
  • the audio data may include a control command, and the subsequent module executes the control operation corresponding to the command.
  • the audio data may include an inquiry instruction, and the follow-up module returns an answer sentence corresponding to the inquiry instruction to the user.
  • step 404 may be performed by the executing device of the method 400, or may also be performed by other devices, which is not limited in this embodiment of the present application.
  • For description purposes, the following uses an example in which the audio data includes a user instruction.
  • User instructions can be used to implement various functions such as obtaining routes, playing music, and controlling hardware (eg, lights, air conditioners, etc.). For example, taking control of an air conditioner as an example, the user instruction may be to turn on the air conditioner. It should be understood that the subsequent module may be one module or multiple modules.
  • Indication information can be sent to the subsequent module, the indication information being used to indicate the speech end point, so that the subsequent module can obtain the audio data before the speech end point, obtain the semantic text (for example, "turn on the air conditioner") based on that audio data, parse out the user instruction based on the semantic text, and control the corresponding module to perform the operation indicated by the voice.
  • For example, the ASR sends the speech recognition result (for example, the semantic text "turn on the air conditioner") to the semantic analysis module, and the semantic analysis module parses out the user instruction and sends a control signal to the air conditioner to turn it on.
  • Alternatively, the audio data before the speech end point can be sent to the subsequent module, so that the subsequent module can obtain the semantic text (for example, "turn on the air conditioner") based on that audio data, parse out the user instruction based on the semantic text, and control the corresponding module to perform the operation indicated by the voice.
  • the user instruction can be parsed out based on the semantic text, and the corresponding module can be controlled to execute the operation indicated by the speech.
  • the speech end point may be obtained according to the sound category and semantics of all the audio frames in the plurality of audio frames.
  • the text endpoints that may be the end points of speech can be preliminarily determined according to the semantics, and then the end point of speech can be found out from the text endpoints that may be the end points of speech according to the sound category of each text endpoint.
  • the candidate audio frames that may be the end points of speech may be preliminarily determined according to the sound category, and then the end point of speech may be determined according to the semantics before these candidate audio frames.
  • the multiple audio frames include a first audio frame and a second audio frame
  • the first audio frame is an audio frame carrying semantics
  • the second audio frame is an audio frame, among the multiple audio frames, that follows the first audio frame.
  • the above speech end point can be obtained according to the semantics and the sound category of the second audio frame.
  • the first audio frame may consist of a plurality of audio frames carrying semantics.
  • the second audio frame is one or more audio frames after the first audio frame.
  • multiple audio frames carrying semantics is a different concept from “multiple audio frames” included in the audio data.
  • the number of audio frames included in the first audio frame is smaller than the number of audio frames included in the audio data.
  • the semantics can be fused with the sound category of the second audio frame to obtain a fusion feature, and the speech end point can be obtained according to the fusion feature.
  • Using fusion features for processing can improve processing efficiency on the one hand and improve accuracy on the other hand.
  • the speech endpoint category can include "speak", "think” and "end”
  • the speech endpoint category of the audio data can be determined according to the semantics and the sound category of the second audio frame, and the above speech end point is obtained in the case that the speech endpoint category of the audio data is "end".
  • That the speech endpoint category of the audio data is "end" can be understood as meaning that the audio data includes a text endpoint whose speech endpoint category is "end".
  • the audio frame corresponding to the text endpoint whose speech endpoint category is "End” can be used as the speech end point.
  • the speech endpoint category of the audio data can be determined according to the fusion feature, and the above-mentioned speech end point is obtained when the speech endpoint category of the audio data is "end".
  • the speech endpoint classification model may be used to process the semantics and the sound category of the second audio frame to obtain the speech endpoint category, thereby obtaining the speech end point of the audio data.
  • When the fused feature is obtained, the fused feature can be processed by using the speech endpoint classification model to obtain the speech endpoint category, so as to obtain the speech end point of the audio data.
  • the speech endpoint classification model can be obtained by using speech samples and endpoint category labels of the speech samples, and the format of the speech samples corresponds to the format of the fusion feature, and the endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  • the speech end point of the audio data is obtained by extracting and synthesizing the sound category and semantics in the audio data, so that the speech end point of the audio data can be determined more accurately.
  • the above audio data may be audio data acquired in real time, or read audio data that has been stored, which may be understood as online speech recognition and offline speech recognition respectively.
  • After the speech end point is obtained, it can be used to perform subsequent operations such as voice-based control; for example, to switch electronic devices on or off, to make information queries, or to control audio and video playback.
  • the acquisition of audio data can be ended after the end point of the speech is obtained, that is, the speech recognition is stopped.
  • the instruction corresponding to the audio data before the end point of the speech may be executed after the end point of the speech is obtained, and the audio data may continue to be acquired.
  • the solution of the embodiment of the present application can obtain the end point of the voice more accurately, so as to respond to the voice-based follow-up operation more accurately, and improve user experience.
  • the solution of the present application can avoid the excessively long response delay caused by detecting the speech end point too late and can obtain the speech end point faster, so as to make the follow-up response in time, reduce the user's waiting time and improve the user experience; at the same time, the solution can obtain an accurate speech end point, avoid the user's voice command being truncated prematurely due to detecting the speech end point too early, and obtain audio data with complete semantics, which is conducive to accurately identifying the user's intention and making an accurate response, improving the user experience.
  • for example, assume that in a voice-controlled electronic device scenario the user generates a total of 7 seconds of audio data: the first 3 seconds are the voice of "please turn on the air conditioner", the 4th second is a 1-second pause, and seconds 5-7 are coughing.
  • the solution of the embodiment of the present application can accurately identify the last word of "please turn on the air conditioner" as the speech end point, so the speech acquisition can be ended after the 3rd second or during the 4th second.
  • the traditional method based on activity detection needs to set a fixed waiting time, and the voice acquisition is ended only after the actual waiting time is greater than or equal to that fixed waiting time; assuming the fixed waiting time is 2 seconds, the 1-second pause is treated as a short pause, so recognition continues through the 3 seconds of coughing and then still waits the fixed time.
  • compared with the solution of the embodiment of the present application, the method based on activity detection therefore delays at least 5 seconds before ending the voice acquisition.
  • assume that in another voice-controlled electronic device scenario the user generates a total of 6 seconds of audio data, "please call Qian Yi Er", but there is a 1.5-second pause after the word "Qian", and the fixed waiting time of the activity-detection-based method is 1.5 seconds.
  • with the scheme of the embodiment of the present application, the word "Er" can be accurately identified as the speech end point: according to the semantic information, more audio frames whose sound category is "silence" are required after the word "Qian" before it can be used as an "end" endpoint, so the audio acquisition does not end after the 1.5-second pause but ends at or after the 6th second.
  • with the activity-detection-based method, however, the voice acquisition ends when the 1.5-second pause after the word "Qian" expires, resulting in a wrong judgment of the speech end point and a failure to respond to the subsequent control strategy.
  • when speech recognition is performed with the solution of the embodiment of the present application, the waiting time before the speech end point is not fixed but changes with the actual speech recognition process; compared with the existing approach of presetting a fixed waiting time and only obtaining the speech end point after that waiting time has elapsed, the speech end point can be obtained more accurately, thereby improving the timeliness and accuracy of the follow-up response, reducing the user's waiting time and improving the user experience.
  • the embodiment of the present application also provides a speech recognition method 500 .
  • the method 500 includes step 501 to step 506 .
  • Each step of the method 500 is introduced below.
  • the method 500 can be understood as an example of an online speech recognition method.
  • the first audio data may be acquired in real time and may include a plurality of audio frames.
  • step 502 may include: extracting the sound categories and semantics of the plurality of audio frames, and obtaining the first speech end point of the first audio data according to those sound categories and semantics.
  • after the first speech end point is obtained, the operation corresponding to the audio data before the first speech end point may be performed.
  • the instruction corresponding to the audio data before the end point of the first speech in the first audio data is referred to as the first user instruction.
  • the first voice end point is the voice end point of the first user instruction.
  • the first audio data may only include one user instruction, and the voice end point of the instruction can be identified through step 502 .
  • while step 502 is being executed, audio data may continue to be acquired, that is, step 504 may be executed, as long as the second audio data is audio data acquired after the first audio data.
  • for example, the first audio data and the second audio data may be continuously collected audio data.
  • optionally, the second audio data includes a plurality of audio frames, and step 505 may include: extracting the sound categories and semantics of the plurality of audio frames, and obtaining the second speech end point of the second audio data according to those sound categories and semantics.
  • alternatively, the audio data formed by the audio data after the first speech end point in the first audio data together with the second audio data includes a plurality of audio frames, and step 505 may include: extracting the sound categories and semantics of those audio frames, and obtaining the second speech end point of the second audio data according to them.
  • for example, if the audio data after the first speech end point in the first audio data includes 4 audio frames and the second audio data includes 10 audio frames, the audio data composed of those 4 audio frames and those 10 audio frames includes 14 audio frames; the sound categories and semantics of the 14 audio frames are extracted, and the second speech end point of the second audio data is obtained according to them.
  • continuing the example, suppose the second speech end point of the second audio data is located at the 12th of those 14 audio frames; after the second speech end point is obtained, the instruction corresponding to the audio data before the 12th audio frame among the 14 audio frames is responded to.
  • for ease of description, the instruction corresponding to the audio data between the first speech end point and the second speech end point in the first audio data and the second audio data is referred to as the second user instruction.
  • the second speech end point is the speech end point of the second user instruction.
  • the audio data formed by the audio data after the first speech end point in the first audio data together with the second audio data may include only one user instruction, and the speech end point of that instruction can be identified through step 505.
  • the first user instruction and the second user instruction may be two user instructions issued consecutively by the user, that is, the time interval between the two user instructions is relatively small. For example, in the case that the first audio data and the second audio data are continuously collected audio data, the time interval between the first user instruction and the second user instruction is relatively small.
  • the solution of the present application is beneficial to distinguish the voice end point of the first user instruction from the voice end point of the second user instruction.
  • using the scheme of the embodiment of the present application, the speech end point can be obtained more accurately, avoiding the overly long response delay caused by detecting the speech end point too late, and obtaining the speech end point faster, so as to make the follow-up response in time, reduce the user's waiting time and improve the user experience.
  • specifically, in the solution of the embodiment of the present application, the audio data is acquired in real time and the speech end points in the audio data are recognized, which is beneficial to recognizing the speech end points of different instructions in real time and responding to each instruction once its speech end point is obtained.
  • especially when the intervals between multiple instructions issued by the user are short, the scheme of the present application is beneficial to identifying the speech end point of each instruction after it is issued, so as to respond to each instruction in time without waiting until all of the multiple instructions have been issued before responding to them.
  • for example, assume that in a voice-controlled electronic device scenario the user generates 8 seconds of audio data in total, and the 8 seconds of audio data include two user instructions: "please close the car window" and "please turn on the air conditioner".
  • the interval between the two user instructions is small, for example 1 second, that is, the user issues the instruction "please turn on the air conditioner" 1 second after issuing the instruction "please close the car window".
  • with the solution of the embodiment of the present application, the audio data can be acquired in real time and processed to obtain the speech end point corresponding to the first user instruction.
  • according to the semantic information, only a few audio frames whose sound category is "silence" are needed after the word "window" to serve as the "end" endpoint, so the end of the speech corresponding to that instruction can be confirmed a few audio frames after the word "window", and the instruction is responded to in time by closing the car window.
  • the solution of the embodiment of the present application can then continue to acquire audio data and process it to obtain the speech end point corresponding to the second user instruction.
  • similarly, the end of the speech corresponding to the second instruction can be confirmed a few audio frames after the last word of "please turn on the air conditioner", and the instruction is responded to in time by turning on the air conditioner.
  • if the fixed waiting time of the activity-detection-based method mentioned above is 1.5 seconds, the 1-second pause is treated as a short pause and speech recognition continues, so the speech end point corresponding to the first user instruction cannot be obtained; only after the user issues the second instruction and a further 1.5 seconds have elapsed is the speech considered over, after which the two user instructions are responded to.
  • in other words, the scheme of the embodiment can obtain the speech end point of each user instruction more accurately, and is conducive to responding in time after each user instruction is issued, without waiting for the user to issue all instructions before responding; a minimal streaming sketch of this per-instruction dispatch is given below.
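The following is a minimal, self-contained sketch of that dispatch loop. The ASR, energy-based classification and endpoint model are replaced by trivial stubs (two trailing "SIL" frames after some text count as "end"), purely so that the control flow — acquiring audio continuously and responding the moment each end point is found — can be shown in runnable form.

```python
from collections import deque

def endpoint_category(text: str, trailing: list[str]) -> str:
    """Stub for the endpoint model: 'end' once two trailing frames are silent and text exists."""
    if text and trailing[-2:] == ["SIL", "SIL"]:
        return "end"
    return "thinking" if text else "speaking"

def run_stream(frames):
    """frames: iterable of (recognized_word_or_empty, sound_category) per audio frame."""
    text, trailing = "", deque(maxlen=4)
    for word, category in frames:
        if word:
            text = (text + " " + word).strip()   # ASR extends the recognized semantics
            trailing.clear()                      # new speech resets the trailing-frame window
        else:
            trailing.append(category)
        if endpoint_category(text, list(trailing)) == "end":
            print("respond to command:", text)    # dispatch immediately; acquisition continues
            text, trailing = "", deque(maxlen=4)

# Two back-to-back commands separated only by a short silent gap.
run_stream([("close", "SPE"), ("the window", "SPE"), ("", "SIL"), ("", "SIL"),
            ("turn on", "SPE"), ("the air conditioner", "SPE"), ("", "SIL"), ("", "SIL")])
```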
  • FIG. 5 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
  • FIG. 5 can be regarded as a specific example of the method shown in FIG. 4.
  • audio data is acquired and speech recognition is performed in real time.
  • the audio data can be obtained by using a voice collection device such as a microphone.
  • step 601 may be performed by using the above acquisition module 110 .
  • Step 602 is a specific example of a method for obtaining semantics, that is, an example of obtaining semantics by using ASR.
  • optionally, if the ASR recognizes that the audio stream has ended, steps 603 to 606 below need not be executed, and step 607 may be executed directly; this is equivalent to no speech having been recognized in the audio data acquired over a relatively long time interval, so there is no need to further judge the speech end point.
  • Step 602 may be performed by using the above processing module 120 .
  • Step 603 is a specific example of the method for acquiring the sound category.
  • Step 602 and step 603 may or may not be executed at the same time, and there is no limitation on the sequence of execution. Step 602 and step 603 can be regarded as a specific example of step 402 .
  • Step 603 can be executed by using the above processing module 120 .
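Step 603 compares each frame's energy against the two preset energy thresholds described earlier (both derived from the background-sound energy) to label the frame "SPE", "NEU" or "SIL". The sketch below is only illustrative: the patent does not specify how the thresholds are derived from the background energy, so the fixed multipliers used here are assumptions.

```python
# Illustrative energy-threshold classification of audio frames (step 603).
import numpy as np

def frame_energy(frame: np.ndarray) -> float:
    return float(np.mean(frame.astype(np.float64) ** 2))

def classify_frames(frames: list[np.ndarray], background_energy: float) -> list[str]:
    upper = background_energy * 8.0   # first (higher) threshold -- assumed factor
    lower = background_energy * 2.0   # second (lower) threshold -- assumed factor
    labels = []
    for f in frames:
        e = frame_energy(f)
        if e >= upper:
            labels.append("SPE")      # speech
        elif e > lower:
            labels.append("NEU")      # undecided
        else:
            labels.append("SIL")      # silence
    return labels

rng = np.random.default_rng(0)
quiet = [rng.normal(0, 0.01, 160) for _ in range(3)]   # near-silent frames
loud = [rng.normal(0, 0.2, 160) for _ in range(2)]     # frames with speech-like energy
bg = np.mean([frame_energy(f) for f in quiet])         # background (silence) energy estimate
print(classify_frames(quiet + loud, bg))               # expected: ['SIL', 'SIL', 'SIL', 'SPE', 'SPE']
```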
  • the fusion feature can be obtained by superimposing the semantics and the sound categories.
  • step 602 may be performed in real time, that is, speech recognition is performed on the acquired audio data in real time.
  • in step 604, after each word is recognized, the current semantics and the sound categories of one or more audio frames after that word can be superimposed to obtain the fusion feature, and the fusion feature is input to the subsequent speech endpoint classification model for processing (i.e. step 605).
  • for example, suppose the complete instruction the user intends to issue is "I want to watch TV" (我要看电视) and the words recognized so far are "我要看电" (the instruction minus its last character); the fusion feature can be obtained from the semantics of "我要看电" and the sound categories of one or more audio frames after "电", and then input to the speech endpoint classification model of step 605 for processing.
  • after the user continues speaking and the character "视" is recognized from the currently acquired audio data through step 602, the fusion feature can be obtained from the semantics of "我要看电视" and the sound categories of one or more audio frames after "视", and then input to the speech endpoint classification model of step 605 for processing.
  • Step 604 may be performed using the decision module 130 above.
  • the speech endpoint categories include: “Speak”, “Think”, and “End”.
  • when the speech endpoint category is "end", the audio frame corresponding to that endpoint is the speech end point.
  • the combination of steps 604 and 605 is a specific example of step 403.
  • Step 605 may be performed using the decision module 130 above.
  • step 606: determine whether the speech endpoint category is "end"; if the determination result is "yes", perform step 607, i.e. output the speech end point; if the determination result is "no", return to step 601.
  • Step 606 may be performed using the decision module 130 above.
  • Step 607 may be performed using the decision module 130 above.
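Tying steps 601-607 together, the sketch below walks one utterance through the loop of FIG. 5. Every component is a stand-in: a real system would plug in a microphone stream for step 601, an ASR engine for step 602 and the trained endpoint classification model for step 605; the two-trailing-silence rule in the stub classifier is an assumption made only so the loop terminates.

```python
# Illustrative end-to-end sketch of the FIG. 5 loop (steps 601-607).
def acquire_audio():                      # step 601 (stub): yields (new_word, sound_category) pairs
    yield from [("请", "SPE"), ("打", "SPE"), ("开", "SPE"), ("空", "SPE"), ("调", "SPE"),
                ("", "NEU"), ("", "SIL"), ("", "SIL")]

def classify_endpoint(fusion):            # step 605 (stub for the trained model)
    return "end" if fusion[-2:] == ["[SIL]", "[SIL]"] and len(fusion) > 4 else "thinking"

text, trailing = "", []
for word, category in acquire_audio():
    if word:                              # step 602: ASR extends the recognized semantics
        text, trailing = text + word, []
    else:                                 # step 603: sound category of the trailing frame
        trailing.append(f"[{category}]")
    fusion = ["[CLS]", *list(text), *trailing, "[SEP]"]   # step 604: fuse semantics + categories
    if classify_endpoint(fusion[1:-1]) == "end":          # steps 605-606
        print("speech end point reached after:", text)    # step 607: output the speech end point
        break
```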
  • FIG. 6 is a schematic flowchart of a training method for a speech endpoint classification model according to an embodiment of the present application. Each step shown in FIG. 6 will be introduced below.
  • the format of the speech sample may correspond to the format of the data input into the speech endpoint classification model above, and the endpoint category included in the endpoint category label may correspond to the above-mentioned speech endpoint category. That is, the voice sample includes the sound category and semantics of the audio data, and the endpoint category labels include "speak", "think” and "end”.
  • the above input data of the speech endpoint classification model is the input data of the inference stage of the speech endpoint classification model, for example the above semantics and the sound category of the second audio frame; that is, the format of the speech sample can correspond to the format of the semantics and the sound category of the second audio frame.
  • the input data of the inference stage of the speech endpoint classification model may be the above fusion feature, that is, the format of the voice sample may correspond to the above fusion feature format.
  • the speech sample may be in the format of "initial symbol + semantics + sound category + terminator", or the speech sample may be in the format of "initial symbol + sound category + semantics + terminator”.
  • some text corpus can be obtained, and a trie of these corpora can be established.
  • each node in the trie (each node corresponds to one character) includes the following information: whether it is an end point, and its prefix frequency.
  • the above speech samples can be generated according to the node information.
  • the end point is the end point of a sentence, and the prefix frequency indicates how many characters may still follow that character before the end point is reached; the higher the prefix frequency, the less likely the character is to be the end point.
  • for example, characters such as the verbs "给" (give) and "拿" (take) or other prepositions are relatively unlikely to end a sentence and often have a high prefix frequency in the trie, whereas characters such as the particles "吗" and "呢" are relatively likely to end a sentence and tend to have a low prefix frequency in the trie.
  • if a node is not an end point, the endpoint category label of the node's plain text (i.e. its semantics) is "speak", and the endpoint category label of its signal classification information (i.e. sound category) is "think".
  • if the node is an end point, speech samples with different endpoint category labels are generated according to the prefix frequency and the signal classification information (i.e. sound category); the higher the prefix frequency, the more audio frames with the sound category "silence" need to be appended before the node can be labeled "end". This is introduced below in conjunction with FIG. 7, after which a hedged sample-generation sketch is given.
  • FIG. 7 is a schematic diagram of a dictionary tree according to an embodiment of the present application.
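The sketch below illustrates, under stated assumptions, how such training samples could be generated: a trie is built from a small corpus, each node records whether it is an end point and its prefix frequency, and end-point nodes require more trailing "[SIL]" tags before the label "end" is assigned. The corpus, the rule mapping prefix frequency to the number of required silence tags, and the "[CLS] + semantics + sound categories + [SEP]" sample format are illustrative assumptions, not the patent's prescription.

```python
# Hedged sketch: generate (speech sample, endpoint label) pairs from a trie.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)
    is_end: bool = False
    prefix_freq: int = 0      # how many characters may still follow before an end point

def build_trie(corpus):
    root = Node()
    for sentence in corpus:
        node = root
        for i, ch in enumerate(sentence):
            node = node.children.setdefault(ch, Node())
            node.prefix_freq += len(sentence) - i - 1
        node.is_end = True
    return root

def make_samples(root, prefix="", out=None):
    out = [] if out is None else out
    for ch, node in root.children.items():
        text = prefix + ch
        if not node.is_end:
            out.append((f"[CLS]{text}[SEP]", "speak"))
            out.append((f"[CLS]{text}[SIL][SEP]", "think"))
        else:
            need = 1 + node.prefix_freq // 3          # assumed rule: higher prefix freq -> more silence
            out.append((f"[CLS]{text}" + "[SIL]" * (need - 1) + "[SEP]", "think"))
            out.append((f"[CLS]{text}" + "[SIL]" * need + "[SEP]", "end"))
        make_samples(node, text, out)
    return out

trie = build_trie(["打开车窗", "打开空调", "温度调到二十", "温度调到二十六度"])
for sample, label in make_samples(trie)[:6]:
    print(label, sample)
```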
  • Step 701 may be performed by a training device.
  • the training device can be a cloud service device or a terminal device, such as a computer, server, mobile phone, smart speaker, vehicle, drone or robot, or it can be a system composed of a cloud service device and a terminal device, which is not limited in the embodiments of this application.
  • in step 702, the speech endpoint classification model is trained with the training data to obtain the target speech endpoint classification model; for the speech endpoint classification model itself, reference may be made to the introduction above, which is not repeated here.
  • the target speech endpoint classification model can be used to obtain the speech end point of the audio data according to the sound category and semantics.
  • the target speech endpoint classification model can be used to execute step 403 above, and can also be used by the decision-making module 130 to obtain the speech end point.
  • Step 702 may be performed by a training device.
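Purely to illustrate step 702, the sketch below fits a classifier on (speech sample, endpoint label) pairs of the kind shown in FIG. 3. A bag-of-tokens linear model in PyTorch stands in for the BERT-style endpoint model; this substitution, the hand-written samples and the training hyperparameters are all assumptions made to keep the example short and runnable.

```python
import torch
import torch.nn as nn

samples = [("打开车窗[SPE][SIL]", "speak"),
           ("打开车窗[SIL][SPE][SIL][SIL]", "end"),
           ("温度调到二十[SIL][NEU][SIL][SIL]", "think"),
           ("温度调到二十六度[SIL][SIL]", "end")]
labels = ["speak", "think", "end"]

def tokenize(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "[":                      # a sound-category tag like [SIL]
            j = s.index("]", i) + 1
            out.append(s[i:j]); i = j
        else:
            out.append(s[i]); i += 1
    return out

vocab = sorted({tok for text, _ in samples for tok in tokenize(text)})
vocab = {tok: i for i, tok in enumerate(vocab)}

def featurize(text):
    v = torch.zeros(len(vocab))
    for tok in tokenize(text):
        v[vocab[tok]] += 1.0
    return v

X = torch.stack([featurize(t) for t, _ in samples])
y = torch.tensor([labels.index(l) for _, l in samples])

model = nn.Linear(len(vocab), len(labels))
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(200):                          # tiny training loop (step 702)
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()

# After convergence on this toy data the prediction should be "end".
print(labels[int(model(featurize("打开车窗[SIL][SPE][SIL][SIL]")).argmax())])
```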
  • Fig. 8 is a schematic block diagram of a speech recognition device according to an embodiment of the present application.
  • the apparatus 2000 shown in FIG. 8 includes an acquisition unit 2001 and a processing unit 2002 .
  • the acquiring unit 2001 and the processing unit 2002 may be used to execute the speech recognition method of the embodiment of the present application.
  • the acquiring unit 2001 may execute the above step 401, and the processing unit 2002 may execute the above steps 402 and 403.
  • the acquiring unit 2001 may execute the above steps 501 and 504, and the processing unit 2002 may execute the above steps 502-503 and steps 505-506.
  • the acquiring unit 2001 may execute the above step 601, and the processing unit 2002 may execute the above steps 602-606.
  • the obtaining unit 2001 may include the above-mentioned obtaining module 110
  • the processing unit 2002 may include the above-mentioned processing module 120 and decision-making module 130 .
  • processing unit 2002 in the above device 2000 may be equivalent to the processor 3002 in the device 3000 hereinafter.
  • unit here may be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit or a combination of both to realize the above functions.
  • the hardware circuitry may include application specific integrated circuits (ASICs), electronic circuits, processors (such as shared processors, dedicated processors or group processors) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functionality.
  • the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of a voice recognition device according to an embodiment of the present application.
  • the speech recognition apparatus 3000 shown in FIG. 9 includes a memory 3001 , a processor 3002 , a communication interface 3003 and a bus 3004 .
  • the memory 3001 , the processor 3002 , and the communication interface 3003 are connected to each other through a bus 3004 .
  • the memory 3001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 3001 may store programs, and when the programs stored in the memory 3001 are executed by the processor 3002, the processor 3002 and the communication interface 3003 are used to execute various steps of the voice recognition method of the embodiment of the present application.
  • the processor 3002 may adopt a general-purpose central processing unit (CPU), a microprocessor, an ASIC, a graphics processing unit (GPU) or one or more integrated circuits for executing related programs, so as to realize the functions that the processing unit 2002 in the speech recognition device of the embodiment of the present application needs to perform, or to execute the speech recognition method of the method embodiments of the present application.
  • the processor 3002 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the speech recognition method of the present application may be completed by an integrated logic circuit of hardware in the processor 3002 or instructions in the form of software.
  • the above-mentioned processor 3002 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the various methods, steps and logic block diagrams disclosed in the embodiments of the present application can be realized or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 3001, and the processor 3002 reads the information in the memory 3001 and, in combination with its hardware, completes the functions required by the units included in the speech recognition device of the embodiment of the present application, or performs the speech recognition method of the method embodiments of the present application.
  • the processor 3002 may execute the above steps 402 and 403 .
  • the processor 3002 may execute the above steps 502-503 and steps 505-506.
  • the processor 3002 may execute the above steps 602-606.
  • the communication interface 3003 implements communication between the apparatus 3000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the communication interface 3003 may be used to realize the functions required to be performed by the acquiring unit 2001 shown in FIG. 8 .
  • the communication interface 3003 may execute the above step 401 .
  • the communication interface 3003 may perform the above step 501 and step 504 .
  • the communication interface 3003 may perform the above step 601 . That is, the above audio data can be obtained through the communication interface 3003 .
  • the bus 3004 may include a pathway for transferring information between various components of the device 3000 (eg, memory 3001 , processor 3002 , communication interface 3003 ).
  • the speech recognition device 3000 may be set in a vehicle system. Specifically, the voice recognition device 3000 may be set in a vehicle-mounted terminal. Alternatively, the speech recognition device can be set in the server.
  • the voice recognition device 3000 is installed in a vehicle as an example for illustration.
  • the voice recognition device 3000 can also be installed in other devices, for example, the device 3000 can also be applied in devices such as computers, servers, mobile phones, smart speakers, wearable devices, drones or robots.
  • it should be noted that although the apparatus 3000 shown in FIG. 9 only shows a memory, a processor and a communication interface, in a specific implementation process those skilled in the art should understand that the apparatus 3000 also includes other components necessary for normal operation; meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 3000 may also include hardware components for implementing other additional functions; in addition, those skilled in the art should understand that the apparatus 3000 may also include only the components necessary to implement the embodiments of the present application, and does not necessarily include all the components shown in FIG. 9.
  • the disclosed systems, methods and devices can be implemented in other ways.
  • the device embodiments described above are only illustrative; for example, the division of the units is only a division by logical function, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a Universal Serial Bus flash disk (UFD, also referred to as a U disk or USB flash drive), a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种语音识别方法、语音识别装置、系统、计算机可读存储介质和程序产品,涉及人工智能领域,该方法包括:获取音频数据,该音频数据包括多个音频帧(401);提取多个音频帧的声音类别和语义(402);根据声音类别和语义,得到音频数据的语音结束点(403)。该方案通过提取和综合音频数据中的声音类别和语义来得到音频数据的语音结束点,能够更加准确地判定音频数据的语音结束点,以及更准确地对基于语音的后续操作进行响应,提升用户体验。

Description

语音识别方法、语音识别装置及系统 技术领域
本申请实施例涉及人工智能领域,并且更具体地,涉及一种语音识别方法、语音识别装置及系统。
背景技术
当前语音交互产品已经广泛进入到人们的日常生活中,比如智能终端设备、智能家居设备、智能车载设备等。在主流语音交互产品中,语音交互是由用户主动发起的,但语音结束状态(即一轮对话的语音结束点)一般是通过自动语音识别来自动进行判断的。目前,语音结束状态的检测主要存在两个问题:背景噪音导致过晚的语音结束状态、语音中间的停顿导致过早的语音结束状态。
因此,如何更加准确地判断语音结束状态是亟待解决的技术问题。
发明内容
本申请实施例提供一种语音识别方法、语音识别装置及系统,能够更加准确地判断语音结束状态,以便更准确地对基于语音的后续操作进行响应。
第一方面,提供一种语音识别方法,该方法包括:获取音频数据,该音频数据包括多个音频帧;提取多个音频帧的声音类别和语义;根据多个音频帧的声音类别和语义,得到音频数据的语音结束点。
在本申请的技术方案中,通过提取和综合音频数据中的声音类别和语义来得到音频数据的语音结束点,能够更加准确地判定音频数据的语音结束点,以便更准确地对基于语音的后续操作进行响应,提升用户体验。此外,利用本申请实施例的方案进行语音识别时,语音结束点之前的等待时间不是固定的,是会随着实际的语音识别过程而变化的,与现有的预设好了固定等待时间,在该等待时间结束后再响应的方式相比,能够更加准确地得到语音结束点,从而提高用户语音指令响应的及时性和准确性,降低用户的等待时长。
示例性地,该多个音频帧的声音类别可以包括该多个音频帧中的每个音频帧的声音类别。或者,该多个音频帧的声音类别可以包括该多个音频帧中的部分音频帧的声音类别。
应理解,每个音频帧都可以提取出其声音类别,但并不是每个音频帧均可以提取出语义。具体地,不包含人声的音频帧不具有语义,因此,不包含语音的音频帧无法提取出其语义,此时,不包含语音的音频帧的语义可以视为空或无。
结合第一方面,在第一方面的某些实现方式中,在得到语音结束点后,响应音频数据中所述语音结束点之前的音频数据对应的指令。
在得到语音结束点后,可以立即响应该指令,或者也可以经过一段时间后再响应该指令。即在得到语音结束点后,可以立即执行音频数据中所述语音结束点之前的音频数据对 应的操作。或者,在得到语音结束点后,也可以在经过一段时间后再执行音频数据中的语音结束点之前的音频数据对应的操作。该段时间可以为冗余时间或误差时间等。本申请的方案能够更加准确地判定音频数据的语音结束点,进而能够更准确地对语音结束点之前的音频数据对应的指令进行响应,有利于提高用户语音指令响应的及时性和准确性,降低用户的等待时长,提升用户体验。
结合第一方面,在第一方面的某些实现方式中,可以根据多个音频帧的能量和预设能量阈值的关系,得到上述多个音频帧的声音类别。
结合第一方面,在第一方面的某些实现方式中,声音类别包括“说话声”、“待定声音”和“静音”,预设能量阈值包括第一能量阈值和第二能量阈值,第一能量阈值大于第二能量阈值,可以将多个音频帧中能量大于或等于第一能量阈值的音频帧的声音类别确定为“说话声”;或者将多个音频帧中能量小于第一能量阈值且大于第二能量阈值的音频帧的声音类别确定为“待定声音”;或者将多个音频帧中能量小于或等于第二能量阈值的音频帧的声音类别确定为“静音”。
结合第一方面,在第一方面的某些实现方式中,第一能量阈值和第二能量阈值可以是根据音频数据的背景声音的能量确定的。在不同背景环境中,静音能量曲线是不同的,例如在相对安静的环境,静音能量(即背景声音的能量)相对较低,在相对嘈杂的环境,静音能量(即背景声音的能量)相对较高。因此根据静音能量来得到第一能量阈值和第二能量阈值可以适应不同环境下的需求。
结合第一方面,在第一方面的某些实现方式中,多个音频帧包括第一音频帧和第二音频帧,第一音频帧为承载语义的音频帧,第二音频帧为多个音频帧中第一音频帧之后的音频帧;以及根据声音类别和语义,得到音频数据的语音结束点,包括:根据语义和第二音频帧的声音类别得到语音结束点。
第一音频帧为承载语义的多个音频帧。第二音频帧为第一音频帧之后的一个或多个音频帧。
需要说明的是“承载语义的多个音频帧”与音频数据所包含的“多个音频帧”是不同的概念。第一音频帧所包含的音频帧的数量小于音频数据所包含的音频帧的数量。
结合第一方面,在第一方面的某些实现方式中,可以将语义和第二音频帧的声音类别进行融合,得到多个音频帧的融合特征;再根据融合特征,得到语音结束点。利用融合特征进行处理,一方面能够提高处理效率,另一方面也能够提高准确性。
结合第一方面,在第一方面的某些实现方式中,语音端点类别包括“说话”、“思考”或“结束”,可以根据语义和第二音频帧的声音类别,确定音频数据的语音端点类别,在音频数据的语音端点类别为“结束”的情况下,得到语音结束点。
进一步地,在得到该多个音频帧的融合特征的情况下,可以根据该多个音频帧的融合特征,确定音频数据的语音端点类别。
结合第一方面,在第一方面的某些实现方式中,可以利用语音端点分类模型对语义和第二音频帧进行处理,得到语音端点类别,语音端点分类模型是利用语音样本和语音样本的端点类别标签得到的,语音样本的格式与语义和第二音频帧的格式对应,端点类别标签包括的端点类别与语音端点类别对应。
进一步地,在得到该多个音频帧的融合特征的情况下,可以利用语音端点分类模型对 融合特征进行处理,得到语音端点类别,语音端点分类模型是利用语音样本和语音样本的端点类别标签得到的,语音样本的格式与融合特征的格式对应,端点类别标签包括的端点类别与语音端点类别对应。
第二方面,提供一种语音识别方法,包括:获取第一音频数据;确定第一音频数据的第一语音结束点;在得到第一语音结束点后,响应第一音频数据中的第一语音结束点之前的音频数据对应的指令;获取第二音频数据;确定第二音频数据的第二语音结束点;在得到第二语音结束点后,响应第一音频数据和第二音频数据中的第一语音结束点和第二语音结束点之间的音频数据对应的指令。
利用本申请实施例的方案能够更准确的获得语音结束点,避免语音结束点过晚检测导致响应时延过长,更快得到语音结束点,从而及时做出后续响应,减少用户的等待时间,提高用户体验。具体来说,本申请实施例的方案中实时获取音频数据并识别该音频数据中的语音结束点,有利于实时识别出不同的指令的语音结束点,并在得到各个指令的语音结束点后响应该指令。尤其是在用户发出的多个指令之间的间隔较短的情况下,利用本申请的方案有利于在各个指令发出后识别出各个指令的语音结束点,以便及时响应各个指令,而无需等到该多个指令全部发出后再响应所有指令。
结合第二方面,在第二方面的某些实现方式中,第一音频数据包括多个音频帧,确定第一音频数据的第一语音结束点,包括:提取多个音频帧的声音类别和语义;根据多个音频帧的声音类别和语义,得到第一音频数据的第一语音结束点。
结合第二方面,在第二方面的某些实现方式中,提取多个音频帧的声音类别和语义,包括:根据多个音频帧的能量和预设能量阈值的关系,得到多个音频帧的声音类别。
结合第二方面,在第二方面的某些实现方式中,声音类别包括“说话声”、“待定声音”和“静音”,预设能量阈值包括第一能量阈值和第二能量阈值,第一能量阈值大于第二能量阈值,多个音频帧中能量大于或等于第一能量阈值的音频帧的声音类别为“说话声”;或者多个音频帧中能量小于第一能量阈值且大于第二能量阈值的音频帧的声音类别为“待定声音”;或者多个音频帧中能量小于或等于第二能量阈值的音频帧的声音类别为“静音”。
结合第二方面,在第二方面的某些实现方式中,第一能量阈值和第二能量阈值是根据第一音频数据的背景声音的能量确定的。
结合第二方面,在第二方面的某些实现方式中,多个音频帧包括第一音频帧和第二音频帧,第一音频帧为承载语义的音频帧,第二音频帧为多个音频帧中的第一音频帧之后的音频帧;以及根据声音类别和语义,得到第一音频数据的第一语音结束点,包括:根据语义和第二音频帧的声音类别得到第一语音结束点。
结合第二方面,在第二方面的某些实现方式中,语音端点类别包括“说话”、“思考”或“结束”,根据语义和第二音频帧的声音类别得到第一语音结束点,包括:根据语义和第二音频帧的声音类别,确定第一音频数据的语音端点类别,在第一音频数据的语音端点类别为“结束”的情况下,得到第一语音结束点。
结合第二方面,在第二方面的某些实现方式中,根据语义和第二音频帧的声音类别,确定第一音频数据的语音端点类别,包括:利用语音端点分类模型对语义和第二音频帧的声音类别进行处理,得到语音端点类别,语音端点分类模型是利用语音样本和语音样本的端点类别标签得到的,语音样本的格式与语义和第二音频帧的声音类别的格式对应,端点 类别标签包括的端点类别与语音端点类别对应。
第三方面,提供一种语音端点分类模型的训练方法,该训练方法包括:获取训练数据,该训练数据包括语音样本和语音样本的端点类别标签,语音样本的格式与音频数据的多个音频帧的语义和第二音频帧的声音类别的格式对应,所述多个音频帧包括第一音频帧和第二音频帧,第一音频帧为承载语义的音频帧,第二音频帧为第一音频帧之后的音频帧,端点类别标签包括的端点类别与语音端点类别对应;利用训练数据对语音端点分类模型进行训练,得到目标语音端点分类模型。
利用第三方面所述方法得到的目标语音端点分类模型能够用于执行第一方面“利用语音端点分类模型对语义和第二音频帧的声音类别进行处理,得到语音端点类别”的操作。
结合第三方面,在第三方面的某些实现方式中,语音样本可以为“初始符+语义+声音类别+结束符”的格式,或者语音样本可以为“初始符+声音类别+语义+结束符”的格式。
可选地,可以获取一些文本语料,并建立这些语料的字典树,字典树中的每个节点(每个节点对应一个文字)包括以下信息:是否为终点和前缀频次。之后可以根据节点信息生成上述语音样本。终点即为一句话的结束点,前缀频次用于表示该文字之后到结束点还会存在的文字数,前缀频次越高,也说明该文字是终点的可能越小。例如,“给”、“拿”等动词或其他介词等文字作为语句的结束的可能相对较小,在字典树中往往具有较高的前缀频次;而例如“吗”、“呢”等文字作为语句的结束的可能相对较大,在字典树中往往具有较低的前缀频次。
第四方面,提供一种语音识别装置,该装置包括用于执行上述第一方面的任意一种实现方式的方法的单元。
第五方面,提供一种语音识别装置,该装置包括用于执行上述第二方面的任意一种实现方式的方法的单元。
第六方面,提供一种语音端点分类模型的训练装置,该训练装置包括用于执行上述第三方面的任意一种实现方式的方法的单元。
第七方面,提供一种语音识别装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第一方面或第二方面中的任意一种实现方式中的方法。该装置可以设置在各类语音识别、语音助手或智能音箱等需要进行语音结束点判断的设备或系统中,例如可以为手机终端、车载终端或可穿戴设备等各类终端设备,也可以为电脑、主机或服务器等各类具备运算能力的设备。该装置还可以为芯片。
第八方面,提供一种语音端点分类模型的训练装置,该训练装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第三方面中的任意一种实现方式中的方法。该训练装置可以为电脑、主机或服务器等各类具备运算能力的设备。该训练装置还可以为芯片。
第九方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面、第二方面或第三方面中的任意一种实现方式中的方法。
第十方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面、第二方面或第三方面中的任意一种实现方式中的方 法。
第十一方面,提供一种车载系统,该系统包括第四方面、第五方面或第六方面中的任意一种实现方式中的装置。
示例性地,该车载系统可以包括云服务设备和终端设备。该终端设备可以为车辆、车载芯片或车载装置(例如车机、车载电脑)等中的任一项。
第十二方面,提供一种电子设备,该电子设备包括第四方面、第五方面或第六方面中的任意一种实现方式中的装置。
示例性地,电子设备具体可以包括电脑、智能手机、平板电脑、个人数字助理(personal digital assistant,PDA)、可穿戴设备、智能音箱、电视、无人机、车辆、车载芯片、车载装置(例如车机、车载电脑)或机器人等装置中的一个或多个。
在本申请中,通过提取和综合音频数据中的声音类别和语义来得到音频数据的语音结束点,能够更加准确地判定音频数据的语音结束点,以便更准确地对基于语音的后续操作进行响应,提升用户体验。具体来说,本申请的方案能够避免语音结束点过晚检测导致响应时延过长,更快得到语音结束点,从而及时做出后续响应,减少用户的等待时间,提高用户体验;同时,本申请的方案能够得到准确的语音结束点,避免语音结束点过早检测导致的用户的语音指令被过早截断,得到具有完整语义的音频数据,有利于准确地识别用户意图,进而做出准确的响应,提高用户体验。根据背景声音的能量来得到能量阈值则能够适应不同环境下的需求,进一步提高语音结束点的判别的准确性。先将声音类别和语义融合再根据融合特征来得到语音结束点则一方面能够提高处理效率,另一方面也能够进一步提高语音结束点的判别的准确性。
附图说明
图1是本申请实施例的语音识别装置的示意性结构图。
图2是本申请实施例的声音类别的分类示意图。
图3是本申请实施例的语音端点分类模型对融合特征进行处理的示意图。
图4是本申请实施例的语音识别方法的示意性流程图。
图5是本申请实施例的语音识别方法的示意性流程图。
图6是本申请实施例的语音端点分类模型的训练方法的示意性流程图。
图7是本申请实施例的字典树的示意图。
图8是本申请实施例的语音识别装置的示意性框图。
图9是本申请实施例的语音识别装置的硬件结构示意图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
本申请的方案可以应用在多种语音交互场景中。示例性地,本申请的方案可以应用于电子设备的语音交互场景中和电子系统的语音交互场景中。其中,电子设备具体可以包括电脑、智能手机、平板电脑、个人数字助理(personal digital assistant,PDA)、可穿戴设备、智能音箱、电视、无人机、车辆、车载芯片、车载装置(例如车机、车载电脑)或机器人等装置中的一个或多个。电子系统可以包括云服务设备和终端设备。例如,电子系统 可以为车载系统或智能家居系统等。车载系统的端侧设备可以包括车辆、车载芯片或车载装置(例如车机、车载电脑)等装置中的任一项。云服务设备包括实体服务器和虚拟服务器,服务器接收端侧(例如车机)上传的数据,对数据进行处理后将处理后的数据发送给端侧。
下面对两种较为常用的应用场景进行简单的介绍。
应用场景1:智能手机的语音交互
在智能手机中,可以通过语音助手实现语音交互。例如,可以通过与语音助手之间的语音交互操作智能手机,或者,与语音助手进行对话等。具体地,语音助手可以通过麦克风获取音频数据,然后再通过处理单元判断音频数据的语音结束点,在得到音频数据的语音结束点后触发后续响应,例如,将音频数据中的用户意图上报给操作系统进行响应。
通过语音交互,可以实现拨打电话、发送信息、获取路线、播放音乐、获取对话式应答等功能,大大提高了智能手机的科技感与交互的便利性。
利用本申请的方案,能够准确识别音频数据的语音结束点,从而提高后续响应的准确性和及时性,提高用户体验。
应用场景2:车载系统语音交互
在车载系统中,通过语音交互可以实现对车辆的控制。具体地,在车载系统中,可以通过麦克风获取音频数据,然后再通过处理单元判断音频数据的语音结束点,在得到音频数据的语音结束点后触发后续响应,例如,将音频数据中的用户意图上报给车载系统进行响应。
通过语音交互,可以实现获取路线、播放音乐、控制车内硬件(例如,车窗、空调等)等功能,提升车载系统的交互体验。
利用本申请的方案,能够准确识别音频数据的语音结束点,从而提高后续响应的准确性和及时性,提高用户体验。
图1是本申请实施例的语音识别装置的示意性结构图。如图1所示,语音识别装置100用于对音频数据进行处理,得到音频数据的语音结束点,即得到音频数据中的语音的停止点。例如,输入的音频数据包括“我想打电话”这段语音,则可以通过语音识别装置100的处理,得到该段音频数据的语音结束点可以为该段语音中的最后一个字对应的最后一个音频帧,即该段音频数据的语音结束点为“话”字对应的最后一个音频帧。
需要说明的是,音频数据的结束点和音频数据的语音结束点是不同的概念,音频数据的结束点是指音频停止了,例如一段5秒长度的音频数据的结束点即为该段音频的最后一个音频帧。音频数据的语音结束点则是指该段音频数据中的语音的停止点。再以上述5秒长度的音频数据为例,假设前4秒长度的音频均包含语音,第5秒没有语音,则音频数据的语音结束点为第4秒结束对应的音频帧。假设该5秒长度的音频数据均不包含语音,语音识别的预设时间间隔为3秒,如果在连续3秒内都识别不到语音则终止语音的识别,则该5秒音频数据的语音结束点为第3秒结束对应的音频帧。
音频数据可以包括多个音频帧。应理解,对于输入的音频数据可能包括语音但也可能不包括语音。例如,假设某个人唤醒了语音采集功能,但是几秒钟之内都没有说话,则此时采集到的音频数据就是不包括语音的音频数据。
音频数据中的多个音频帧可以为连续的音频帧,也可以为不连续的音频帧。
语音识别装置100包括获取模块110、处理模块120和决策模块130。决策模块130也可以是集成在处理模块120中的。
获取模块110用于获取音频数据,该音频数据可以包括多个音频帧。获取模块110可以包括用于实时获取语音音频的麦克风等语音采集设备。或者,获取模块110也可以包括通信接口,通信接口可以使用收发器一类的收发装置,来实现与其他设备或通信网络之间的通信,以便从其他设备或通信网络获取音频数据。
处理模块120用于对音频数据中的多个音频帧进行处理,得到多个音频帧的声音类别和语义。可以理解为,处理模块120用于提取音频数据中的多个音频帧的声音类别和语义。
语义用于表示音频数据中包括的语言,也可以称之为文本含义、文字含义、语言含义等等。语义也可以用音频流来承载。
示例性地,该多个音频帧的声音类别,可以包括,该多个音频帧中的每个音频帧的声音类别。
或者,该多个音频帧的声音类别,可以包括,该多个音频帧中的部分音频帧的声音类别。
换言之,处理模块120可以提取该多个音频帧中的所有音频帧的声音类别,或者,也可以仅提取该多个音频帧中的部分音频帧的声音类别。
可选地,可以利用自动语音识别(auto speech recognition,ASR)等装置来获取承载了语义的音频流。每段音频流可以用对应的文字表示。每段音频流可能包括一个或多个音频帧。
声音类别可以包括:“说话声(speech,SPE)”、“待定声音(neutral,NEU)”和“静音(silence,SIL)”。说话声是指音频中能够肯定是人在说话的部分(或者理解为音频中的人声部分),待定声音是指音频中相对模糊、无法判定是否为说话声的部分(或者可以理解为音频中的模糊部分),静音是指音频中确定不包括人声的部分(或者可以理解为音频中的无人声的部分)。应理解,在本申请实施例中,静音可以为没有说话声、没有声音或者只有背景声音等,并不是物理意义上的分贝为0或者是完全没有声音。
应理解,声音类别也可以存在其他分类方式,例如可以只包括“说话声”和“静音”,或者可以包括“静音”和“非静音”,或者可以包括“说话声”和“非说话声”。“说话声”和“非说话声也可以分别称之为“人声”和“非人声”。上述“说话声”、“待定声音”和“静音”也可以分别称之为“人声”、“可能是人声”和“不是人声”。应理解,此处仅为示例,本申请实施例对声音类别的分类方式不做限定。
需要说明的是,声音类别相当于是从声学角度对音频数据进行一个判断和分类,用于区分出音频帧的类别。语义则是提取出音频数据中的语言成分,用于从语言角度推断是否话已经说完。应理解,每个音频帧都可以提取出其声音类别。但是由于不包含人声的音频帧不具有语义,因此,不包含语音的音频帧的语义可以视为空或无。
一般而言,不同声音类别的音频帧的能量是存在差异,例如“说话声”的音频帧的能量相对较高,“静音”的音频帧的能量则相对较低,而“待定声音”的音频帧的能量则低于“说话声”的能量但高于“静音”的能量。
音频帧的能量也可以称为音频帧的强度。
可选地,可以根据音频帧的能量来对音频帧进行分类,从而得到该音频帧的声音类别。 具体地,根据音频数据中音频帧的能量获得相应音频帧的声音类别。
在一种实现方式中,可以根据音频帧的能量与预设能量阈值的关系,来得到相应音频帧的声音类别。例如,预设能量阈值可以包括第一能量阈值和第二能量阈值,第一能量阈值大于第二能量阈值,将能量大于或等于第一能量阈值的音频帧确定为说话声,将能量小于第一能量阈值且大于第二能量阈值的音频帧确定为待定声音,将能量小于或等于第二能量阈值的音频帧确定为静音。也就是说,将多个音频帧中能量大于或等于第一能量阈值的音频帧的声音类别确定为“说话声”,将多个音频帧中能量小于第一能量阈值且大于第二能量阈值的音频帧的声音类别确定为“待定声音”,将多个音频帧中能量小于或等于第二能量阈值的音频帧的声音类别确定为“静音”。但应理解,也可以将能量等于第一能量阈值或者等于第二能量阈值的音频确定为待定声音,上述描述作为本申请实施例的一个示例。
图2是本申请实施例的声音类别的分类示意图。如图2所示,横坐标表示音频帧序列,纵坐标表示音频帧序列对应的能量值。能量曲线表示的是音频数据中的多个音频帧的能量变化曲线。第一能量阈值曲线表示“说话声”的能量下限值曲线和“待定声音”的能量上限值曲线,第二能量阈值曲线表示“待定声音”的能量下限值曲线和“静音”的能量上限值曲线,静音能量曲线表示该段音频的背景声音的能量曲线,第一能量阈值曲线和第二能量阈值曲线均可以根据静音能量曲线得到,即第一能量阈值和第二能量阈值均可以根据背景声音的能量得到。
需要说明的是,在不同背景环境中,静音能量曲线是不同的,例如在相对安静的环境,静音能量(即背景声音的能量)相对较低,在相对嘈杂的环境,静音能量(即背景声音的能量)相对较高。因此根据静音能量曲线来得到第一能量阈值曲线和第二能量阈值曲线可以适应不同环境下的需求。
如图2所示,通过将音频数据中音频帧的能量与两个阈值进行比较,将音频帧的声音类别划分为“说话声”(图示SPE)、“待定声音”(图示NEU)和“静音”(图示SIL)三类。如图2中所示,该音频帧序列的声音类别序列从左到右依次为“SPE NEU SIL SIL NEU SPE NEU SIL NEU SPE NEU NEU SPE NEU SIL”。
处理模块120可以是能够进行数据处理的处理器,例如中央处理器或微处理器等,也可以是其他能够进行预算的装置或芯片或集成电路等。
决策模块130用于根据来自于处理模块120的声音类别和语义,得到音频数据的语音结束点。该语音结束点即为语音结束状态的检测结果。
可选地,如果是获取模块110实时获取音频数据,则在得到检测结果后,可以结束获取音频数据。
示例性地,决策模块130可以根据该多个音频帧中的全部音频帧的声音类别以及语义得到上述语音结束点。
例如,可以先根据语义来初步确定可能是语音结束点的文字端点,再根据每个文字端点的声音类别来从可能是语音结束点的文字端点中找出语音结束点。又例如,可以先根据声音类别来初步确定可能是语音结束点的备选音频帧,再根据这些备选音频帧前的语义来确定语音结束点。
示例性地,该多个音频帧包括第一音频帧和第二音频帧,第一音频帧为承载语义的音 频帧,第二音频帧为该多个音频帧中的第一音频帧之后的音频帧。决策模块130可以根据语义和第二音频帧的声音类别得到上述语音结束点。
第一音频帧为承载语义的多个音频帧。第二音频帧为第一音频帧之后的一个或多个音频帧。
需要说明的是“承载语义的多个音频帧”与音频数据所包含的“多个音频帧”是不同的概念。第一音频帧所包含的音频帧的数量小于音频数据所包含的音频帧的数量。
进一步地,决策模块130可以将语义和第二音频帧的声音类别进行融合,以得到融合特征,根据该融合特征得到语音结束点。融合特征可以理解为在承载了语义的音频流之后叠加了一个或多个确定了声音类别的音频帧得到的。举例说明,假设一段音频数据包括“我要看电视”和在“视”之后的5个音频帧,则经过上述提取声音类别和语义之后,可以得到“我要看电视”这段语义和后面5个音频帧的声音类别,则融合特征即为承载了上述“我要看电视”这段语义的多个音频帧叠加上述5个具有声音类别的音频帧。与上述直接处理相比,利用融合特征进行处理,一方面能够提高处理效率,另一方面也能够提高准确性。
可选地,语音端点类别可以包括“说话”、“思考”和“结束”。决策模块130可以根据语义和第二音频帧的声音类别,确定音频数据的语音端点类别,在音频数据的语音端点类别为“结束”的情况下得到上述语音结束点。
在音频数据的语音端点类别为“结束”可以理解为音频数据中包括语音端点类别为“结束”的文字端点。语音端点类别为“结束”的文字端点对应的音频帧可以作为语音结束点。
进一步地,在得到融合特征的情况下,决策模块130根据融合特征确定音频数据的语音端点类别,在音频数据的语音端点类别为“结束”的情况下得到上述语音结束点。
在一些实现方式中,决策模块130还可以利用语音端点分类模型对上述语义和第二音频帧的声音类别进行处理,得到语音端点类别,从而得到音频数据的语音结束点。
进一步地,在得到融合特征的情况下,决策模块130可以利用语音端点分类模型对上述融合特征进行处理,得到语音端点类别,从而得到音频数据的语音结束点。
即将融合特征作为一个输入特征输入语音端点分类模型中进行处理,得到语音端点类别,从而得到音频数据的语音结束点。
应理解,此处仅为示例,例如,决策模块130也可以不对语义和第二音频帧的声音类别进行处理,直接将语义和第二音频帧的声音类别作为两个输入特征输入至语音端点分类模型中进行处理。
需要说明的是,上述语音识别装置100以功能模块的形式体现,这里的术语“模块”可以通过软件和/或硬件形式实现,本申请实施例对此不做限定。上述模块的划分,仅仅是一种逻辑功能划分,实际实现时可以有另外的划分方式,例如,多个模块可以集成在一个模块中。即获取模块110、处理模块120和决策模块130可以集成在一个模块中。或者,该多个模块也可以是各个模块单独存在的。或者,该多个模块中的两个模块集成在一个模块中,比如,决策模块130可以集成于处理模块120中。
上述多个模块可以部署于同一个硬件上,也可以部署于不同的硬件上。或者说,上述多个模块所需要的执行的功能可以由同一个硬件执行,也可以由不同的硬件执行,本申请实施例对此不做限定。
上述语音端点类别可以包括“说话(speaking)”、“思考(thinking)”和“结束(ending)”。 “说话”可以理解为正在说话,也就是说,该端点既不是终止端点也不是停止端点;“思考”可以理解为正在考虑或短暂停顿,也就是说,该端点只是暂停端点,可能后续还有语音;“结束”则可以理解为停止或终止,也就是说,该端点是语音终止的端点。
在一些实现方式中,该语音端点分类模型可以利用语言类模型得到,例如利用基于转换器的双向编码器表示(bidirectional encoder representations from transformers,BERT)模型得到,下面以BERT模型为例介绍,但应理解也可以采用任意的其他可以进行上述分类的语言类模型。
图3是本申请实施例的分类模型对融合特征进行处理的示意图。如图3所示,该分类模型可以为BERT模型,BERT模型包括嵌入层和全连接层(图3中用白色框C表示),嵌入层包括标签嵌入层(token embeddings)、语句嵌入层(segment embeddings)和位置嵌入层(position embeddings),上述嵌入层将输入数据(即上述融合特征)进行处理后得到的结果输出给全连接层,全连接层则进一步得到上述语音端点类别。
如图3所示,输入数据给出了5条融合特征的示例,在图3中分别用In1至In5表示,In1为“[CLS]打开车窗[SPE][SIL][SEP]”,In2为“[CLS]打开车窗[SIL][SPE][SIL][SIL][SEP]”,In3为“[CLS]温度调到二十[SIL][NEU][SIL][SIL][SEP]”,In4为“[CLS]温度调到二十六[SIL][SIL][SIL][SEP]”,In5为“[CLS]温度调到二十六度[SIL][SIL][SEP]”。每个圆圈表示一个元素,[CLS]为BERT模型的输入数据的起始数据或者可以称之为初始符,[SEP]为BERT模型的输入数据的终止数据或者可以称之为结束符。也就是说,在该示例中,融合特征的格式为“[CLS]+语义+声音类别+[SEP]”。
如图3所示,E X表示X的能量值,例如,E CLS表示[CLS]对应的能量。BERT模型的嵌入层和全连接层对输入数据进行处理之后,可以得到语音端点类别。
图3只是语音端点分类模型的一个具体示例,并不存在限定,对于语音端点分类模型的训练方法会在下文中介绍,在此不再展开。
图4是本申请实施例的语音识别方法的示意性流程图。下面对图4的各个步骤进行介绍。图4所示的方法可以由电子设备来执行,电子设备具体可以包括电脑、智能手机、平板电脑、个人数字助理(personal digital assistant,PDA)、可穿戴设备、智能音箱、电视、无人机、车辆、车载芯片、车载装置(例如车机、车载电脑)或机器人等装置中的一个或多个。或者,图4所示的方法400也可以由云服务设备执行。或者,图4所示的方法400还可以是由云服务设备和终端设备构成的系统执行,例如车载系统或智能家居系统等。
示例性地,图4所示的方法可以由图1中的语音识别装置100执行。
401、获取音频数据,音频数据包括多个音频帧。
可以是利用麦克风等语音采集装置获取音频数据,也可以从存储装置或网络中获取音频数据,该音频数据可以是实时获取的,也可以是已存储好的。可以利用上文获取模块110执行步骤401。音频数据及其获取方式的相关介绍参照上文,不再重复。
音频帧可以是通过对音频数据进行分帧操作后得到的。例如,一个音频帧可以为十几毫秒或几十毫秒。
402、提取多个音频帧的声音类别和语义。
可以利用上文处理模块120执行步骤402。声音类别和语义的介绍可以参照上文,不 再赘述。应理解,每个音频帧都可以提取出其声音类别,但并不是每个音频帧均可以提取出语义。具体地,不包含人声的音频帧不具有语义,因此,不包含语音的音频帧无法提取出其语义,此时,不包含语音的音频帧的语义可以视为空或无。
可选地,可以根据音频帧的能量与预设能量阈值的关系,来得到声音类别。例如,预设能量阈值可以包括第一能量阈值和第二能量阈值,第一能量阈值大于第二能量阈值,可以将多个音频帧中能量大于或等于第一能量阈值的音频帧的声音类别确定为“说话声”;或者将多个音频帧中能量小于第一能量阈值且大于第二能量阈值的音频帧的声音类别确定为“待定声音”,第一能量阈值大于第二能量阈值;或者将多个音频帧中能量小于或等于第二能量阈值的音频帧的声音类别确定为“静音”。
可选地,可以利用自动语音识别等装置来获取承载了语义的音频流。每段音频流可以用对应的文字表示。每段音频流可能包括一个或多个音频帧。
示例性地,该多个音频帧的声音类别,可以包括,该多个音频帧中的每个音频帧的声音类别。
或者,该多个音频帧的声音类别,可以包括,该多个音频帧中的部分音频帧的声音类别。
换言之,在步骤402中可以提取该多个音频帧中的所有音频帧的声音类别,或者,也可以仅提取该多个音频帧中的部分音频帧的声音类别。
需要说明的是,每个音频帧都可以提取出其声音类别,即每个音频帧均可以得到其对应的声音类别。而通常多个音频帧对应一个文字,或者说,一个文字是由多个音频帧承载的。以图3为例,图3中的“打开车窗”中的每个文字分别对应多个音频帧,而每个声音类别对应一个音频帧。
403、根据声音类别和语义,得到音频数据的语音结束点。
示例性地,可以利用上文的决策模块130执行步骤403。
可选地,方法400还包括步骤404(图中未示出)。
404,在得到语音结束点后,响应音频数据中的语音结束点之前的音频数据对应的指令。
换言之,在得到语音结束点之后,可以执行音频数据中的语音结束点之前的音频数据对应的操作。音频数据中的语音结束点之前的音频信号对应的操作也可以理解为语音结束对应的操作。
需要说明的是,在得到语音结束点后,可以立即执行音频数据中的语音结束点之前的音频数据对应的操作。或者,在得到语音结束点后,也可以在经过一段时间后执行音频数据中的语音结束点之前的音频信数据对应的操作。该段时间可以为冗余时间或误差时间等。
语音结束对应的操作可以为任意业务处理功能中的操作。
示例性地,在得到语音结束点后,可以停止语音识别。或者,在得到语音结束点后,可以将语音识别结果返回给用户。或者,在得到语音结束点后,可以将语音识别结果发送至后续模块,以使后续模块执行相应的操作。例如,音频数据中可以包括控制指令,后续模块执行该指令相应的控制操作。比如,音频数据中可以包括问询指令,后续模块将该问询指令对应的回答语句返回给用户。
语音结束对应的操作,即步骤404可以由方法400的执行装置执行,或者,也可以由其他装置执行,本申请实施例对此不作限定。
下面以音频信号中包括用户指令为例进行说明。用户指令可以用于实现获取路线、播放音乐、控制硬件(例如,灯、空调等)等各种功能。例如,以控制空调为例,用户指令可以为,打开空调。应理解,后续模块可以为一个模块,也可以为多个模块。
示例性地,在得到语音结束点后,可以向后续模块发送指示信息,该指示信息用于指示语音结束点,以便后续模块可以获取该音频数据中语音结束点之前的音频数据,基于该音频数据得到语义文本(例如,“打开空调”),进而基于该语义文本解析出用户指令,控制相应模块执行语音执行指示的操作。
例如,在得到语音结束点后,指示ASR停止语音识别,ASR将语音识别结果(例如,“打开空调”的语义文本)发送至语义分析模块,由语义分析模块解析出用户指令,发送控制信号至空调,以控制空调打开。
示例性地,在得到语音结束点后,可以向后续模块发送该音频数据中语音结束点之前的音频数据,以便后续模块基于该音频数据得到语义文本(例如,“打开空调”),进而基于该语义文本解析出用户指令,控制相应模块执行语音执行指示的操作。
示例性地,在得到语音结束点后,可以基于该音频数据得到语义文本(例如,“打开空调”),进而基于该语义文本解析出用户指令,控制相应模块执行语音执行指示的操作。
可选地,可以根据该多个音频帧中的全部音频帧的声音类别以及语义得到上述语音结束点。
例如,可以先根据语义来初步确定可能是语音结束点的文字端点,再根据每个文字端点的声音类别来从可能是语音结束点的文字端点中找出语音结束点。又例如,可以先根据声音类别来初步确定可能是语音结束点的备选音频帧,再根据这些备选音频帧前的语义来确定语音结束点。
示例性地,该多个音频帧包括第一音频帧和第二音频帧,第一音频帧为承载语义的音频帧,第二音频帧为该多个音频帧中的第一音频帧之后的音频帧。可以根据语义和第二音频帧的声音类别得到上述语音结束点。
第一音频帧为承载语义的多个音频帧。第二音频帧为第一音频帧之后的一个或多个音频帧。
需要说明的是“承载语义的多个音频帧”与音频数据所包含的“多个音频帧”是不同的概念。第一音频帧所包含的音频帧的数量小于音频数据所包含的音频帧的数量。
进一步地,可以将语义和第二音频帧的声音类别进行融合,以得到融合特征,根据该融合特征得到语音结束点。利用融合特征进行处理,一方面能够提高处理效率,另一方面也能够提高准确性。
在一些实现方式中,语音端点类别可以包括“说话”、“思考”和“结束”,可以根据语义和第二音频帧的声音类别,确定音频数据的语音端点类别,在音频数据的语音端点类别为“结束”的情况下得到上述语音结束点。
在音频数据的语音端点类别为“结束”可以理解为音频数据中包括语音端点类别为“结束”的文字端点。语音端点类别为“结束”的文字端点对应的音频帧可以作为语音结束点。
进一步地,在得到融合特征的情况下,可以根据融合特征确定音频数据的语音端点类 别,在音频数据的语音端点类别为“结束”的情况下得到上述语音结束点。
可选地,可以利用语音端点分类模型对上述语义和第二音频帧的声音类别进行处理,得到语音端点类别,从而得到音频数据的语音结束点。
进一步地,在得到融合特征的情况下,可以利用语音端点分类模型对融合特征进行处理,得到融合特征的语音端点类别,从而得到音频数据的语音结束点。
该语音端点分类模型可以是利用语音样本和语音样本的端点类别标签得到的,且语音样本的格式与融合特征的格式对应,端点类别标签包括的端点类别与语音端点类别对应。
在图4所示方案中,通过提取和综合音频数据中的声音类别和语义来得到音频数据的语音结束点,能够更加准确地判定音频数据的语音结束点。
应理解,上述音频数据可以是实时获取的音频数据,也可以是读取已经存储好的音频数据,可以分别理解为在线语音识别和离线语音识别。在得到了语音结束点之后,可以用于执行后续的基于语音的控制等操作。例如可以用于控制电子设备的开关,用于进行信息查询,用于控制播放音频视频等等。如果是在线语音识别,可以在得到语音结束点之后结束音频数据的获取,即停止语音识别。或者,如果是在线语音识别,也可以在得到语音结束点后执行语音结束点之前的音频数据对应的指令,并继续获取音频数据。
利用本申请实施例的方案能够更准确的获得语音结束点,从而更准确地对基于语音的后续操作进行响应,提升用户体验。具体来说,本申请的方案能够避免语音结束点过晚检测导致响应时延过长,更快得到语音结束点,从而及时做出后续响应,减少用户的等待时间,提高用户体验;同时,本申请的方案能够得到准确的语音结束点,避免语音结束点过早检测导致的用户的语音指令被过早截断,得到具有完整语义的音频数据,有利于准确地识别用户意图,进而做出准确的响应,提高用户体验。
举例说明,假设在一个语音控制电子设备的场景中,用户共产生7秒音频数据:前3秒为“请打开空调”的语音,第4秒为1秒停顿,第5-7秒为咳嗽声。如果采用本申请实施例的方案可以准确得到“调”字为语音结束点,则在第3秒之后或第4秒可以结束语音获取。而传统的基于活性检测的方法则需要设置固定等待时间,只有在实际等待时间大于或等于该固定等待时间之后才会结束语音获取。假设此处采用基于活性检测的方法,且该固定等待时间为2秒,则上述1秒的停顿就会被认为是短暂停顿,语音识别会继续,从而继续识别后面的3秒咳嗽声以及咳嗽声之后依然需要等待该固定等待时间。则与采用本申请实施例的方案相比,基于活性检测的方法共延迟至少5秒才会结束语音获取。
假设在一个语音控制电子设备的场景中,用户共产生6秒音频数据:“请打电话给钱一二”,但在“钱”字后有1.5秒的停顿,若基于活性检测的方法的上述固定等待时间为1.5秒。如果采用本申请实施例的方案能够准确得到“二”字为语音结束点,因为根据语义信息,“钱”字之后需要更多的声音类别为“静音”的音频帧才能被作为“结束”端点,所以不会在1.5秒停顿之后就结束语音获取,而是在第6秒或第6秒之后会结束语音获取。而如果采用基于活性检测的方法,则会在“钱”字后的1.5秒停顿结束时结束语音获取,导致语音结束点的判断错误,无法响应后续的控制策略。
利用本申请实施例的方案进行语音识别时,语音结束点之前的等待时间不是固定的,是会随着实际的语音识别过程而变化的,与现有的预设好了固定等待时间,在该等待时间结束后才得到语音结束点的方式相比,能够更加准确地得到语音结束点,从而提高后续响 应的及时性和准确性,降低用户的等待时长,提高用户体验。
本申请实施例还提供了一种语音识别方法500。方法500包括步骤501至步骤506。下面对方法500的各个步骤进行介绍。方法500可以理解为一种在线语音识别方法的示例。
501,获取第一音频数据。
第一音频数据可以是实时获取的。该第一音频数据中可以包括多个音频帧。
502,确定第一音频数据的第一语音结束点。
可选地,第一音频数据包括多个音频帧。步骤502可以包括:提取该多个音频帧的声音类别和语义;根据该多个音频帧的声音类别和所述语义,得到第一音频数据的第一语音结束点。
第一语音结束点的具体确定方法可以参考前文中的方法400,将方法400中的“音频数据”替换为“第一音频数据”即可,此处不再赘述。
503,在得到第一语音结束点后,响应第一音频数据中的第一语音结束点之前的音频数据对应的指令。
换言之,在得到第一语音结束点后,可以执行第一语音结束点之前的音频数据对应的操作。
为了便于描述,将第一音频数据中的第一语音结束点之前的音频数据对应的指令称为第一用户指令。
第一语音结束点即为第一用户指令的语音结束点。第一音频数据可以只包括一条用户指令,通过步骤502能够识别出该指令的语音结束点。
504,获取第二音频数据。
需要说明的是,方法500中的步骤的序号仅为了描述方便,不对步骤的执行顺序构成限定。在方法500中,音频数据的处理过程和音频数据的获取过程可以是相互独立的。换言之,在步骤502的执行的同时,可以继续获取音频数据,即执行步骤504,只要第二音频数据为第一音频数据之后获取的音频数据即可。
示例性地,第一音频数据和第二音频数据可以是连续采集的音频数据。
505,确定第二音频数据的第二语音结束点。
可选地,第二音频数据包括多个音频帧。步骤502可以包括:提取该多个音频帧的声音类别和语义;根据该多个音频帧的声音类别和所述语义,得到第一音频数据的第一语音结束点。
或者,第一音频数据中的第一语音结束点之后的音频数据和第二音频数据所构成的音频数据包括多个音频帧,步骤502可以包括:提取该多个音频帧的声音类别和语义;根据该多个音频帧的声音类别和语义,得到第二音频数据的第二语音结束点。
例如,第一音频数据中的第一语音结束点之后的音频数据包括4个音频帧,第二音频数据包括10个音频帧,则该4个音频帧和该10个音频帧构成的音频数据包括14个音频帧,提取该14个音频帧的声音类别和语义;根据该14个音频帧的声音类别和语义,得到第二音频数据的第二语音结束点。
第二语音结束点的具体确定方法可以参考前文中的方法400,此处不再赘述。
506,在得到第二语音结束点后,响应第一音频数据和第二音频数据中的第一语音结束点和第二语音结束点之间的音频数据对应的指令。
例如,第一音频数据中的第一语音结束点之后的音频数据包括4个音频帧,第二音频数据包括10个音频帧,则该4个音频帧和该10个音频帧构成的音频数据包括14个音频帧,第二音频数据的第二语音结束点位于第12个音频帧处。在得到第二语音结束点后,响应该14个音频帧中第12个音频帧之前的音频数据对应的指令。
为了便于描述,将第一音频数据和第二音频数据中的第一语音结束点和第二语音结束点之间的音频数据对应的指令称为第二用户指令。
第二语音结束点即为第二用户指令的语音结束点。第一音频数据中的第一语音结束点之后的音频数据和第二音频数据所构成的音频数据中可以只包括一条用户指令,通过步骤505可以识别出该指令的语音结束点。第一用户指令和第二用户指令可以是用户连续发出的两条用户指令,即两条用户指令之间的时间间隔较小。例如,在第一音频数据和第二音频数据是连续采集的音频数据的情况下,第一用户指令和第二用户指令之间的时间间隔较小。本申请的方案有利于区分出第一用户指令的语音结束点和第二用户指令的语音结束点。
利用本申请实施例的方案能够更准确的获得语音结束点,避免语音结束点过晚检测导致响应时延过长,更快得到语音结束点,从而及时做出后续响应,减少用户的等待时间,提高用户体验。具体来说,本申请实施例的方案中实时获取音频数据并识别该音频数据中的语音结束点,有利于实时识别出不同的指令的语音结束点,并在得到各个指令的语音结束点后响应该指令。尤其是在用户发出的多个指令之间的间隔较短的情况下,利用本申请的方案有利于在各个指令发出后识别出各个指令的语音结束点,以便及时响应各个指令,而无需等到该多个指令全部发出后再响应所有指令。
举例说明,假设在一个语音控制电子设备的场景中,用户共产生8秒音频数据,该8秒的音频数据中包括“请关闭车窗”和“请打开空调”两个用户指令,这两个用户指令之间的间隔较小,例如,1秒,即用户在发出“请关闭车窗”的用户指令的1秒后发出“请打开空调”的用户指令。如果采用本申请实施例的方案可以实时获取音频数据并进行处理,得到第一个用户指令对应的语音结束点。根据语义信息,“窗”字之后需要几个声音类别为“静音”的音频帧即可作为“结束”端点,即在“窗”字之后经过几个音频帧即可确认该用户指令对应的语音结束,及时响应该用户指令,执行关闭车窗的操作。同时本申请实施例的方案可以继续获取音频数据并进行处理,得到第二个用户指令对应的语音结束点,例如,在“调”字之后经过几个音频帧即可确认该用户指令对应的语音结束,及时响应该用户指令,执行打开空调的操作。若基于活性检测的方法的上述固定等待时间为1.5秒,则上述1秒的停顿就会被认为是短暂停顿,语音识别会继续,无法得到第一个用户指令对应的语音结束点,只有在用户发出第二个用户指令后等待1.5秒才会得到认为语音结束,然后响应这两个用户指令分别执行相应的操作。
换言之,采用本申请实施例的方案,能够准确得到该多个用户指令对应的语音结束点,及时响应各个用户指令,尤其是在多个用户指令之间的时间间隔较小的情况下,本申请实施例的方案能够更准确地得到各个用户指令的语音结束点,有利于在各个用户指令发出后及时作出响应,而无需等待用户发出所有指令后再进行响应。
图5是本申请实施例的语音识别方法的示意性流程图。图5可以看作是图4所示方法的一个具体示例,在该例子中,实时获取音频数据和进行语音识别。
601、实时获取音频数据。
可以利用麦克风等语音采集设备获取音频数据。
或者,可以利用上文获取模块110执行步骤601。
602、利用ASR对音频数据进行语音识别,得到承载语义的音频流。
步骤602即为获取语义的方法的一个具体示例,即利用ASR来得到语义的示例。
可选地,如果ASR识别出音频流已结束,可以不再执行下面的步骤603至606,直接转为执行步骤607。相当于在较长一段时间间隔内获取的音频数据中已经识别不出语音了,不需要再进一步判断语音结束点。
可以利用上文处理模块120执行步骤602。
603、根据音频数据的能量和预设能量阈值的关系,得到音频数据中音频帧的声音类别。
步骤603即为获取声音类别的方法的一个具体示例。
步骤602和步骤603可以同时执行也可以不同时执行,且执行的先后顺序不存在限定。步骤602和步骤603可以看作是步骤402的一个具体例子。
可以利用上文处理模块120执行步骤603。
604、将语义和声音类别进行融合,得到融合特征。
该融合特征可以是将语义和声音类别进行叠加得到。
示例性地,步骤602可以是实时执行的,即实时对获取的音频数据进行语音识别。在步骤604中,在每识别出一个字之后,即可以将当前的语义和该字之后的一个或多个音频帧的声音类别进行叠加得到融合特征,并将该融合特征输入至后续的语音端点分类模型进行处理(即步骤605)。举例说明,假设用户要发出的完整指令为“我要看电视”,当前发出的指令为“我要看电”,通过步骤602对当前获取的音频数据进行语音识别,识别出“电”字后,即可根据“我要看电”这段语义和“电”之后的一个或多个音频帧的声音类别得到融合特征,进而输入至步骤605中的语音端点分类模型进行处理。在用户继续发出“视”后,通过步骤602对当前获取的音频数据进行语音识别,识别出“视”字后,即可根据“我要看电视”这段语义和“视”之后的一个或多个音频帧的声音类别得到融合特征,进而输入至步骤605中的语音端点分类模型进行处理。
与上述直接处理相比,利用融合特征进行处理,一方面能够提高处理效率,另一方面也能够提高准确性。具体内容也可以参照图3中有关输入数据的介绍,在此不再赘述。
可以利用上文决策模块130执行步骤604。
605、利用语音端点分类模型对融合特征进行处理,得到语音端点类别。
该语音端点类别包括:“说话”、“思考”和“结束”。当语音端点类别为“结束”时,该端点对应的音频帧即为语音结束点。
步骤604和605的结合即为步骤403的一个具体示例。
可以利用上文决策模块130执行步骤605。
606、判断语音端点类别是否为“结束”,当判定结果为“是”,执行步骤607;当判定结果为“否”,转为执行步骤601。
可以利用上文决策模块130执行步骤606。
607、输出语音结束点。
可以利用上文决策模块130执行步骤607。
图6是本申请实施例的语音端点分类模型的训练方法的示意性流程图。下面对图6所示各个步骤进行介绍。
701、获取训练数据,该训练数据包括语音样本和语音样本的端点类别标签。
可选地,语音样本的格式可以与上文输入语音端点分类模型的数据的格式对应,端点类别标签包括的端点类别可以与上文所述语音端点类别对应。即语音样本包括音频数据的声音类别和语义,端点类别标签包括“说话”、“思考”和“结束”。
上文输入语音端点分类模型的数据即语音端点分类模型的推理阶段的输入数据,例如,上文中的语义和第二音频帧的声音类别,即语音样本的格式可以与上文语义和第二音频帧的声音类别的格式对应。再如,语音端点分类模型的推理阶段的输入数据可以为上文的融合特征,即语音样本的格式可以与上文融合特征的格式对应。
在一些实现方式中,语音样本可以为“初始符+语义+声音类别+结束符”的格式,或者语音样本可以为“初始符+声音类别+语义+结束符”的格式。
可选地,可以获取一些文本语料,并建立这些语料的字典树,字典树中的每个节点(每个节点对应一个文字)包括以下信息:是否为终点和前缀频次。之后可以根据节点信息生成上述语音样本。终点即为一句话的结束点,前缀频次用于表示该文字之后到结束点还会存在的文字数,前缀频次越高,也说明该文字是终点的可能越小。例如,“给”、“拿”等动词或其他介词等文字作为语句的结束的可能相对较小,在字典树中往往具有较高的前缀频次;而例如“吗”、“呢”等文字作为语句的结束的可能相对较大,在字典树中往往具有较低的前缀频次。
如果某个节点不是终点,则该节点的纯文本(即语义)的端点类别标签是“说话”,信号分类信息(即声音类别)的端点类别标签是“思考”。如果该节点是终点,则根据前缀频次和信号分类信息(即声音类别)产生不同端点类别标签的语音样本。前缀频次越高,需要附加更多的声音类别为“静音”的音频帧,该节点才能被标注为“结束”。下面结合图7进行介绍。
图7是本申请实施例的字典树的示意图。如图7所示的字典树,每个节点包括一个文字,灰色圈表示该节点不是终点,白色圈表示该节点是终点,前缀频次用数字和字母表示(如图中得到0、1、2和x、y、z),例如白色圈内的“五,0”表示该节点为终点,前缀频次为0;白色圈内的“十,z”表示该节点是终点,前缀频次为z=x+5+y+2+1+0+0,即该节点到字典树末端的所有节点前缀频次的总和。应理解,图7中的字母是为了说明字典树中节点的前缀频次的标注方法,在实际中不需要引入字母,字典树建立后,字典树中节点的前缀频次是固定的。
可以由训练设备执行步骤701。示例性地,训练设备可以是云服务设备,也可以是终端设备,例如,电脑、服务器、手机、智能音箱、车辆、无人机或机器人等装置,也可以是由云服务设备和终端设备构成的系统,本申请实施例对此不做限定。
702、利用训练数据对语音端点分类模型进行训练,得到目标语音端点分类模型。
该语音端点分类模型可以参照上文介绍,不再一一列举。该目标语音端点分类模型可以用于根据声音类别和语义,得到音频数据的语音结束点。该目标语音端点分类模型可以用于执行上文步骤403,也可以由决策模块130使用来得到语音结束点。
可以由训练设备执行步骤702。
图8是本申请实施例的语音识别装置的示意性框图。图8所示的装置2000包括获取单元2001和处理单元2002。
获取单元2001和处理单元2002可以用于执行本申请实施例的语音识别方法。例如,获取单元2001可以执行上述步骤401,处理单元2002可以执行上述步骤402和403。又例如,获取单元2001可以执行上述步骤501和步骤504,处理单元2002可以执行上述步骤502-503以及步骤505-506。又例如,获取单元2001可以执行上述步骤601,处理单元2002可以执行上述步骤602-606。
获取单元2001可以包括上述获取模块110,处理单元2002可以包括上述处理模块120和决策模块130。
应理解,上述装置2000中的处理单元2002可以相当于下文中的装置3000中的处理器3002。
需要说明的是,上述装置2000以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现,对此不作具体限定。
例如,“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用特有集成电路(application specific integrated circuit,ASIC)、电子电路、用于执行一个或多个软件或固件程序的处理器(例如共享处理器、专有处理器或组处理器等)和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。
因此,在本申请的实施例中描述的各示例的单元,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
图9是本申请实施例的语音识别装置的硬件结构示意图。图9所示的语音识别装置3000(该装置3000具体可以是一种计算机设备)包括存储器3001、处理器3002、通信接口3003以及总线3004。其中,存储器3001、处理器3002、通信接口3003通过总线3004实现彼此之间的通信连接。
存储器3001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器3001可以存储程序,当存储器3001中存储的程序被处理器3002执行时,处理器3002和通信接口3003用于执行本申请实施例的语音识别方法的各个步骤。
处理器3002可以采用通用的中央处理器(central processing unit,CPU),微处理器,ASIC,图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的语音识别装置中的处理单元2002所需执行的功能,或者执行本申请方法实施例的语音识别方法。
处理器3002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的语音识别方法的各个步骤可以通过处理器3002中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器3002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、ASIC、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行 本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器3001,处理器3002读取存储器3001中的信息,结合其硬件完成本申请实施例的语音识别装置中包括的单元所需执行的功能,或者执行本申请方法实施例的语音识别方法。例如,处理器3002可以执行上述步骤402和403。又例如,处理器3002可以执行上述步骤502-503以及步骤505-506。又例如,处理器3002可以执行上述步骤602-606。
通信接口3003使用例如但不限于收发器一类的收发装置,来实现装置3000与其他设备或通信网络之间的通信。通信接口3003可以用于实现图8所示的获取单元2001所需执行的功能。例如,通信接口3003可以执行上述步骤401。又例如,通信接口3003可以执行上述步骤501和步骤504。又例如,通信接口3003可以执行上述步骤601。即可以通过通信接口3003获取上述音频数据。
总线3004可包括在装置3000各个部件(例如,存储器3001、处理器3002、通信接口3003)之间传送信息的通路。
在一种实现方式中,该语音识别装置3000可以设置于车载系统中。具体地,语音识别装置3000可以设置于车载终端中。或者,语音识别装置可以设置于服务器中。
需要说明的是,此处仅以语音识别装置3000设置于车辆中为例进行说明。语音识别装置3000还可以设置于其他设备中,例如,装置3000还可以应用于电脑、服务器、手机、智能音箱、可穿戴设备、无人机或机器人等设备中。
应注意,尽管图9所示的装置3000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置3000还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置3000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置3000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图9中所示的全部器件。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同装置来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:通用串行总线闪存盘(USB flash disk,UFD),UFD也可以简称为U盘或者优盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (36)

  1. 一种语音识别方法,其特征在于,包括:
    获取音频数据,所述音频数据包括多个音频帧;
    提取所述多个音频帧的声音类别和语义;
    根据所述多个音频帧的声音类别和所述语义,得到所述音频数据的语音结束点。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    在得到所述语音结束点后,响应所述音频数据中所述语音结束点之前的音频数据对应的指令。
  3. 如权利要求1或2所述的方法,其特征在于,所述提取所述多个音频帧的声音类别和语义,包括:
    根据所述多个音频帧的能量和预设能量阈值的关系,得到所述多个音频帧的声音类别。
  4. 如权利要求1至3中任一项所述的方法,其特征在于,所述声音类别包括“说话声”、“待定声音”和“静音”,所述预设能量阈值包括第一能量阈值和第二能量阈值,所述第一能量阈值大于所述第二能量阈值,
    所述多个音频帧中能量大于或等于所述第一能量阈值的音频帧的声音类别为“说话声”;或者
    所述多个音频帧中能量小于所述第一能量阈值且大于所述第二能量阈值的音频帧的声音类别为“待定声音”;或者
    所述多个音频帧中能量小于或等于所述第二能量阈值的音频帧的声音类别为“静音”。
  5. 如权利要求4所述的方法,其特征在于,所述第一能量阈值和所述第二能量阈值是根据所述音频数据的背景声音的能量确定的。
  6. 如权利要求1至5中任一项所述的方法,其特征在于,所述多个音频帧包括第一音频帧和第二音频帧,所述第一音频帧为承载所述语义的音频帧,所述第二音频帧为所述多个音频帧中的所述第一音频帧之后的音频帧;以及
    所述根据所述多个音频帧的声音类别和所述语义,得到所述音频数据的语音结束点,包括:
    根据所述语义和所述第二音频帧的声音类别得到所述语音结束点。
  7. 如权利要求6所述的方法,其特征在于,语音端点类别包括“说话”、“思考”或“结束”,所述根据所述语义和所述第二音频帧的声音类别得到所述语音结束点,包括:
    根据所述语义和所述第二音频帧的声音类别,确定所述音频数据的语音端点类别,在所述音频数据的语音端点类别为“结束”的情况下,得到所述语音结束点。
  8. 如权利要求7所述的方法,其特征在于,所述根据所述语义和所述第二音频帧的声音类别,确定所述音频数据的语音端点类别,包括:
    利用语音端点分类模型对所述语义和所述第二音频帧的声音类别进行处理,得到所述语音端点类别,所述语音端点分类模型是利用语音样本和所述语音样本的端点类别标签得到的,所述语音样本的格式与所述语义和所述第二音频帧的声音类别的格式对应,所述端 点类别标签包括的端点类别与所述语音端点类别对应。
  9. 一种语音识别方法,其特征在于,包括:
    获取第一音频数据;
    确定所述第一音频数据的第一语音结束点;
    在得到所述第一语音结束点后,响应所述第一音频数据中的所述第一语音结束点之前的音频数据对应的指令;
    获取第二音频数据;
    确定所述第二音频数据的第二语音结束点;
    在得到所述第二语音结束点后,响应所述第一音频数据和所述第二音频数据中的所述第一语音结束点和所述第二语音结束点之间的音频数据对应的指令。
  10. 如权利要求9所述的方法,其特征在于,所述第一音频数据包括多个音频帧,所述确定所述第一音频数据的第一语音结束点,包括:
    提取所述多个音频帧的声音类别和语义;
    根据所述多个音频帧的声音类别和所述语义,得到所述第一音频数据的第一语音结束点。
  11. 如权利要求10所述的方法,其特征在于,所述提取所述多个音频帧的声音类别和语义,包括:
    根据所述多个音频帧的能量和预设能量阈值的关系,得到所述多个音频帧的声音类别。
  12. 如权利要求10或11所述的方法,其特征在于,所述声音类别包括“说话声”、“待定声音”和“静音”,所述预设能量阈值包括第一能量阈值和第二能量阈值,所述第一能量阈值大于所述第二能量阈值,
    所述多个音频帧中能量大于或等于所述第一能量阈值的音频帧的声音类别为“说话声”;或者
    所述多个音频帧中能量小于所述第一能量阈值且大于所述第二能量阈值的音频帧的声音类别为“待定声音”;或者
    所述多个音频帧中能量小于或等于所述第二能量阈值的音频帧的声音类别为“静音”。
  13. 如权利要求12所述的方法,其特征在于,所述第一能量阈值和所述第二能量阈值是根据所述第一音频数据的背景声音的能量确定的。
  14. 如权利要求10至13中任一项所述的方法,其特征在于,所述多个音频帧包括第一音频帧和第二音频帧,所述第一音频帧为承载所述语义的音频帧,所述第二音频帧为所述多个音频帧中的所述第一音频帧之后的音频帧;以及
    所述根据所述多个音频帧的声音类别和所述语义,得到所述第一音频数据的第一语音结束点,包括:
    根据所述语义和所述第二音频帧的声音类别得到所述第一语音结束点。
  15. 如权利要求14所述的方法,其特征在于,语音端点类别包括“说话”、“思考”或“结束”,所述根据所述语义和所述第二音频帧的声音类别得到所述第一语音结束点,包括:
    根据所述语义和所述第二音频帧的声音类别,确定所述第一音频数据的语音端点类 别,在所述第一音频数据的语音端点类别为“结束”的情况下,得到所述第一语音结束点。
  16. 如权利要求15所述的方法,其特征在于,所述根据所述语义和所述第二音频帧的声音类别,确定所述第一音频数据的语音端点类别,包括:
    利用语音端点分类模型对所述语义和所述第二音频帧的声音类别进行处理,得到所述语音端点类别,所述语音端点分类模型是利用语音样本和所述语音样本的端点类别标签得到的,所述语音样本的格式与所述语义和所述第二音频帧的声音类别的格式对应,所述端点类别标签包括的端点类别与所述语音端点类别对应。
  17. A speech recognition apparatus, characterized by comprising:
    an acquisition unit, configured to acquire audio data, the audio data comprising a plurality of audio frames; and
    a processing unit, configured to extract sound categories and semantics of the plurality of audio frames,
    wherein the processing unit is further configured to obtain a speech end point of the audio data according to the sound categories of the plurality of audio frames and the semantics.
  18. The apparatus according to claim 17, characterized in that the processing unit is further configured to:
    after the speech end point is obtained, respond to an instruction corresponding to audio data, in the audio data, that precedes the speech end point.
  19. The apparatus according to claim 17 or 18, characterized in that the processing unit is specifically configured to obtain the sound categories of the plurality of audio frames according to a relationship between energy of the plurality of audio frames and preset energy thresholds.
  20. The apparatus according to any one of claims 17 to 19, characterized in that the sound categories comprise "speech sound", "pending sound" and "silence", the preset energy thresholds comprise a first energy threshold and a second energy threshold, the first energy threshold is greater than the second energy threshold, and the sound category of an audio frame, among the plurality of audio frames, whose energy is greater than or equal to the first energy threshold is "speech sound"; or
    the sound category of an audio frame, among the plurality of audio frames, whose energy is less than the first energy threshold and greater than the second energy threshold is "pending sound"; or
    the sound category of an audio frame, among the plurality of audio frames, whose energy is less than or equal to the second energy threshold is "silence".
  21. The apparatus according to claim 20, characterized in that the first energy threshold and the second energy threshold are determined according to energy of a background sound of the audio data.
  22. The apparatus according to any one of claims 17 to 21, characterized in that the plurality of audio frames comprise a first audio frame and a second audio frame, the first audio frame is an audio frame carrying the semantics, and the second audio frame is an audio frame, among the plurality of audio frames, that follows the first audio frame; and the processing unit is specifically configured to:
    obtain the speech end point according to the semantics and the sound category of the second audio frame.
  23. The apparatus according to claim 22, characterized in that speech endpoint categories comprise "speaking", "thinking" or "end", and the processing unit is specifically configured to:
    determine a speech endpoint category of the audio data according to the semantics and the sound category of the second audio frame, and obtain the speech end point when the speech endpoint category of the audio data is "end".
  24. The apparatus according to claim 23, characterized in that the processing unit is specifically configured to:
    process the semantics and the sound category of the second audio frame by using a speech endpoint classification model to obtain the speech endpoint category, wherein the speech endpoint classification model is obtained by using speech samples and endpoint category labels of the speech samples, a format of the speech samples corresponds to a format of the semantics and the sound category of the second audio frame, and endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  25. A speech recognition apparatus, characterized by comprising:
    an acquisition unit, configured to acquire first audio data; and
    a processing unit, configured to:
    determine a first speech end point of the first audio data; and
    after the first speech end point is obtained, respond to an instruction corresponding to audio data, in the first audio data, that precedes the first speech end point,
    wherein the acquisition unit is further configured to acquire second audio data; and
    the processing unit is further configured to:
    determine a second speech end point of the second audio data; and
    after the second speech end point is obtained, respond to an instruction corresponding to audio data, in the first audio data and the second audio data, that lies between the first speech end point and the second speech end point.
  26. The apparatus according to claim 25, characterized in that the first audio data comprises a plurality of audio frames, and the processing unit is specifically configured to:
    extract sound categories and semantics of the plurality of audio frames; and
    obtain the first speech end point of the first audio data according to the sound categories of the plurality of audio frames and the semantics.
  27. The apparatus according to claim 26, characterized in that the processing unit is specifically configured to:
    obtain the sound categories of the plurality of audio frames according to a relationship between energy of the plurality of audio frames and preset energy thresholds.
  28. The apparatus according to claim 26 or 27, characterized in that the sound categories comprise "speech sound", "pending sound" and "silence", the preset energy thresholds comprise a first energy threshold and a second energy threshold, and the first energy threshold is greater than the second energy threshold, wherein
    the sound category of an audio frame, among the plurality of audio frames, whose energy is greater than or equal to the first energy threshold is "speech sound"; or
    the sound category of an audio frame, among the plurality of audio frames, whose energy is less than the first energy threshold and greater than the second energy threshold is "pending sound"; or
    the sound category of an audio frame, among the plurality of audio frames, whose energy is less than or equal to the second energy threshold is "silence".
  29. The apparatus according to claim 28, characterized in that the first energy threshold and the second energy threshold are determined according to energy of a background sound of the first audio data.
  30. The apparatus according to any one of claims 26 to 29, characterized in that the plurality of audio frames comprise a first audio frame and a second audio frame, the first audio frame is an audio frame carrying the semantics, and the second audio frame is an audio frame, among the plurality of audio frames, that follows the first audio frame; and
    the processing unit is specifically configured to:
    obtain the first speech end point according to the semantics and the sound category of the second audio frame.
  31. The apparatus according to claim 30, characterized in that speech endpoint categories comprise "speaking", "thinking" or "end", and the processing unit is specifically configured to:
    determine a speech endpoint category of the first audio data according to the semantics and the sound category of the second audio frame, and obtain the first speech end point when the speech endpoint category of the first audio data is "end".
  32. The apparatus according to claim 31, characterized in that the processing unit is specifically configured to:
    process the semantics and the sound category of the second audio frame by using a speech endpoint classification model to obtain the speech endpoint category, wherein the speech endpoint classification model is obtained by using speech samples and endpoint category labels of the speech samples, a format of the speech samples corresponds to a format of the semantics and the sound category of the second audio frame, and endpoint categories included in the endpoint category labels correspond to the speech endpoint categories.
  33. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for performing the method according to any one of claims 1 to 8 or claims 9 to 16.
  34. A speech recognition apparatus, characterized in that the apparatus comprises a processor and a data interface, the processor reading, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 8 or claims 9 to 16.
  35. A computer program product, characterized in that, when the computer program is executed on a computer, the computer is caused to perform the method according to any one of claims 1 to 8 or claims 9 to 16.
  36. An in-vehicle system, characterized in that the system comprises the apparatus according to any one of claims 17 to 24 or claims 25 to 32.
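
A minimal Python sketch of the dual-threshold sound classification recited in claims 3 to 5 (and mirrored in claims 11 to 13, 19 to 21 and 27 to 29) is given below. It is an editorial illustration, not the claimed implementation; the mean-squared-amplitude energy measure, the margin factors and all function names are assumptions introduced here.

```python
import numpy as np

# Sound-category labels chosen to match the claim wording.
SPEECH, PENDING, SILENCE = "speech sound", "pending sound", "silence"

def frame_energy(frame: np.ndarray) -> float:
    """Mean squared amplitude of one audio frame (assumed energy measure)."""
    return float(np.mean(frame.astype(np.float64) ** 2))

def energy_thresholds(background_frames, high_margin=8.0, low_margin=2.0):
    """Derive the first and second thresholds from background-sound energy (claims 5/13).

    The margin factors are assumptions; the claims only require that both thresholds
    be determined from the background sound and that the first exceed the second.
    """
    background = float(np.mean([frame_energy(f) for f in background_frames]))
    return background * high_margin, background * low_margin

def classify_frame(frame: np.ndarray, first_threshold: float, second_threshold: float) -> str:
    """Dual-threshold classification of a single frame (claims 4/12/20/28)."""
    energy = frame_energy(frame)
    if energy >= first_threshold:
        return SPEECH
    if energy <= second_threshold:
        return SILENCE
    return PENDING
```

As one possible usage, 16 kHz audio split into 10 ms frames would be classified by calling classify_frame on each 160-sample frame with the two thresholds returned by energy_thresholds over a short leading stretch of background audio.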
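
Claims 6 to 8 (and their counterparts in claims 14 to 16, 22 to 24 and 30 to 32) obtain the speech end point from the semantics and the sound categories of the frames that follow the semantics-bearing frame, via a speech endpoint category of "speaking", "thinking" or "end". The claims call for a trained speech endpoint classification model; the Python sketch below substitutes a rule-based stand-in purely to illustrate the inputs and outputs, and every heuristic, label string and threshold in it is an assumption.

```python
from typing import List, Optional

SPEAKING, THINKING, END = "speaking", "thinking", "end"
SPEECH, SILENCE = "speech sound", "silence"  # sound-category labels, matching the claim wording

def endpoint_category(semantics: str, trailing_categories: List[str]) -> str:
    """Rule-based stand-in for the speech endpoint classification model.

    `semantics` is the text decoded so far; `trailing_categories` are the sound
    categories of the frames after the last semantics-bearing frame.
    """
    if trailing_categories and trailing_categories[-1] == SPEECH:
        return SPEAKING  # the user is still talking

    # Length of the run of silent frames at the end of the trailing window.
    silence_run = 0
    for category in reversed(trailing_categories):
        if category != SILENCE:
            break
        silence_run += 1

    # Crude completeness check on the semantics (assumed heuristic).
    incomplete_endings = ("and", "to", "the", "please")
    text = semantics.strip().lower()
    looks_complete = bool(text) and not text.endswith(incomplete_endings)

    if looks_complete and silence_run >= 30:  # e.g. ~300 ms of 10 ms frames (assumption)
        return END
    return THINKING  # likely a mid-utterance pause

def speech_end_point(frame_categories: List[str], semantics: str) -> Optional[int]:
    """Index of the frame at which the endpoint category first becomes "end", if any."""
    last_speech = max((i for i, c in enumerate(frame_categories) if c == SPEECH), default=-1)
    for i in range(last_speech + 1, len(frame_categories)):
        if endpoint_category(semantics, frame_categories[last_speech + 1:i + 1]) == END:
            return i
    return None  # no end point yet; keep listening
```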
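
Claims 9 and 25 describe consecutive rounds: after the first speech end point, the command carried by the audio before that point is answered; after the second speech end point, the command carried by the audio between the two end points is answered. One possible bookkeeping scheme for that behaviour is sketched below in Python, assuming an external detect_end_point callable that returns the index of a newly found end point (or None) and a respond callable that executes the recognized command; neither name comes from the application.

```python
class DialogueSession:
    """Keeps one growing frame buffer and answers each command as soon as its
    speech end point is found, following the two-round flow of claims 9 and 25."""

    def __init__(self, detect_end_point, respond):
        self.frames = []    # all audio frames received so far, across rounds
        self.last_end = -1  # index of the most recently detected speech end point
        self.detect_end_point = detect_end_point
        self.respond = respond

    def feed(self, frame) -> None:
        """Add one frame and respond if this frame completes a command."""
        self.frames.append(frame)
        end = self.detect_end_point(self.frames, start=self.last_end + 1)
        if end is not None:
            # First round: everything before the first end point.
            # Later rounds: the audio between the previous end point and this one.
            self.respond(self.frames[self.last_end + 1:end + 1])
            self.last_end = end
```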
PCT/CN2021/133207 2021-11-25 2021-11-25 Speech recognition method, speech recognition apparatus and system WO2023092399A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180104424.0A 2021-11-25 2021-11-25 Speech recognition method, speech recognition apparatus and system
PCT/CN2021/133207 WO2023092399A1 (zh) 2021-11-25 2021-11-25 Speech recognition method, speech recognition apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/133207 WO2023092399A1 (zh) 2021-11-25 2021-11-25 Speech recognition method, speech recognition apparatus and system

Publications (1)

Publication Number Publication Date
WO2023092399A1 true WO2023092399A1 (zh) 2023-06-01

Family

ID=86538476

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133207 WO2023092399A1 (zh) 2021-11-25 2021-11-25 Speech recognition method, speech recognition apparatus and system

Country Status (2)

Country Link
CN (1) CN118235199A (zh)
WO (1) WO2023092399A1 (zh)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314884A (zh) * 2011-08-16 2012-01-11 捷思锐科技(北京)有限公司 Voice activation detection method and device
US20160358598A1 (en) * 2015-06-07 2016-12-08 Apple Inc. Context-based endpoint detection
CN106847270A (zh) * 2016-12-09 2017-06-13 华南理工大学 Dual-threshold speech endpoint detection method for place names
CN108257616A (zh) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Detection method and device for human-machine dialogue
CN110047470A (zh) * 2019-04-11 2019-07-23 深圳市壹鸽科技有限公司 Speech endpoint detection method
CN111583923A (zh) * 2020-04-28 2020-08-25 北京小米松果电子有限公司 Information control method and device, and storage medium
CN112567457A (zh) * 2019-12-13 2021-03-26 华为技术有限公司 Speech detection method, prediction model training method, apparatus, device and medium
US11056098B1 (en) * 2018-11-28 2021-07-06 Amazon Technologies, Inc. Silent phonemes for tracking end of speech
CN113345473A (zh) * 2021-06-24 2021-09-03 科大讯飞股份有限公司 Speech endpoint detection method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN118235199A (zh) 2024-06-21

Similar Documents

Publication Publication Date Title
KR101056511B1 (ko) Voice activity detection and continuous speech recognition system in noisy environments using real-time call-command recognition
JP2021067939A (ja) Method, apparatus, device and medium for voice interaction control
US11132509B1 (en) Utilization of natural language understanding (NLU) models
CN110689877A (zh) Speech end point detection method and device
JP2004538543A (ja) System and method for multimodal focus detection, reference ambiguity resolution and mood classification using multimodal input
CN113362828B (zh) Method and apparatus for recognizing speech
CN109754809A (zh) Speech recognition method and apparatus, electronic device and storage medium
Këpuska et al. A novel wake-up-word speech recognition system, wake-up-word recognition task, technology and evaluation
CN109712610A (zh) Method and apparatus for recognizing speech
JP6797338B2 (ja) Information processing apparatus, information processing method and program
CN116417003A (zh) Voice interaction system and method, electronic device and storage medium
CN103680505A (zh) Speech recognition method and system
WO2022074869A1 (en) System and method for producing metadata of an audio signal
CN116153311A (zh) Audio processing method and apparatus, vehicle and computer-readable storage medium
KR20190032557A (ko) Voice-based communication
Raux Flexible turn-taking for spoken dialog systems
CN113160854A (zh) Voice interaction system, related method, apparatus and device
CN111009240A (zh) Voice keyword screening method and apparatus, travel terminal, device and medium
CN113611316A (zh) Human-computer interaction method, apparatus, device and storage medium
CN112185357A (zh) Device and method for simultaneously recognizing human voice and non-human voice
WO2023092399A1 (zh) Speech recognition method, speech recognition apparatus and system
KR20210042520A (ko) Electronic device and control method thereof
WO2021253779A1 (zh) Speech recognition method and system
Kos et al. A speech-based distributed architecture platform for an intelligent ambience
WO2022217621A1 (zh) Voice interaction method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21965134; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2021965134; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2021965134; Country of ref document: EP; Effective date: 20240524)