WO2018233300A1 - Speech recognition method and speech recognition device - Google Patents

Speech recognition method and speech recognition device

Info

Publication number
WO2018233300A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
processor
speech recognition
voice
acoustic
Prior art date
Application number
PCT/CN2018/076031
Other languages
English (en)
French (fr)
Inventor
杨向东 (Yang Xiangdong)
Original Assignee
京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to US16/327,319 (US11355124B2)
Publication of WO2018233300A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065: Adaptation
    • G10L15/07: Adaptation to the speaker
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L2015/227: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of the speaker; Human-factor methodology

Definitions

  • the present disclosure relates to the field of speech recognition, and in particular to a speech recognition method and a speech recognition device.
  • Voice control technology based on speech recognition is being applied more and more widely. However, noise, or sound from persons other than the controlling person, can interfere with speech recognition of the controlling person and with the corresponding voice control operations. Voice control technology is therefore considerably limited in scenarios with high requirements on accuracy and safety. Especially in a small space, when several people are present or there is loud noise, it is difficult to identify the controller, which may lead to erroneous operation or danger. For example, in a vehicle driving scenario, since the space inside the vehicle is relatively small, sound from any position may affect the voice control of the in-vehicle system, which can seriously affect driving safety.
  • the present disclosure proposes a speech recognition method and a speech recognition apparatus.
  • a speech recognition method includes: acquiring an identification result of an operator; acquiring, based on the operator's identification result, an acoustic feature set corresponding to the operator; and identifying the operator's voice from received sound based on the acquired acoustic feature set.
  • prior to the step of obtaining the operator's identification result, the speech recognition method further includes determining an environmental state.
  • the step of obtaining an operator's identification result further includes: obtaining an operator's identity recognition result according to the environmental state.
  • the step of determining an environmental state includes: receiving sensor data from the at least one environmental sensor; determining, based on the sensor data, whether an identification function needs to be activated; and returning, based on the determination, an environmental state indicating whether the identification function needs to be activated.
  • the voice recognition method further includes: issuing an identification reminder when the result of the identification is not obtained within a preset time period.
  • the speech recognition method further includes creating an identity for the operator and establishing a corresponding set of acoustic features for the operator.
  • establishing a corresponding set of acoustic features for the operator comprises: receiving speech of the operator; extracting acoustic features of the operator from the received speech; and establishing an acoustic feature set corresponding to the operator based on the extracted acoustic features.
  • establishing a corresponding set of acoustic features for the operator comprises: receiving speech of the operator; transmitting the received speech to a server; and receiving an acoustic feature set corresponding to the operator from the server.
  • the step of identifying the operator's voice from the received sound further comprises: extracting acoustic features from the received sound; matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and, if they match, identifying the received sound as the voice of the operator.
  • the step of matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator comprises: calculating, based on the acquired acoustic feature set corresponding to the operator, a maximum likelihood probability of the extracted acoustic features; and, when the calculated probability is greater than a first threshold, determining that the extracted acoustic features match the operator's acoustic feature set.
  • the speech recognition method further includes updating the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
  • the speech recognition method further includes identifying an operation to be performed from the operator's voice.
  • a voice recognition device includes: a processor; and a memory having instructions stored thereon which, when executed by the processor, cause the processor to: obtain an identification result of an operator; obtain, based on the operator's identification result, an acoustic feature set corresponding to the operator; and identify the operator's voice from received sound based on the acquired acoustic feature set.
  • the instructions when executed by the processor, further cause the processor to: determine an environmental state; and perform an operation of obtaining an operator's identification result based on the environmental state.
  • the instructions, when executed by the processor, further cause the processor to: receive sensor data from at least one environmental sensor; determine, based on the sensor data, whether activation of an identification function is required; and return, based on the determination, an environmental state indicating whether the identification function needs to be activated.
  • the instructions when executed by the processor, further cause the processor to issue an identification reminder when a result of the identification is not obtained within a predetermined time period.
  • the instructions when executed by the processor, further cause the processor to: create an identity for the operator and establish a corresponding set of acoustic features for the operator.
  • the instructions, when executed by the processor, further cause the processor to: receive the operator's speech; extract the operator's acoustic features from the received speech; and establish an acoustic feature set corresponding to the operator based on the extracted acoustic features.
  • the instructions, when executed by the processor, further cause the processor to: receive the operator's speech; transmit the received speech to a server; and receive an acoustic feature set corresponding to the operator from the server.
  • the instructions, when executed by the processor, further cause the processor to: extract acoustic features from the received sound; match the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and, if they match, identify the received sound as the voice of the operator.
  • the instructions, when executed by the processor, further cause the processor to: calculate, based on the acquired acoustic feature set corresponding to the operator, a maximum likelihood probability of the extracted acoustic features; and, when the calculated probability is greater than a first threshold, determine that the extracted acoustic features match the operator's acoustic feature set.
  • the instructions, when executed by the processor, further cause the processor to: update the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
  • the instructions when executed by the processor, further cause the processor to: identify an operation to be performed from the operator's voice.
  • FIG. 1A illustrates a network architecture of a vehicle voice control network in accordance with an embodiment of the present disclosure
  • FIG. 1B illustrates an in-vehicle voice control scenario for a vehicle in a network architecture in accordance with an embodiment of the present disclosure
  • FIG. 2 is a block diagram showing the structure of a voice recognition apparatus according to an embodiment of the present disclosure
  • FIG. 3 illustrates a flow chart of a speech recognition method in accordance with an embodiment of the present disclosure.
  • the speech recognition apparatus and the speech recognition method proposed by the present disclosure can be applied to various scenes capable of speech recognition, such as home appliance control, industrial machine operation, vehicle driving, and so on; the present disclosure is not limited in this respect.
  • the speech recognition apparatus and the speech recognition method of the present disclosure are particularly suitable for an application scenario requiring a specific operator to operate on a target device.
  • applying the speech recognition apparatus and the speech recognition method proposed by the present disclosure to the target device can improve the accuracy of the operator's operation of the target device and increase the security of the target device.
  • first, a vehicle driving scene that can be used to implement the voice recognition device and the voice recognition method of the present disclosure will be described with reference to FIGS. 1A and 1B.
  • FIG. 1A shows a network architecture of a vehicle voice control network 100
  • FIG. 1B shows an in-vehicle voice control scenario for a single vehicle 110A in the network architecture.
  • the vehicle voice control network 100 includes vehicles 110A and 110B and a cloud server 120.
  • Vehicles 110A and 110B communicate with cloud server 120 via wireless communication, respectively. It should be understood that although only two vehicles 110A and 110B are shown in FIG. 1A, in other embodiments, the network 100 may include more or fewer vehicles, and the disclosure is not limited herein.
  • the cloud server 120 can be a local or remote server implemented by any server configuration that enables processing of transceiving, computing, storing, training, etc. of data from the vehicle.
  • the wireless communication between the cloud server 120 and the vehicles can be implemented by various means such as cellular communication (such as 2G, 3G, 4G, or 5G mobile communication technology), WiFi, and satellite communication. Although the cloud server 120 and the vehicles 110A and 110B are shown in FIG. 1A as communicating directly, it should be understood that in other embodiments of the present disclosure there may be indirect communication between them.
  • the vehicles 110A and 110B communicate with the cloud server 120 by wireless communication to implement voice control of the vehicle or the in-vehicle system using data obtained from the cloud server 120.
  • an in-vehicle voice control scene of the vehicle 110A is shown as an example in FIG. 1B.
  • a voice control device 112 is disposed in the vehicle 110A, and an identity recognition device 114 is also exemplarily disposed.
  • the identification device 114 can be implemented as part of the speech recognition device 112, for example as an identity recognition unit integrated in the speech recognition device 112.
  • the voice recognition device 112 is capable of collecting and processing sounds and controlling vehicle operations based on the processing results.
  • speech recognition device 112 includes a sound input unit and a processor.
  • the sound input unit may be, for example, a microphone for receiving sound from the outside and converting it into an electrical signal.
  • the processor is configured to process the generated electrical signal and instruct the vehicle to operate based on the result of the processing.
  • the voice recognition device 112 may not include a sound input unit, but may instead receive the desired sound-related signals from an external electronic device (e.g., an external dedicated microphone, or a sound collection device disposed in the vehicle separately from the voice recognition device 112).
  • the speech recognition device 112 may also include a database.
  • Data related to the driver's identity and voice can be stored in the database.
  • the database may include data needed for the processor to process the sound signal, such as acoustic model parameters, acoustic feature sets, and the like.
  • data related to the identity of the driver such as driver ID, driver preference data, driver facial features, etc., may be included in the database.
  • the identification device 114 is used to identify the driver. Although the identification device 114 is illustrated in FIG. 1B as a camera for face recognition, it should be understood that in other embodiments of the present disclosure the identification device 114 may be implemented as another device for iris recognition, fingerprint recognition, password recognition, or login information recognition, such as a fingerprint reader or a keyboard.
  • various sensors such as a door sensor, a driving position sensor, a driving state sensor, etc., may also be disposed in the vehicle 110A for sensing whether the driver is approaching or entering the vehicle.
  • speech recognition device 112 and/or identification device 114 are only activated when the driver is sensed to approach or enter the vehicle to reduce power consumption.
  • vehicle operating information may also be obtained from the vehicle system bus and determined whether the voice recognition device 112 and/or the identity recognition device 114 are activated based on the vehicle operating information.
  • the driving position sensor can utilize pedestrian tracking technology.
  • the driving position sensor may be a camera mounted on the rear view mirror for acquiring an image at a driving position.
  • the human figure at the driving position in the image is recognized by the human-body-shape classifier used in pedestrian tracking technology.
  • the driving state sensor may be a switch-type Hall sensor mounted on the axle that detects vehicle speed.
  • for example, a piece of magnetic steel can be glued to the edge of a non-magnetic disc on the wheel, with the Hall sensor placed near the edge of the disc; each revolution of the disc causes the Hall sensor to output one pulse, so that the number of revolutions can be measured.
  • when a vehicle speed is sensed, the vehicle can be judged to be in a driving state; when no vehicle speed is sensed, the vehicle is judged to be stopped.
  • the door sensor may be a Hall sensor mounted on the door, which judges whether the door is open or closed from its state relative to a magnet on the door frame. For example, when the magnet is close to the Hall sensor, a specific level can be output, at which point the door can be judged to be closed; otherwise, the door can be judged to be open.
  • the driver P1 in the driving position and the passenger P2 in the passenger driving position are also schematically shown in Fig. 1B.
  • FIG. 2 shows a structural block diagram of a voice recognition device 112 according to an embodiment of the present disclosure.
  • the voice recognition device 112 may include a sound input unit 210, an identity recognition result acquisition unit 220, an acoustic feature set acquisition unit 230, an acoustic recognition unit 240, an environmental state determination unit 250, an identity recognition reminder unit 260, an identity creation unit 270, an instruction recognition unit 280, and an update unit 290.
  • the environment state determination unit 250, the identity recognition reminder unit 260, the identity creation unit 270, the instruction recognition unit 280, and the update unit 290 are illustrated by dashed boxes, which are not necessary in the embodiments of the present disclosure. In other embodiments, one or more of the units may be omitted or combined, or other processing modules may be added depending on the processing performed.
  • the sound input unit 210 may be, for example, a microphone for receiving sound from the outside and converting it into an electrical signal.
  • the identification result acquisition unit 220 can be used to acquire the identification result of the operator (ie, the driver P1 of the scene shown in FIG. 1B).
  • the acoustic feature set acquisition unit 230 is configured to acquire an acoustic feature set corresponding to the operator based on the operator's identification result.
  • the acoustic recognition unit 240 is configured to recognize the voice of the operator from the received sound based on the acquired acoustic feature set.
  • the environmental state determination unit 250 is for determining an environmental state.
  • the identity reminding unit 260 is configured to issue an identity reminder when the result of the identity recognition is not obtained within the preset time period.
  • the identity creation unit 270 is for creating an identity for the operator and establishing a corresponding set of acoustic features for the operator.
  • the instruction recognition unit 280 is for recognizing an operation to be performed from the voice of the operator.
  • the updating unit 290 is configured to update the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than the second threshold.
  • speech recognition device 112 may include, for example, a processor and a memory. Instructions may be stored on the memory that, when executed by the processor, cause the processor to perform the method steps described below in connection with FIG. 3. Moreover, the instructions, when executed by the processor, can also cause the processor to function as the individual functional units 210-290 described above.
  • in the case where the voice recognition device 112 includes hardware for sound input (for example, a microphone), the sound input unit 210 may be that hardware itself; in the case where the voice recognition device 112 does not include hardware for sound input, the sound input unit 210 may be a processor-implemented functional unit configured to receive signals related to sound input from external hardware.
  • FIG. 3 illustrates a flow diagram of a speech recognition method 300 that may be used in conjunction with the speech recognition device 112 illustrated in FIG. 2 in accordance with an embodiment of the present disclosure.
  • the method 300 begins in step S310, in which an operator's identification result is obtained. Then, in step S320, based on the operator's identification result, an acoustic feature set corresponding to the operator is acquired. Next, in step S330, the operator's voice is recognized from the received sound based on the acquired acoustic feature set.
  • in step S310, the operator's identification result is obtained.
  • the identification may be implemented by, for example, the external identification device 114 shown in FIG. 1B.
  • the identity recognition result acquisition unit 220 in the voice recognition device 112 can obtain the result of the identity recognition from the identity recognition device 114.
  • step S310 may simply perform an operation of receiving an identification result from, for example, the identity recognition device 114.
  • the operator is identified by the identity recognition result acquisition unit 220 in the voice recognition device 112.
  • the obtaining step is equivalent to the step of performing identification.
  • identity recognition can be implemented by means of face recognition, iris recognition, fingerprint recognition, password recognition, and login information recognition.
  • the speech recognition method 300 may further include the step of determining an environmental state, and in step S310 it is determined according to the environmental state whether the result of the identification is to be acquired. For example, the operation in step S310 can be performed only when the environmental state satisfies a predetermined condition (for example, a person approaches, or the driving position sensor senses pressure).
  • step S310 can be performed by the environmental state determination unit 250.
  • the environmental state (taking the scenes shown in Figs. 1A and 1B as an example) can be judged by information from one or a combination of the following: a door sensor, a driving position sensor, a traveling state sensor, and a vehicle system bus.
  • the identification function (hardware and/or software module) can be activated based on the information detected by the sensors installed on the vehicle.
  • more specifically, when the owner first comes near the vehicle, the identification function (hardware and/or software modules) can be activated through, for example, a Bluetooth transceiver and/or RFID reader installed on the vehicle communicating with the Bluetooth/RFID module on the owner's key.
  • as another example, when the owner opens the door, the open state of the door and/or the insertion state of the car key and/or a door-opening command can be detected by the door sensor, Bluetooth transceiver, and/or RFID reader installed on the vehicle, and the identification function (hardware and/or software modules) can be activated accordingly.
  • as a further example, when the owner sits in the driving position and starts the vehicle by key insertion, fingerprint, or voice, the corresponding sensors (Bluetooth, RFID, driving position sensor, fingerprint sensor, microphone, etc.) can detect the corresponding information and activate the identification function (hardware and/or software modules).
  • as described above, by judging the environmental state, the step of obtaining the identification result can be activated only when the driver (i.e., the operator) approaches or enters the vehicle (at which point the door sensor, the driving position sensor, the driving state sensor, and the vehicle system bus have specific states or values), so that power consumption can be effectively reduced.
  • when it is determined according to the environmental state that the result of the identification is to be acquired, if the identification result is not obtained within a preset time period (e.g., a period of a certain length (e.g., 10 seconds) or the period before the car starts), an identification reminder can be sent to the operator.
  • the reminder can be, for example, an alarm, a flash of light, a vibration, or the like.
  • this operation can be performed by the identity alerting unit 260.
  • for example, taking face recognition: if the car is stopped, the door is closed, and/or a person has been tracked at the driving position, the voice recognition device 112 may determine that identification should be performed. At this point, the identification device 114 should have begun attempting to extract the driver's facial features. If the driver's facial features are not detected within the predetermined time period, that is, the voice recognition device 112 has been unable to obtain the result of the identification, a voice reminder will be given to the driver and detection will continue until the driver's facial features are acquired.
  • in step S320, an acoustic feature set corresponding to the operator may be acquired based on the operator's identification result.
  • the step S320 can be performed by the acoustic feature set acquisition unit 230.
  • An acoustic feature set refers to a collection of acoustic features. Acoustic features are important concepts in the field of speech recognition. The following is a brief description of the relevant content:
  • in the field of speech recognition, a specific speech input sequence $O = \{o_1, \ldots, o_n\}$ (each $o_i$ being a speech unit such as a frame or a state) must be recognized as a specific word sequence $W = \{w_1, \ldots, w_n\}$; that is, speech recognition solves $W^{*} = \arg\max_{W \in L} P(W \mid O)$, where $P(W \mid O)$ is the conditional probability that $W$ (the corresponding actual word sequence) occurs given the event $O$ (the actual observation), and $L$ is the range of all possible values of $W$.
  • by Bayes' rule, $P(W \mid O) = P(O \mid W)P(W)/P(O)$; since $P(O)$ is constant for a single sentence, this reduces to $W^{*} = \arg\max_{W \in L} P(O \mid W)P(W)$. The $P(O \mid W)$ part is called the maximum likelihood probability and can be calculated by the acoustic model; the $P(W)$ part is called the prior probability and can be calculated by the language model. This embodiment mainly involves the $P(O \mid W)$ part, i.e., the acoustic model.
  • the acoustic model concerns how to calculate the degree to which a phoneme matches a segment of speech signal. Therefore, a suitable method of representing the speech signal is needed.
  • generally, the speech signal is divided into speech units, for example into many frames. Each frame is converted into an acoustic feature by the acoustic model used (via a series of operations such as the Fourier transform).
  • acoustic features include linear prediction coefficients, cepstral coefficients, Mel frequency cepstral coefficients, perceptual linear prediction coefficients, and the like.
  • by accumulating speech material, a large number of acoustic features can be extracted from it, and the correspondence between these acoustic features and phonemes can be obtained.
  • These acoustic features having a corresponding relationship with phonemes constitute an acoustic feature set.
  • from another perspective, using an acoustic feature set, classifiers from acoustic features to phonemes can be trained, and these classifiers can be used to determine the maximum likelihood probability $P(O \mid W)$. Commonly used classifiers include the Gaussian mixture model (GMM) and deep neural network (DNN) models.
  • for example, the principle of the GMM is to estimate the distribution of the acoustic features of each phoneme; then, in the recognition phase, the probability that each frame's acoustic features were generated by the corresponding phoneme is calculated, and the per-frame probabilities are multiplied together to obtain $P(O \mid W)$.
  • it should be noted that hidden Markov models (HMM) and dynamic time warping (DTW) are also often used in speech recognition to handle the variable length of acoustic feature sequences; combined with the models used when creating the above classifiers, various usable acoustic models are obtained, such as the GMM-HMM model or the CD-DNN-HMM model.
  • when enough acoustic features have been accumulated, the acoustic feature set can be considered relatively complete, i.e., it can cover the acoustic features of almost all speech units uttered by the user. The more complete the acoustic feature set, the more accurate the result of speech recognition.
  • in general, the speech material used in generating and training an acoustic feature set is not restricted; such a set is an acoustic feature set in the general sense and has no ability to distinguish between users.
  • in step S320 of this embodiment, however, after the operator's identity has been determined, the acquired acoustic feature set is one corresponding to that operator; it is created using speech uttered by the operator as the speech material (see the creation process of the operator-specific acoustic feature set below) and therefore has user-distinguishing ability. For example, referring to the scenario in FIG. 1B, after the driver is detected to be P1, the acoustic feature set corresponding to P1 is acquired based on his identity.
  • the set of acoustic features is obtained from a cloud server (such as cloud server 120 in Figure 1A). In another embodiment, the set of acoustic features is obtained from a local memory, such as the database described above.
  • the term "acoustic feature" does not limit its own length or number; an "acoustic feature" may refer to one or more acoustic features, and may also denote an acoustic feature set.
  • a mode selection step may additionally be provided prior to steps S310-S320, in which the mode of the speech recognition device 112 may be selected; the modes include, for example, a manual mode and an automatic mode.
  • once the manual mode is selected, the operations in steps S310-S320 will not be performed, and a generic, non-specific acoustic feature set will be assigned to the operator directly; once the automatic mode is selected, steps S310-S320 will be performed, and subsequent operations will be based on the acoustic feature set corresponding to the operator determined in step S320.
  • in step S330, the operator's voice is recognized from the received sound based on the acquired acoustic feature set.
  • the step S330 is performed by the acoustic recognition unit 240.
  • step S330 specifically includes:
  • acoustic features are extracted from the received sound.
  • This process of extracting acoustic features can be performed using an acoustic model used in establishing an acoustic feature set for the operator.
  • the extracted acoustic features are then matched to the acquired set of acoustic features corresponding to the operator.
  • specifically, in this matching process, the maximum likelihood probability of the extracted acoustic features is first calculated based on the acquired acoustic feature set corresponding to the operator, and it is then determined whether the calculated probability is greater than a first threshold (such as, but not limited to, 80%); when the calculated probability is greater than the first threshold, it may be determined that the extracted acoustic features match the operator's acoustic feature set.
  • the first threshold used here may be a probability threshold indicating that the speaker of the detected speech actually is the operator to whom the acoustic feature set corresponds. When the calculated probability is greater than the first threshold, the speaker can be considered very likely to be the operator; otherwise, the speaker is unlikely to be the operator and may be someone else.
  • the value of the first threshold may be set empirically, experimentally, or in various ways and may be dynamically adjusted.
  • by using the set first threshold, a deterministic criterion for determining whether the speaker is the operator can be given.
  • the principle of this step is that, in the case where the acoustic feature set used is determined, the maximum likelihood probability calculated for the speech of the operator for whom the acoustic feature set was established is higher than the maximum likelihood probability calculated for other people's speech (or for noise). Thus, by setting a suitable first threshold, the operator's voice can be distinguished from other sounds.
  • finally, if a match is determined, the received sound can be identified as the voice of the operator.
  • the speech recognition method 300 can further include the steps of creating an identity for the operator and establishing a corresponding set of acoustic features for the operator. In the voice recognition device 112 shown in FIG. 2, this operation can be performed by the identity creation unit 270.
  • the step of establishing a corresponding set of acoustic features for the operator may include: receiving a voice of the operator; extracting an acoustic feature of the operator from the received voice; and extracting from the received voice An acoustic feature that establishes an acoustic feature set corresponding to the operator. This embodiment corresponds to the case where the identity and acoustic feature sets are created locally at the speech recognition device 112.
  • the step of establishing a corresponding set of acoustic features for the operator may include: receiving speech of the operator; transmitting the received speech to a server; and receiving an acoustic feature set corresponding to the operator from the server.
  • This embodiment corresponds to the case of creating an identity and acoustic feature set on the server.
  • the server may be, for example, the cloud server 120 shown in FIG. 1A.
  • in creating the acoustic feature set corresponding to the operator, the speech material used is speech uttered by the operator himself.
  • in this way, by recognizing the operator, speech recognition can be performed with the acoustic feature set specifically established for that operator whenever the operator operates, so that the influence of noise and of other people's speech on the speech recognition result can be better filtered out.
  • to prevent system storage from being strained by too many identification feature models and acoustic model files deployed locally or on the server as a result of high operator turnover, the system allows N dedicated operator IDs and N1 ordinary operator IDs to be set, the specific numbers depending on the system storage space.
  • the priority of the N1 ordinary operator IDs in the system is decided by a weighting of the time and the number of times the target device is operated. Within a statistics period, if an operator ID remains in a non-operating state, its priority is lowered, i.e., it is erased first. If desired, low-priority operator IDs and their data can be cleared manually.
  • the operator's acoustic feature set may be automatically updated using the acoustic features extracted each time the operator's voice is received.
  • the update can be done manually. In the voice recognition device 112 shown in FIG. 2, this operation can be performed by the update unit 290.
  • an update condition can also be set; for example, in one embodiment, the operator's acoustic feature set is updated with the extracted acoustic features only when the calculated maximum likelihood probability is greater than the first threshold but less than a second threshold (such as, but not limited to, 90%).
  • the second threshold used here may be a threshold indicating that the acoustic feature set needs to be updated; by setting the second threshold higher than the first threshold, it can be ensured as far as possible that only the operator himself can update the operator's acoustic feature set, thereby preventing tampering with the acoustic feature set.
  • in another embodiment, a time-length parameter is additionally set: the operator's acoustic feature set is updated with the extracted acoustic features only when all maximum likelihood probabilities calculated during a time period equal to this parameter are greater than the first threshold but less than the second threshold.
  • in one embodiment, after the operator's voice is recognized, the target device may be directly caused to execute a corresponding instruction according to preset settings. In this case, no further language recognition (i.e., the process from phonemes to text) is required.
  • the speech recognition method 300 may further include: recognizing an operation to be performed from the operator's voice. In the voice recognition device 112 shown in FIG. 2, this operation can be performed by the command recognition unit 280.
  • the phoneme (sequence) corresponding to the voice can be determined while recognizing the voice of the operator.
  • the text corresponding to the determined phoneme can be further estimated according to the language model, and the operation to be performed by the target device is determined according to the text content.
  • the "operator" referred to herein is not limited to a human operator; it may be any operator, such as an electronic device (e.g., an unmanned driving device/program) or any other possible device that can operate the vehicle.

Abstract

A speech recognition method and a speech recognition device. The speech recognition method includes the following operations: obtaining an identification result of an operator (S310); obtaining, based on the operator's identification result, an acoustic feature set corresponding to the operator (S320); and recognizing the operator's voice from received sound based on the acquired acoustic feature set (S330).

Description

Speech recognition method and speech recognition device
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201710466754.X, entitled "Speech recognition method and speech recognition device" and filed on June 20, 2017, which is hereby incorporated by reference in its entirety.
Technical field
The present disclosure relates to the field of speech recognition, and in particular to a speech recognition method and a speech recognition device.
Background
Voice control technology based on speech recognition is being applied more and more widely. However, noise, or sound from persons other than the controlling person, can interfere with speech recognition of the controlling person and with the corresponding voice control operations. Voice control technology is therefore considerably limited in scenarios with high requirements on accuracy and safety. Especially in a small space with several people present or loud noise, it is difficult to identify the controller, which may lead to erroneous operation or danger. For example, in a vehicle driving scenario, because the space inside the vehicle is relatively small, sound from any position may affect the voice control of the in-vehicle system, which can seriously affect driving safety.
Summary
To at least partially solve or mitigate the above problems in the prior art, the present disclosure proposes a speech recognition method and a speech recognition device.
According to one aspect of the present disclosure, a speech recognition method is provided. The speech recognition method includes: obtaining an identification result of an operator; obtaining, based on the operator's identification result, an acoustic feature set corresponding to the operator; and recognizing the operator's voice from received sound based on the acquired acoustic feature set.
In one embodiment, prior to the step of obtaining the operator's identification result, the speech recognition method further includes determining an environmental state. The step of obtaining the operator's identification result further includes: obtaining the operator's identification result according to the environmental state.
In one embodiment, the step of determining the environmental state includes: receiving sensor data from at least one environmental sensor; determining, according to the sensor data, whether an identification function needs to be activated; and returning, according to the determination, an environmental state indicating whether the identification function needs to be activated.
In one embodiment, the speech recognition method further includes: issuing an identification reminder when no identification result is obtained within a preset time period.
In one embodiment, the speech recognition method further includes: creating an identity for the operator, and establishing an acoustic feature set corresponding to the operator.
In one embodiment, establishing the acoustic feature set corresponding to the operator includes: receiving the operator's speech; extracting the operator's acoustic features from the received speech; and establishing the acoustic feature set corresponding to the operator according to the extracted acoustic features.
In one embodiment, establishing the acoustic feature set corresponding to the operator includes: receiving the operator's speech; sending the received speech to a server; and receiving the acoustic feature set corresponding to the operator from the server.
In one embodiment, the step of recognizing the operator's voice from the received sound further includes: extracting acoustic features from the received sound; matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and, if they match, recognizing the received sound as the operator's voice.
In one embodiment, the step of matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator includes: calculating, based on the acquired acoustic feature set corresponding to the operator, the maximum likelihood probability of the extracted acoustic features; and determining, when the calculated probability is greater than a first threshold, that the extracted acoustic features match the operator's acoustic feature set.
In one embodiment, the speech recognition method further includes: updating the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
In one embodiment, the speech recognition method further includes: recognizing, from the operator's voice, an operation to be performed.
According to another aspect of the present disclosure, a speech recognition device is also provided. The speech recognition device includes: a processor; and a memory having instructions stored thereon which, when executed by the processor, cause the processor to: obtain an identification result of an operator; obtain, based on the operator's identification result, an acoustic feature set corresponding to the operator; and recognize the operator's voice from received sound based on the acquired acoustic feature set.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: determine an environmental state; and perform the operation of obtaining the operator's identification result according to the environmental state.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: receive sensor data from at least one environmental sensor; determine, according to the sensor data, whether an identification function needs to be activated; and return, according to the determination, an environmental state indicating whether the identification function needs to be activated.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: issue an identification reminder when no identification result is obtained within a preset time period.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: create an identity for the operator, and establish an acoustic feature set corresponding to the operator.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: receive the operator's speech; extract the operator's acoustic features from the received speech; and establish the acoustic feature set corresponding to the operator according to the extracted acoustic features.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: receive the operator's speech; send the received speech to a server; and receive the acoustic feature set corresponding to the operator from the server.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: extract acoustic features from the received sound; match the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and, if they match, recognize the received sound as the operator's voice.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: calculate, based on the acquired acoustic feature set corresponding to the operator, the maximum likelihood probability of the extracted acoustic features; and determine, when the calculated probability is greater than a first threshold, that the extracted acoustic features match the operator's acoustic feature set.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: update the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
In one embodiment, the instructions, when executed by the processor, further cause the processor to: recognize, from the operator's voice, an operation to be performed.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1A shows a network architecture of a vehicle voice control network according to an embodiment of the present disclosure;
FIG. 1B shows an in-vehicle voice control scenario for a vehicle in the network architecture according to an embodiment of the present disclosure;
FIG. 2 shows a structural block diagram of a speech recognition device according to an embodiment of the present disclosure; and
FIG. 3 shows a flowchart of a speech recognition method according to an embodiment of the present disclosure.
Detailed description
Specific embodiments of the present disclosure are described in detail below. It should be noted that the embodiments described here are for illustration only and are not intended to limit the present disclosure. In the following description, numerous specific details are set forth to provide a thorough understanding of the present disclosure. However, it will be apparent to a person of ordinary skill in the art that these specific details need not be employed to practice the present disclosure. In other instances, well-known circuits, materials, or methods are not described in detail in order to avoid obscuring the present disclosure.
Throughout the specification, references to "one embodiment", "an embodiment", "one example", or "an example" mean that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment of the present disclosure. Thus, the phrases "in one embodiment", "in an embodiment", "one example", or "an example" appearing in various places throughout the specification do not necessarily all refer to the same embodiment or example. Furthermore, particular features, structures, or characteristics may be combined in any suitable combination and/or subcombination in one or more embodiments or examples. In addition, a person of ordinary skill in the art should understand that the drawings provided here are for illustrative purposes and are not necessarily drawn to scale. The term "and/or" used here includes any and all combinations of one or more of the associated listed items.
It should be understood that the speech recognition device and speech recognition method proposed by the present disclosure can be applied to various scenarios capable of speech recognition, such as household appliance control, industrial machinery operation, vehicle driving, and so on; the present disclosure is not limited in this respect. The speech recognition device and speech recognition method of the present disclosure are particularly suitable for application scenarios in which a specific operator is required to operate a target device; in such cases, applying the speech recognition device and speech recognition method proposed by the present disclosure to the target device can improve the accuracy of the operator's operation of the target device and increase the security of the target device.
In the following detailed description of the present disclosure, for ease of understanding, the embodiments of the present disclosure are described using a vehicle driving scenario as an example. It should be understood, however, that the technical solutions of the present disclosure are equally applicable to the other scenarios mentioned above.
First, a vehicle driving scenario that can be used to implement the speech recognition device and speech recognition method of the present disclosure is described with reference to FIGS. 1A and 1B.
FIG. 1A shows the network architecture of a vehicle voice control network 100, and FIG. 1B shows an in-vehicle voice control scenario for a single vehicle 110A in that network architecture.
In FIG. 1A, the vehicle voice control network 100 includes vehicles 110A and 110B and a cloud server 120. The vehicles 110A and 110B each communicate with the cloud server 120 via wireless communication. It should be understood that although only two vehicles 110A and 110B are shown in FIG. 1A, in other embodiments the network 100 may include more or fewer vehicles; the present disclosure is not limited in this respect.
The cloud server 120 may be a local or remote server implemented with any server configuration capable of receiving and transmitting, computing, storing, and training data from the vehicles. The wireless communication between the cloud server 120 and the vehicles can be implemented in various ways, such as cellular communication (e.g., 2G, 3G, 4G, or 5G mobile communication technology), WiFi, or satellite communication. Although the cloud server 120 and the vehicles 110A and 110B are shown in FIG. 1A as communicating directly, it should be understood that in other embodiments of the present disclosure the communication between them may be indirect.
As described above, the vehicles 110A and 110B communicate with the cloud server 120 via wireless communication so as to implement voice control of the vehicle or the in-vehicle system using data obtained from the cloud server 120. Specifically, FIG. 1B shows the in-vehicle voice control scenario of the vehicle 110A as an example.
In FIG. 1B, a voice control device 112 is arranged in the vehicle 110A, and an identification device 114 is also arranged by way of example. In other embodiments, the identification device 114 may be implemented as part of the speech recognition device 112, for example as an identification unit integrated in the speech recognition device 112.
The speech recognition device 112 can collect and process sound and control vehicle operations based on the processing results. In one embodiment, the speech recognition device 112 includes a sound input unit and a processor. The sound input unit may be, for example, a microphone, which receives sound from the outside and converts it into an electrical signal. The processor processes the generated electrical signal and instructs the vehicle to operate according to the processing result. In some embodiments, however, the speech recognition device 112 may not include a sound input unit and may instead receive the required sound-related signals from an external electronic device (for example, an external dedicated microphone, or a sound collection device arranged in the vehicle separately from the speech recognition device 112).
In one embodiment, the speech recognition device 112 may further include a database. The database can store data related to the driver's identity and voice. For example, the database may include the data the processor needs to process sound signals, such as acoustic model parameters and acoustic feature sets. As another example, the database may include data related to the driver's identity, such as a driver ID, driver preference data, and the driver's facial features.
The identification device 114 is used to identify the driver. Although the identification device 114 is shown in FIG. 1B as a camera for face recognition, it should be understood that in other embodiments of the present disclosure the identification device 114 may be implemented as another device for iris recognition, fingerprint recognition, password recognition, or login information recognition, such as a fingerprint reader or a keyboard.
In one embodiment, various sensors may also be arranged in the vehicle 110A, such as a door sensor, a driving position sensor, and a driving state sensor, for sensing whether the driver is approaching or entering the vehicle. In one embodiment, the speech recognition device 112 and/or the identification device 114 are activated only when the sensors sense that the driver is approaching or entering the vehicle, so as to reduce power consumption. In another embodiment, vehicle operation information may also be obtained from the vehicle system bus, and whether to activate the speech recognition device 112 and/or the identification device 114 may be decided according to the vehicle operation information.
Specifically, among these sensors, the driving position sensor can use pedestrian tracking technology. The driving position sensor may be a camera mounted on the rear-view mirror for capturing an image of the driving position; a human-body-shape classifier from pedestrian tracking technology recognizes the human figure at the driving position in the image.
The driving state sensor may be a switch-type Hall sensor mounted on the axle that detects vehicle speed. For example, a piece of magnetic steel can be glued to the edge of a non-magnetic disc on the wheel, with the Hall sensor placed near the edge of the disc; each revolution of the disc causes the Hall sensor to output one pulse, so that the number of revolutions can be measured. When a vehicle speed is sensed, the vehicle can be judged to be in a driving state; when no vehicle speed is sensed, the vehicle is judged to be stopped. A minimal sketch of this pulse-counting computation is given below.
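The following Python sketch is illustrative only: `read_pulse_count`, the wheel circumference, and the sampling window are assumptions made for the example, not part of the disclosure.

```python
WHEEL_CIRCUMFERENCE_M = 1.95   # assumed tire circumference in meters
SAMPLE_WINDOW_S = 1.0          # assumed pulse-counting window in seconds

def vehicle_speed_kmh(read_pulse_count) -> float:
    """Estimate speed from a switch-type Hall sensor.

    One magnet on the disc yields one pulse per wheel revolution, so
    pulses per window times circumference gives the distance traveled.
    `read_pulse_count` is a hypothetical callable returning the pulses
    counted since the last call.
    """
    pulses = read_pulse_count()                 # revolutions in the window
    distance_m = pulses * WHEEL_CIRCUMFERENCE_M
    return distance_m / SAMPLE_WINDOW_S * 3.6   # m/s -> km/h

def is_driving(read_pulse_count, threshold_kmh: float = 1.0) -> bool:
    """Judge the driving vs. stopped state described above."""
    return vehicle_speed_kmh(read_pulse_count) > threshold_kmh
```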
The door sensor may be a Hall sensor mounted on the door, which judges whether the door is open or closed from its state relative to a magnet on the door frame. For example, when the magnet is close to the Hall sensor, a specific level is output, and the door can be judged to be closed; otherwise, the door can be judged to be open.
FIG. 1B also schematically shows a driver P1 in the driving position and a passenger P2 in the front passenger position.
It should be understood that the speech recognition device 112 shown in FIG. 1B can be implemented not only in the form of a sound input unit and a single processor but also in the form of multiple processing modules. For example, FIG. 2 shows a structural block diagram of the speech recognition device 112 according to an embodiment of the present disclosure. As shown in FIG. 2, the speech recognition device 112 may include a sound input unit 210, an identification result acquisition unit 220, an acoustic feature set acquisition unit 230, an acoustic recognition unit 240, an environmental state determination unit 250, an identification reminder unit 260, an identity creation unit 270, an instruction recognition unit 280, and an update unit 290. The environmental state determination unit 250, identification reminder unit 260, identity creation unit 270, instruction recognition unit 280, and update unit 290 are drawn with dashed boxes; these units are not required in the embodiments of the present disclosure. In other embodiments, one or more of these units may be omitted or merged, or other processing modules may be added according to the processing performed.
Specifically, the sound input unit 210 may be, for example, a microphone, which receives sound from the outside and converts it into an electrical signal. The identification result acquisition unit 220 can be used to obtain the identification result of the operator (i.e., the driver P1 in the scenario shown in FIG. 1B). The acoustic feature set acquisition unit 230 obtains, based on the operator's identification result, an acoustic feature set corresponding to the operator. The acoustic recognition unit 240 recognizes the operator's voice from the received sound based on the acquired acoustic feature set. The environmental state determination unit 250 determines the environmental state. The identification reminder unit 260 issues an identification reminder when no identification result is obtained within a preset time period. The identity creation unit 270 creates an identity for the operator and establishes an acoustic feature set corresponding to the operator. The instruction recognition unit 280 recognizes, from the operator's voice, an operation to be performed. The update unit 290 updates the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than a first threshold but less than a second threshold.
In addition, as a hardware implementation of the speech recognition device 112, in some embodiments it may include, for example, a processor and a memory. Instructions may be stored on the memory which, when executed by the processor, cause the processor to perform the method steps described below in connection with FIG. 3. The instructions, when executed by the processor, may also cause the processor to function as the individual functional units 210-290 described above. For example, where the speech recognition device 112 includes hardware for sound input (e.g., a microphone), the sound input unit 210 may be that hardware itself; where the speech recognition device 112 does not include hardware for sound input, the sound input unit 210 may be a processor-implemented functional unit configured to receive sound-input-related signals from external hardware.
FIG. 3 shows a flowchart of a speech recognition method 300 according to an embodiment of the present disclosure that can be used together with the speech recognition device 112 shown in FIG. 2.
As shown in FIG. 3, the method 300 starts at step S310, in which the identification result of the operator is obtained. Then, in step S320, an acoustic feature set corresponding to the operator is obtained based on the operator's identification result. Next, in step S330, the operator's voice is recognized from the received sound based on the acquired acoustic feature set.
Embodiments of the present disclosure are described in detail below with reference to FIGS. 2 and 3.
First, in step S310, the identification result of the operator is obtained.
In one embodiment, the identification may be performed by, for example, the external identification device 114 shown in FIG. 1B. For instance, the identification result acquisition unit 220 in the speech recognition device 112 can obtain the identification result from the identification device 114. In this case, step S310 may simply perform the operation of receiving the identification result from, for example, the identification device 114.
In another embodiment, the identification result acquisition unit 220 in the speech recognition device 112 itself identifies the operator. In this case, in step S310, the obtaining step is equivalent to the step of performing identification.
As mentioned above, identification can be implemented by face recognition, iris recognition, fingerprint recognition, password recognition, login information recognition, and the like.
In one embodiment, before step S310, the speech recognition method 300 may further include a step of determining the environmental state. In step S310, whether to obtain the identification result is then decided according to the environmental state. For example, the operation in step S310 may be performed only when the environmental state satisfies a predetermined condition (for example, a person approaches, or the driving position sensor senses pressure).
For example, in the speech recognition device 112 of FIG. 2, step S310 can be performed by way of the environmental state determination unit 250. Specifically, the environmental state can be judged from information from one or a combination of the following (taking the scenarios shown in FIGS. 1A and 1B as an example): the door sensor, the driving position sensor, the driving state sensor, and the vehicle system bus. For example, when the owner comes near the vehicle, opens the door, or starts the car, the identification function (hardware and/or software modules) can be activated according to the information detected by the sensors installed on the vehicle. More specifically, when the owner first comes near the vehicle, the identification function (hardware and/or software modules) can be activated through, for example, a Bluetooth transceiver and/or RFID reader installed on the vehicle communicating with the Bluetooth/RFID module on the owner's key. As another example, when the owner opens the door, the open state of the door and/or the insertion state of the car key and/or a door-opening command can be detected by the door sensor, Bluetooth transceiver, and/or RFID reader installed on the vehicle, and the identification function (hardware and/or software modules) can be activated accordingly. As a further example, when the owner sits in the driving position and starts the vehicle by key insertion, fingerprint, or voice, the corresponding sensors (Bluetooth, RFID, driving position sensor, fingerprint sensor, microphone, etc.) can detect the corresponding information and activate the identification function (hardware and/or software modules). As described above, by judging the environmental state, the step of obtaining the identification result can be activated only when the driver (i.e., the operator) approaches or enters the vehicle (at which point the door sensor, driving position sensor, driving state sensor, and vehicle system bus have specific states or values), so that power consumption can be effectively reduced. A minimal sketch of such an activation policy follows.
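The sensor fields and the activation rule in the following Python sketch are assumptions chosen for illustration; the description leaves the exact combination of sensors open.

```python
from dataclasses import dataclass

@dataclass
class SensorData:
    """Hypothetical snapshot of the environmental sensors named above."""
    door_open: bool        # door sensor
    seat_occupied: bool    # driving position sensor
    vehicle_moving: bool   # driving state (Hall) sensor
    key_detected: bool     # Bluetooth/RFID reader near the owner's key

def needs_identification(s: SensorData) -> bool:
    """Return True when the identification function should be activated.

    One plausible reading of the description: activate while the driver
    is approaching or entering the vehicle and the car is not yet moving.
    """
    approaching = s.key_detected or s.door_open or s.seat_occupied
    return approaching and not s.vehicle_moving
```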
In one embodiment, when it has been determined according to the environmental state that the identification result is to be obtained, if no identification result is obtained within a preset time period (for example, a period of a certain length (e.g., 10 seconds) or the period before the car is started), an identification reminder can be issued to the operator. The reminder may be, for example, an alarm, a flashing light, or a vibration. In the speech recognition device 112 shown in FIG. 2, this operation can be performed by the identification reminder unit 260.
For example, taking face recognition as an example: if the car is stopped, the door is closed, and/or a person has been tracked at the driving position, the speech recognition device 112 can determine that identification should be performed. At this point, the identification device 114 should already have started trying to extract the driver's facial features. If the driver's facial features are not detected within the predetermined time period, i.e., the speech recognition device 112 has been unable to obtain an identification result, a voice reminder is given to the driver and detection continues until the driver's facial features are acquired. This polling-with-reminder loop is sketched below.
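The following sketch assumes hypothetical `get_result` and `remind` callables (the former returns an identity or None, the latter plays the voice prompt); neither name comes from the disclosure.

```python
import time

def await_identification(get_result, remind, timeout_s: float = 10.0,
                         poll_s: float = 0.5):
    """Poll for an identification result; remind the driver on timeout.

    Detection continues until a result is obtained, as described above;
    the reminder is re-issued each time the timeout elapses.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = get_result()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            remind()                                   # e.g. voice prompt
            deadline = time.monotonic() + timeout_s    # re-arm the timer
        time.sleep(poll_s)
```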
Next, in step S320, an acoustic feature set corresponding to the operator can be obtained based on the operator's identification result. In the speech recognition device 112 shown in FIG. 2, step S320 can be performed by the acoustic feature set acquisition unit 230.
An acoustic feature set is a collection of acoustic features. Acoustic features are an important concept in the field of speech recognition; the relevant background is briefly described below.
In the field of speech recognition, a specific speech input sequence $O = \{o_1, o_2, \ldots, o_n\}$ (where each $o_i$ is a specific speech unit, such as a frame or a state) needs to be recognized as a specific word sequence $W = \{w_1, w_2, \ldots, w_n\}$. In this setting, $O$ is the actual observation, and $W$ is the actual word sequence corresponding to that observation. The process is generally expressed in terms of probabilities; that is, speech recognition in fact solves the following problem:
$$W^{*} = \mathop{\arg\max}_{W \in L} P(W \mid O)$$
where $P(W \mid O)$ denotes the conditional probability that $W$ (the corresponding actual word sequence) occurs given the event $O$ (the actual observation). The formula thus selects the $W$ in $L$ that maximizes the conditional probability function $P(W \mid O)$, where $L$ is the range of all possible values of $W$.
By Bayes' rule,
$$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}$$
Since the above is computed for a single sentence, and $P(O)$ is constant for a single sentence, the formula can be rewritten as
$$W^{*} = \mathop{\arg\max}_{W \in L} P(O \mid W)\,P(W)$$
where the $P(O \mid W)$ part is called the maximum likelihood probability and can be computed by the acoustic model; the $P(W)$ part is called the prior probability and can be computed by the language model. This embodiment mainly involves the $P(O \mid W)$ part, i.e., the acoustic model.
Specifically, the acoustic model concerns how to compute the degree to which a phoneme matches a segment of speech signal. A suitable way of representing the speech signal is therefore needed. Generally, the speech signal is divided into speech units, for example into many frames. Each frame is converted into an acoustic feature by the acoustic model used (via a series of operations such as the Fourier transform). Examples of acoustic features include linear prediction coefficients, cepstral coefficients, Mel-frequency cepstral coefficients, and perceptual linear prediction coefficients. A minimal feature-extraction sketch follows.
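The following sketch computes Mel-frequency cepstral coefficients (one of the feature types named above) with the librosa library; the 16 kHz sampling rate and the coefficient count are conventional assumptions, not values fixed by the disclosure.

```python
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    """Frame a recording and convert each frame into an acoustic feature.

    librosa performs the framing and the Fourier-transform-based
    processing internally; the result is one MFCC vector per frame,
    shaped (n_frames, n_mfcc).
    """
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz is typical for ASR
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T
```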
By accumulating speech material, a large number of acoustic features can be extracted from it, and the correspondence between these acoustic features and phonemes can be obtained; acoustic features having such a correspondence with phonemes constitute an acoustic feature set. From another perspective, with an acoustic feature set, classifiers from acoustic features to phonemes can be trained, and these classifiers can be used to determine the maximum likelihood probability $P(O \mid W)$. Commonly used classifiers include the Gaussian mixture model (GMM) and deep neural network (DNN) models. For example, the principle of the GMM is to estimate the distribution of the acoustic features of each phoneme; then, in the recognition phase, the probability that each frame's acoustic features were generated by the corresponding phoneme is computed, and the per-frame probabilities are multiplied together to obtain $P(O \mid W)$.
It should be noted that, in the field of speech recognition, the hidden Markov model (HMM) and dynamic time warping (DTW) are also often used to handle the variable length of acoustic feature sequences; combined with the models used when creating the above classifiers, various usable acoustic models are obtained, such as the GMM-HMM model or the CD-DNN-HMM model. A GMM-based sketch is given below.
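For brevity, the following scikit-learn sketch fits a single speaker-level GMM to the operator's enrollment frames instead of the per-phoneme models described above; summing per-frame log-likelihoods corresponds to multiplying per-frame probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_operator_gmm(frames: np.ndarray, n_components: int = 16) -> GaussianMixture:
    """Estimate the distribution of an operator's acoustic features.

    `frames` is an (n_frames, n_features) array of enrollment features,
    e.g. the MFCCs from the previous sketch.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(frames)
    return gmm

def avg_log_likelihood(gmm: GaussianMixture, frames: np.ndarray) -> float:
    """Average per-frame log-likelihood of new speech under the model;
    averaging keeps scores comparable across utterance lengths."""
    return float(np.mean(gmm.score_samples(frames)))
```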
When enough acoustic features have been accumulated in the acoustic feature set, the set can be considered relatively complete; that is, it can cover the acoustic features of almost all speech units uttered by the user. The more complete the acoustic feature set, the more accurate the speech recognition result.
In fact, in some simple or specific scenarios, it is not necessary to recognize arbitrary speech signals but only a number of specific voice commands. In that case, the completeness of the acoustic feature set has little effect on the technical solution: as long as acoustic features corresponding to certain specific phonemes have been accumulated during training or generation of the acoustic feature set, fairly accurate speech recognition can be achieved. For example, in a driving scenario, training on sentences commonly used while driving is enough to obtain an acoustic feature set that meets the specific requirements.
In general, the speech material used in generating and training an acoustic feature set is not restricted; such a set is an acoustic feature set in the general sense and has no ability to distinguish between users. In step S320 of this embodiment, however, after the operator's identity has been determined, the acquired acoustic feature set is one corresponding to that operator. This acoustic feature set is created using speech uttered by the operator as the speech material (see the creation process of the operator-specific acoustic feature set below) and therefore has user-distinguishing ability. For example, referring to the scenario in FIG. 1B, after the driver is detected to be P1, the acoustic feature set corresponding to P1 is acquired based on his identity.
In one embodiment, the acoustic feature set is obtained from a cloud server (such as the cloud server 120 in FIG. 1A). In another embodiment, the acoustic feature set is obtained from local storage, such as the database described above.
It should also be pointed out that, as used herein, the term "acoustic feature" does not limit its own length or number. In the embodiments described above, an "acoustic feature" may refer to one or more acoustic features, and may also denote an acoustic feature set.
It should be understood that a mode selection step may additionally be provided before steps S310-S320, in which the mode of the speech recognition device 112 can be selected; the modes include, for example, a manual mode and an automatic mode. In some embodiments, once the manual mode is selected, the operations in steps S310-S320 are not performed, and a generic, non-specific acoustic feature set is assigned to the operator directly. Once the automatic mode is selected, steps S310-S320 are performed, and subsequent operations are based on the acoustic feature set corresponding to the operator determined in step S320.
Next, in step S330, the operator's voice is recognized from the received sound based on the acquired acoustic feature set. In the speech recognition device 112 shown in FIG. 2, step S330 is performed by the acoustic recognition unit 240.
In one embodiment, step S330 specifically includes the following.
First, acoustic features are extracted from the received sound. This extraction can be performed with the acoustic model that was used when establishing the acoustic feature set for the operator.
Then, the extracted acoustic features are matched against the acquired acoustic feature set corresponding to the operator.
Specifically, in this matching process, the maximum likelihood probability of the extracted acoustic features is first calculated based on the acquired acoustic feature set corresponding to the operator, and it is then judged whether the calculated probability is greater than a first threshold (such as, but not limited to, 80%); when the calculated probability is greater than the first threshold, it can be determined that the extracted acoustic features match the operator's acoustic feature set. The first threshold used here may be a probability threshold indicating that the speaker of the detected speech actually is the operator to whom the acoustic feature set corresponds. When the calculated probability is greater than the first threshold, the speaker can be considered very likely to be the operator; otherwise, the speaker is unlikely to be the operator and may be someone else. The value of the first threshold can be set empirically, experimentally, or in various other ways, and can be adjusted dynamically. By using the set first threshold, a deterministic criterion for judging whether the speaker is the operator is obtained. The principle of this step is that, for a given acoustic feature set, the maximum likelihood probability computed for the speech of the operator for whom the set was established is higher than the maximum likelihood probability computed for other people's speech (or for noise). By setting a suitable first threshold, the operator's voice can thus be distinguished from other sounds. A minimal sketch of this decision follows.
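The following sketch assumes the raw model score has already been calibrated into a probability in [0, 1]; how that calibration is done is left open by the description, and `score_fn` and `calibrate_fn` are hypothetical helpers.

```python
def matches_operator(probability: float, first_threshold: float = 0.80) -> bool:
    """Deterministic criterion of step S330: the extracted features match
    the operator's acoustic feature set when the computed probability
    exceeds the first threshold (0.80 echoes the non-limiting example)."""
    return probability > first_threshold

def identify_operator_voice(frames, operator_model, score_fn, calibrate_fn) -> bool:
    """End-to-end match: score the frames against the operator's model,
    calibrate the score to a probability, then apply the threshold."""
    probability = calibrate_fn(score_fn(operator_model, frames))
    return matches_operator(probability)
```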
Finally, if a match is determined, the received sound can be recognized as the operator's voice.
In one embodiment, the following situation may arise: although identification features are detected, the operator's identity cannot be determined from the detected features, for example because no identity record exists for that operator. In this case, the speech recognition method 300 may further include the following steps: creating an identity for the operator, and establishing an acoustic feature set corresponding to the operator. In the speech recognition device 112 shown in FIG. 2, this operation can be performed by the identity creation unit 270.
In one embodiment, the step of establishing an acoustic feature set corresponding to the operator may include: receiving the operator's speech; extracting the operator's acoustic features from the received speech; and establishing the acoustic feature set corresponding to the operator according to the extracted acoustic features. This embodiment corresponds to the case where the identity and the acoustic feature set are created locally at the speech recognition device 112.
In another embodiment, the step of establishing an acoustic feature set corresponding to the operator may include: receiving the operator's speech; sending the received speech to a server; and receiving the acoustic feature set corresponding to the operator from the server. This embodiment corresponds to the case where the identity and the acoustic feature set are created on a server, which may be, for example, the cloud server 120 shown in FIG. 1A.
In the present disclosure, in the process of creating the acoustic feature set corresponding to the operator, the speech material used is speech uttered by the operator himself. In this way, by recognizing the operator, speech recognition can be performed with the acoustic feature set specifically established for that operator whenever he operates the device, so that the influence of noise and of other people's speech on the speech recognition result can be better filtered out.
In one embodiment, to prevent system storage from being strained by too many identification feature models and acoustic model files deployed locally or on the server as a result of high operator turnover, the system allows N dedicated operator IDs and N1 ordinary operator IDs to be set, the specific numbers depending on the system storage space. The priority of the N1 ordinary operator IDs in the system is decided by a weighting of the time and the number of times the target device is operated. Within a statistics period, if an operator ID remains in a non-operating state, its priority is lowered, i.e., it is erased first. If desired, low-priority operator IDs and their data can be cleared manually. A minimal sketch of such a priority scheme follows.
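In the following sketch, the specific weighting (a per-operation bump with decay at the end of each statistics period) is an illustrative assumption, not mandated by the disclosure.

```python
class OperatorRegistry:
    """Hold N dedicated operator IDs plus at most `max_ordinary` ordinary
    IDs, erasing the lowest-priority ordinary ID first when full."""

    def __init__(self, max_ordinary: int):
        self.max_ordinary = max_ordinary
        self.dedicated = set()        # never evicted
        self.priority = {}            # ordinary id -> usage weight

    def record_operation(self, op_id: str) -> None:
        """Bump an ID's weight, evicting the lowest-weight ID if needed."""
        if op_id in self.dedicated:
            return
        if op_id not in self.priority and len(self.priority) >= self.max_ordinary:
            victim = min(self.priority, key=self.priority.get)
            del self.priority[victim]          # lowest priority erased first
        self.priority[op_id] = self.priority.get(op_id, 0.0) + 1.0

    def end_statistics_period(self) -> None:
        """Decay all weights so IDs unused in a period sink in priority."""
        for op_id in self.priority:
            self.priority[op_id] *= 0.5
```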
In one embodiment, the operator's acoustic feature set can be updated automatically with the acoustic features extracted each time the operator's speech is received. In another embodiment, the update can be performed manually. In the speech recognition device 112 shown in FIG. 2, this operation can be performed by the update unit 290.
Of course, an update condition can also be set. For example, in one embodiment, the operator's acoustic feature set is updated with the extracted acoustic features only when the calculated maximum likelihood probability is greater than the above first threshold but less than a second threshold (such as, but not limited to, 90%). The second threshold used here may be a threshold indicating that the acoustic feature set needs to be updated. By setting the second threshold higher than the first threshold, it can be ensured as far as possible that only the operator himself can update the operator's acoustic feature set, thereby preventing tampering with the acoustic feature set.
In another embodiment, a time-length parameter is additionally set: the operator's acoustic feature set is updated with the extracted acoustic features only when all maximum likelihood probabilities calculated during a time period equal to this parameter are greater than the above first threshold but less than the second threshold. A minimal sketch of this update rule follows.
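In the following sketch, the window length stands in for the time-length parameter, and the 0.80/0.90 values echo the non-limiting examples above.

```python
from collections import deque

class FeatureSetUpdater:
    """Allow an update only when every probability observed over the
    window lies strictly between the first and second thresholds."""

    def __init__(self, first: float = 0.80, second: float = 0.90, window: int = 5):
        self.first = first
        self.second = second
        self.recent = deque(maxlen=window)   # stands in for the time length

    def should_update(self, probability: float) -> bool:
        self.recent.append(probability)
        return (len(self.recent) == self.recent.maxlen and
                all(self.first < p < self.second for p in self.recent))
```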
In one embodiment, after the operator's voice has been recognized in step S330, the target device can be made to execute a corresponding instruction directly according to preset settings. In this case, no further language recognition (i.e., the process from phonemes to text) is required.
In another embodiment, after step S330, the speech recognition method 300 may further include: recognizing, from the operator's voice, an operation to be performed. In the speech recognition device 112 shown in FIG. 2, this operation can be performed by the instruction recognition unit 280.
In step S330 above, according to the description of the acoustic model, the phoneme (sequence) corresponding to the speech can be determined at the same time as the operator's voice is recognized. Here, the text corresponding to the determined phonemes can further be estimated according to the language model, and the operation to be performed by the target device can be determined from the text content. A minimal sketch of such a text-to-operation mapping follows.
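In the following sketch, the command table is purely hypothetical, standing in for the language-model-based estimation described above.

```python
COMMANDS = {
    # illustrative phrase -> operation table; a real system would use the
    # language-model output described above
    "turn on the air conditioner": "hvac_on",
    "open the window": "window_open",
    "navigate home": "navigate_home",
}

def operation_from_text(decoded_text: str):
    """Map the text estimated by the language model to an operation.

    Unknown text yields None, i.e., no operation is performed.
    """
    text = decoded_text.lower()
    for phrase, operation in COMMANDS.items():
        if phrase in text:
            return operation
    return None
```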
Furthermore, the "operator" referred to herein is not limited to a human operator; it may be any operator, for example an electronic device (e.g., an autonomous driving device/program) or any other possible device that can operate the vehicle.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (22)

  1. A speech recognition method based on operator identity, comprising:
    obtaining an identification result of an operator;
    obtaining, based on the operator's identification result, an acoustic feature set corresponding to the operator; and
    recognizing the operator's voice from received sound based on the acquired acoustic feature set.
  2. The speech recognition method according to claim 1, wherein, before the step of obtaining the operator's identification result, the speech recognition method further comprises:
    determining an environmental state; and
    wherein the step of obtaining the operator's identification result further comprises:
    obtaining the operator's identification result according to the environmental state.
  3. The speech recognition method according to claim 2, wherein the step of determining the environmental state comprises:
    receiving sensor data from at least one environmental sensor;
    determining, according to the sensor data, whether an identification function needs to be activated; and
    returning, according to the determination, an environmental state indicating whether the identification function needs to be activated.
  4. The speech recognition method according to claim 2, further comprising:
    issuing an identification reminder when the identification result is not obtained within a preset time period.
  5. The speech recognition method according to claim 1, further comprising:
    creating an identity for the operator, and establishing an acoustic feature set corresponding to the operator.
  6. The speech recognition method according to claim 5, wherein establishing the acoustic feature set corresponding to the operator comprises:
    receiving speech of the operator;
    extracting acoustic features of the operator from the received speech; and
    establishing the acoustic feature set corresponding to the operator according to the extracted acoustic features.
  7. The speech recognition method according to claim 5, wherein establishing the acoustic feature set corresponding to the operator comprises:
    receiving speech of the operator;
    sending the received speech to a server; and
    receiving the acoustic feature set corresponding to the operator from the server.
  8. The speech recognition method according to claim 1, wherein the step of recognizing the operator's voice from the received sound further comprises:
    extracting acoustic features from the received sound;
    matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and
    if they match, recognizing the received sound as the operator's voice.
  9. The speech recognition method according to claim 8, wherein the step of matching the extracted acoustic features against the acquired acoustic feature set corresponding to the operator comprises:
    calculating, based on the acquired acoustic feature set corresponding to the operator, a maximum likelihood probability of the extracted acoustic features; and
    when the calculated probability is greater than a first threshold, determining that the extracted acoustic features match the operator's acoustic feature set, wherein the first threshold is a probability threshold indicating whether the operator corresponding to the extracted acoustic features is said operator.
  10. The speech recognition method according to claim 9, further comprising:
    updating the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
  11. The speech recognition method according to claim 1, further comprising:
    recognizing, from the operator's voice, an operation to be performed.
  12. A speech recognition device based on operator identity, comprising:
    a processor; and
    a memory having instructions stored thereon which, when executed by the processor, cause the processor to:
    obtain an identification result of an operator;
    obtain, based on the operator's identification result, an acoustic feature set corresponding to the operator; and
    recognize the operator's voice from received sound based on the acquired acoustic feature set.
  13. The speech recognition device according to claim 12, wherein the instructions, when executed by the processor, further cause the processor to:
    determine an environmental state; and
    perform the operation of obtaining the operator's identification result according to the environmental state.
  14. The speech recognition device according to claim 13, wherein the instructions, when executed by the processor, further cause the processor to:
    receive sensor data from at least one environmental sensor;
    determine, according to the sensor data, whether an identification function needs to be activated; and
    return, according to the determination, an environmental state indicating whether the identification function needs to be activated.
  15. The speech recognition device according to claim 13, wherein the instructions, when executed by the processor, further cause the processor to:
    issue an identification reminder when the identification result is not obtained within a preset time period.
  16. The speech recognition device according to claim 12, wherein the instructions, when executed by the processor, further cause the processor to:
    create an identity for the operator, and establish an acoustic feature set corresponding to the operator.
  17. The speech recognition device according to claim 16, wherein the instructions, when executed by the processor, further cause the processor to:
    receive speech of the operator;
    extract acoustic features of the operator from the received speech; and
    establish the acoustic feature set corresponding to the operator according to the extracted acoustic features.
  18. The speech recognition device according to claim 16, wherein the instructions, when executed by the processor, further cause the processor to:
    receive speech of the operator;
    send the received speech to a server; and
    receive the acoustic feature set corresponding to the operator from the server.
  19. The speech recognition device according to claim 12, wherein the instructions, when executed by the processor, further cause the processor to:
    extract acoustic features from the received sound;
    match the extracted acoustic features against the acquired acoustic feature set corresponding to the operator; and
    if they match, recognize the received sound as the operator's voice.
  20. The speech recognition device according to claim 19, wherein the instructions, when executed by the processor, further cause the processor to:
    calculate, based on the acquired acoustic feature set corresponding to the operator, a maximum likelihood probability of the extracted acoustic features; and
    when the calculated probability is greater than a first threshold, determine that the extracted acoustic features match the operator's acoustic feature set.
  21. The speech recognition device according to claim 20, wherein the instructions, when executed by the processor, further cause the processor to:
    update the operator's acoustic feature set with the extracted acoustic features when the calculated probability is greater than the first threshold but less than a second threshold.
  22. The speech recognition device according to claim 12, wherein the instructions, when executed by the processor, further cause the processor to:
    recognize, from the operator's voice, an operation to be performed.
PCT/CN2018/076031 2017-06-20 2018-02-09 Speech recognition method and speech recognition device WO2018233300A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/327,319 US11355124B2 (en) 2017-06-20 2018-02-09 Voice recognition method and voice recognition apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710466754.XA 2017-06-20 2017-06-20 Speech recognition method and speech recognition device
CN201710466754.X 2017-06-20

Publications (1)

Publication Number Publication Date
WO2018233300A1 (zh)

Family

ID=64737440

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076031 WO2018233300A1 (zh) 2017-06-20 2018-02-09 语音识别方法和语音识别装置

Country Status (3)

Country Link
US (1) US11355124B2 (zh)
CN (1) CN109102801A (zh)
WO (1) WO2018233300A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010461B2 (en) * 2017-12-22 2021-05-18 Vmware, Inc. Generating sensor-based identifier
CN110473540B (zh) * 2019-08-29 2022-05-31 京东方科技集团股份有限公司 Voice interaction method and system, terminal device, computer device, and medium
CN112017658A (zh) 2020-08-28 2020-12-01 北京计算机技术及应用研究所 Operation control system based on intelligent human-machine interaction
CN112878854A (zh) 2021-01-29 2021-06-01 中国第一汽车股份有限公司 System and method for automatically opening a trunk lid based on face recognition and voice recognition
CN112509587B (zh) * 2021-02-03 2021-04-30 南京大正智能科技有限公司 Method, device, and equipment for dynamic matching of mobile numbers with voiceprints and index construction

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT1257073B (it) * 1992-08-11 1996-01-05 Ist Trentino Di Cultura Recognition system, particularly for recognizing persons
US5897616A (en) * 1997-06-11 1999-04-27 International Business Machines Corporation Apparatus and methods for speaker verification/identification/classification employing non-acoustic and/or acoustic models and databases
US7158871B1 (en) * 1998-05-07 2007-01-02 Art - Advanced Recognition Technologies Ltd. Handwritten and voice control of vehicle components
DE10163814A1 (de) * 2001-12-22 2003-07-03 Philips Intellectual Property Verfahren und Einrichtung zur Nutzeridentifizierung
US7219062B2 (en) * 2002-01-30 2007-05-15 Koninklijke Philips Electronics N.V. Speech activity detection using acoustic and facial characteristics in an automatic speech recognition system
JP2005122128A (ja) * 2003-09-25 2005-05-12 Fuji Photo Film Co Ltd Speech recognition system and program
US9263034B1 (en) * 2010-07-13 2016-02-16 Google Inc. Adapting enhanced acoustic models
US9361885B2 (en) * 2013-03-12 2016-06-07 Nuance Communications, Inc. Methods and apparatus for detecting a voice command
US9336781B2 (en) * 2013-10-17 2016-05-10 Sri International Content-aware speaker recognition
CN104143326B (zh) * 2013-12-03 2016-11-02 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN103730120A (zh) * 2013-12-27 2014-04-16 深圳市亚略特生物识别科技有限公司 Voice control method and system for electronic devices
CN103903613A (zh) * 2014-03-10 2014-07-02 联想(北京)有限公司 Information processing method and electronic device
WO2015155875A1 (ja) * 2014-04-10 2015-10-15 三菱電機株式会社 Portable device, vehicle remote operation system, vehicle remote operation method, and program
CN104217152A (zh) * 2014-09-23 2014-12-17 陈包容 Method and device for a mobile terminal to enter an application in standby state
KR20160045353A (ko) * 2014-10-17 2016-04-27 현대자동차주식회사 AVN device, vehicle, and control method of the AVN device
KR101610151B1 (ko) 2014-10-17 2016-04-08 현대자동차 주식회사 Speech recognition device and method using a personal acoustic model
CN104881117B (zh) * 2015-05-22 2018-03-27 广东好帮手电子科技股份有限公司 Device and method for activating a voice control module through gesture recognition
US20160366528A1 (en) * 2015-06-11 2016-12-15 Sony Mobile Communications, Inc. Communication system, audio server, and method for operating a communication system
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face
CN105096940B (zh) * 2015-06-30 2019-03-08 百度在线网络技术(北京)有限公司 Method and device for performing speech recognition
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
CN106537493A (zh) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device, and cloud server
US9704489B2 (en) * 2015-11-20 2017-07-11 At&T Intellectual Property I, L.P. Portable acoustical unit for voice recognition
CN105895096A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Method and device for identity recognition and voice interactive operation
US10474800B2 (en) * 2016-11-16 2019-11-12 Bank Of America Corporation Generating alerts based on vehicle system privacy mode
CN106682090B (zh) * 2016-11-29 2020-05-15 上海智臻智能网络科技股份有限公司 Device and method for implementing active interaction, and intelligent voice interaction equipment
US10573106B1 (en) * 2017-03-22 2020-02-25 Amazon Technologies, Inc. Personal intermediary access device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000080828A (ja) * 1998-09-07 2000-03-21 Denso Corp Vehicle control device
CN102645977A (zh) * 2012-03-26 2012-08-22 广东翼卡车联网服务有限公司 In-vehicle voice wake-up human-machine interaction system and method
CN102915731A (zh) * 2012-10-10 2013-02-06 百度在线网络技术(北京)有限公司 Personalized speech recognition method and device
CN103871409A (zh) * 2012-12-17 2014-06-18 联想(北京)有限公司 Speech recognition method, information processing method, and electronic device
CN105957523A (zh) * 2016-04-22 2016-09-21 乐视控股(北京)有限公司 Vehicle-mounted system control method and device
CN106218557A (zh) * 2016-08-31 2016-12-14 北京兴科迪科技有限公司 Vehicle-mounted microphone with speech recognition control

Also Published As

Publication number Publication date
US20190180756A1 (en) 2019-06-13
US11355124B2 (en) 2022-06-07
CN109102801A (zh) 2018-12-28

Similar Documents

Publication Publication Date Title
US11694679B2 (en) Wakeword detection
WO2018233300A1 (zh) Speech recognition method and speech recognition device
US11232788B2 (en) Wakeword detection
CN111741884B (zh) Traffic distress and road rage detection method
CN110660201B (zh) Arrival reminder method, device, terminal, and storage medium
CN110027409B (zh) Vehicle control device, vehicle control method, and computer-readable recording medium
JP6977004B2 (ja) In-vehicle device, and method and program for processing utterances
US11514900B1 (en) Wakeword detection
US20160267909A1 (en) Voice recognition device for vehicle
CN110880328B (zh) Arrival reminder method, device, terminal, and storage medium
US20210183362A1 (en) Information processing device, information processing method, and computer-readable storage medium
KR20210138181A (ko) Guide robot and method of operating the guide robot
CN112298104A (zh) Vehicle control method and device, storage medium, electronic device, and vehicle
JP2019182244A (ja) Speech recognition device and speech recognition method
CN114187637A (zh) Vehicle control method and device, electronic device, and storage medium
CN116890786A (zh) Vehicle lock control method, device, and medium
JP2019174757A (ja) Speech recognition device
KR101531873B1 (ko) Driver determination device and method
JP2019191477A (ja) Speech recognition device and speech recognition method
US11922538B2 (en) Apparatus for generating emojis, vehicle, and method for generating emojis
WO2023137908A1 (zh) Sound recognition method and device, medium, equipment, program product, and vehicle
WO2023144573A1 (ja) Speech recognition method and speech recognition device
US11531736B1 (en) User authentication as a service
US20210090591A1 (en) Security system
Khare et al. Multimodal interaction in modern automobiles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18821419

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.05.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18821419

Country of ref document: EP

Kind code of ref document: A1