WO2023283965A1 - Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium - Google Patents

Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Info

Publication number
WO2023283965A1
WO2023283965A1 · PCT/CN2021/106942 · CN2021106942W
Authority
WO
WIPO (PCT)
Prior art keywords
target
speech recognition
recognition model
model
user terminal
Prior art date
Application number
PCT/CN2021/106942
Other languages
French (fr)
Chinese (zh)
Inventor
殷实
黄韬
翟毅斌
伍朝晖
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180093163.7A (CN117178320A)
Priority to PCT/CN2021/106942 (WO2023283965A1)
Publication of WO2023283965A1

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F 3/16 - Sound input; Sound output
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 - Speech recognition
            • G10L 15/08 - Speech classification or search
            • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
            • G10L 15/26 - Speech to text systems
          • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility

Definitions

  • the present invention relates to the technical field of artificial intelligence, in particular to methods, devices, electronic equipment and media for voice listening and generating voice recognition models.
  • Embodiments of the present disclosure provide a solution for voice recognition, which realizes voice listening for personalized keywords.
  • A method for voice listening substitution is provided. The method is applied to a user terminal and includes: acquiring a target speech recognition model corresponding to a target keyword, the target speech recognition model being constructed based on the target keyword, and the target keyword being obtained according to the travel information of the user; updating the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being the speech recognition model stored in the user terminal; when the target condition is satisfied, recognizing the collected environmental sound according to the updated speech recognition model to obtain a recognition result, the environmental sound being the sound information collected in the environment where the user terminal is located; and prompting the user when the recognition result indicates that the target keyword exists in the environmental sound.
  • In this way, the user terminal can monitor the ambient sound for the target keywords of the travel information and remind the user when the ambient sound includes the voice of a target keyword, thereby realizing an intelligent listening function in which the device listens in place of the human ear.
  • obtaining the target speech recognition model corresponding to the target keyword includes: obtaining the travel information of the user; extracting the target keyword related to the travel mode of the user according to the travel information; sending the target keyword to the server, for the server to construct the target speech recognition model according to the target keyword; and obtain the target speech recognition model from the server.
  • targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
  • the user terminal is a first user terminal and is connected to a second user terminal
  • the method further includes sending identification information to the second user terminal, the identification information being used to identify the first user terminal.
  • the acquiring of the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being obtained by the second user terminal from the server according to the target keyword; wherein the first user terminal is an audio playback device. In this way, intelligent proxy listening can be realized when the user uses an audio playback device (such as a headset).
  • In some embodiments, the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model; the decoding graph is a set of decoding paths under grammatically constrained rules determined by the target keyword; the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword; and the target language model is acquired based on the relationship between the words of the target keyword.
  • the acoustic model is generated by training with fusion features and the text information of target speech data; the fusion features are generated based on the target speech data and noise data, the target speech data is audio data that includes the target speech content, and the noise data is audio data that does not include the target speech content.
  • the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
  • training is performed using the fusion features and the text information of the target speech data.
  • the text information of the target speech data can be direct text content or other labeled data corresponding to the text content, such as phoneme sequences.
  • In some embodiments, the method further includes: acquiring location information associated with the user's travel mode according to the travel information; and recognizing the collected environmental sound according to the updated speech recognition model when the target condition is met specifically includes: recognizing the collected environmental sound according to the updated speech recognition model when the location of the user matches the location information. In this way, when the geographic-location target condition is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains the keywords, without user interaction, bringing a better user experience.
  • In some embodiments, the method further includes: acquiring time information associated with the user's travel mode according to the travel information; and recognizing the collected environmental sound according to the updated speech recognition model when the target condition is met specifically includes: recognizing the collected environmental sound according to the updated speech recognition model when a time condition is satisfied, the time condition being determined according to the time information.
  • the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • prompting the user includes playing a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
  • An apparatus for voice listening substitution is provided, including: a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, the target speech recognition model being constructed based on the target keyword, and the target keyword being obtained according to the travel information of the user; an update unit, configured to update the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being the speech recognition model stored in the user terminal; a sound recognition unit, configured to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met and obtain a recognition result, the environmental sound being the sound information collected in the environment where the user terminal is located; and a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword exists in the environmental sound.
  • the apparatus further includes: a target keyword acquisition unit, configured to acquire the travel information of the user and to extract the target keyword related to the user's travel mode according to the travel information; and a sending unit, configured to send the target keyword in the travel information to a server, so that the server can construct the target speech recognition model according to the target keyword.
  • the model acquiring unit is further configured to acquire the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
  • the user terminal is a first user terminal and is connected to a second user terminal.
  • the apparatus further includes a sending unit, configured to send identification information to the second user terminal, where the identification information is used to identify the first user terminal.
  • the model obtaining unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, and the target speech recognition model is obtained by the second user terminal from the server according to the target keyword .
  • the first user terminal is an audio playback device. In this way, it is possible to realize intelligent proxy listening when the user uses an audio playback device (such as earphones).
  • In some embodiments, the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model; the decoding graph is a set of decoding paths under grammatically constrained rules determined by the target keyword; the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword; and the target language model is acquired based on the relationship between the words of the target keyword.
  • the acoustic model is generated by training with fusion features and the text information of target speech data; the fusion features are generated based on the target speech data and noise data, the target speech data is audio data that includes the target speech content, and the noise data is audio data that does not include the target speech content.
  • the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
  • the device further includes a travel location information acquiring unit, configured to acquire location information associated with the travel mode of the user according to the travel information.
  • the sound recognition unit is further configured to: when the location of the user matches the location information, recognize the collected environmental sound according to the updated speech recognition model. In this way, when the target condition of the geographic location is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • the device further includes a travel time information obtaining unit, configured to obtain time information associated with the travel mode of the user according to the travel information.
  • the sound recognition unit is further configured to: when a time condition is satisfied, recognize the collected environmental sound according to the updated speech recognition model, the time condition being determined according to the time information.
  • the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • the prompting unit is further configured to: play a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
  • A method for generating a speech recognition model is provided, including: generating fusion acoustic features based on target speech data and noise data, the target speech data being audio data that includes the target speech content and the noise data being audio data that does not include the target speech content; generating the acoustic model by training with the fusion features and the text information of the target speech data; and constructing the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
  • the acoustic model trained using fused features can be used to accurately recognize speech in high-noise, strong-reverberation environments, thereby realizing personalized intelligent listening substitutes.
  • generating the fused acoustic feature includes: superimposing the target voice data and the noise data to obtain superimposed audio data; and obtaining the fused acoustic feature based on the superimposed audio data .
  • generating the fused acoustic feature includes: acquiring a first acoustic feature based on the target speech data; acquiring a second acoustic feature based on the noise data; and fusing the first acoustic feature and the second acoustic feature to obtain the fused acoustic feature.
  • the acquiring of the first acoustic feature based on the target speech data includes: generating a noisy acoustic feature from the target speech data; and generating the first acoustic feature by enhancing the noisy acoustic feature.
  • enhancing the noisy acoustic feature includes: performing LASSO transformation on the noisy acoustic feature; and performing bottleneck network processing on the LASSO transformed acoustic feature to obtain the first acoustic feature.
  • the obtaining of the fused acoustic feature based on the first acoustic feature and the second acoustic feature includes: superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and generating the fused acoustic feature by performing normalization processing on the superimposed acoustic feature.
  • acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature includes: acquiring the number of frames of the first acoustic feature, the number of frames being determined according to the duration of the target speech data; constructing a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimposing the first acoustic feature and the third acoustic feature to obtain the fused acoustic feature.
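  • As a non-authoritative illustration of the feature-level fusion described above, the following sketch assumes log-mel-style feature matrices shaped (frames, dims): the noise features are tiled or truncated to the frame count of the speech features (playing the role of the "third acoustic feature"), the two are superimposed, and the result is normalized. The function name and the mean/variance normalization are assumptions made for this sketch, not the exact procedure of the disclosure.

```python
import numpy as np

def fuse_features(speech_feats: np.ndarray, noise_feats: np.ndarray) -> np.ndarray:
    """Fuse speech and noise acoustic features, both shaped [frames, dims]."""
    n_frames = speech_feats.shape[0]              # frame count fixed by the speech duration
    reps = -(-n_frames // noise_feats.shape[0])   # ceiling division
    noise_matched = np.tile(noise_feats, (reps, 1))[:n_frames]  # the "third" feature

    fused = speech_feats + noise_matched          # superimpose frame by frame
    # per-dimension mean/variance normalization (one possible normalization choice)
    return (fused - fused.mean(axis=0)) / (fused.std(axis=0) + 1e-8)

# toy usage with random stand-in features
speech = np.random.randn(300, 40)   # e.g. 3 s of 40-dim features at a 10 ms hop
noise = np.random.randn(120, 40)
print(fuse_features(speech, noise).shape)  # (300, 40)
```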
  • the acoustic model is a neural network model
  • the training includes: extracting sound source features from a hidden layer of the acoustic model; and using the sound source features and the fused acoustic features as the input features of the acoustic model to train the acoustic model.
  • the constructing of the speech recognition model according to the acoustic model, the pronunciation dictionary, and the language model specifically includes: receiving a target keyword from a user terminal; obtaining a target pronunciation dictionary model from the pronunciation dictionary according to the target keyword; obtaining a target language model from the language model according to the target keyword; and constructing the speech recognition model by merging the acoustic model, the target pronunciation dictionary model, and the target language model.
  • a lightweight speech recognition model for specific keywords can be generated, which is suitable for user terminals with limited computing resources.
  • An apparatus for generating a speech recognition model is provided, including: a fusion unit, configured to generate fusion acoustic features based on target speech data and noise data, the target speech data being audio data that includes the target speech content and the noise data being audio data that does not include the target speech content; a training unit, configured to generate the acoustic model by training with the fusion features and the text information of the target speech data; and a speech recognition model construction unit, configured to construct the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
  • A method for voice listening substitution is provided, including: obtaining a target keyword related to the user's travel mode from the user's travel information; constructing a target speech recognition model corresponding to the target keyword; and sending the target speech recognition model to the user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the ambient sound.
  • target speech recognition models for specific keywords can be generated and deployed to achieve intelligent voice listening for specific keywords.
  • An apparatus for voice listening substitution is provided, including: a target keyword acquisition unit, configured to acquire a target keyword related to the user's travel mode from the user's travel information; a speech recognition model construction unit, configured to construct a target speech recognition model corresponding to the target keyword; and a sending unit, configured to send the target speech recognition model to the user terminal, the target speech recognition model being used to recognize the environmental sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the environmental sound.
  • a target speech recognition model for specific keywords can be generated and deployed, so as to realize intelligent voice listening for specific keywords.
  • An electronic device is provided, comprising: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions executable by the at least one computing unit, wherein the instructions, when executed by the at least one computing unit, cause the electronic device to perform the method according to the first aspect, the third aspect, or the fifth aspect of the present disclosure.
  • A computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in the first, third, or fifth aspect of the present disclosure.
  • A computer program product comprising computer-executable instructions is provided, wherein the computer-executable instructions, when executed by a processor, implement the method described in the first, third, or fifth aspect of the present disclosure.
  • FIG. 1 shows a schematic diagram of an example environment in which various embodiments according to the present disclosure can be implemented.
  • Fig. 2 shows a schematic block diagram of a speech recognition system according to an embodiment of the present disclosure.
  • Fig. 3 shows a schematic flowchart of a method for voice listening substitution according to an embodiment of the present disclosure.
  • FIG. 4 shows a schematic diagram of an example process of building and deploying a speech recognition model according to an embodiment of the present disclosure.
  • Fig. 5 shows a schematic flowchart of a method for generating an acoustic model according to an embodiment of the present disclosure.
  • Fig. 6 shows a schematic flow chart of a method for enhancing speech acoustic features according to an embodiment of the present disclosure.
  • Fig. 7 shows a schematic conceptual diagram of a method for generating fusion features according to an embodiment of the present disclosure.
  • Fig. 8 shows a schematic diagram of a feature fusion process according to an embodiment of the present disclosure.
  • FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
  • Fig. 10 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
  • Fig. 11 shows a schematic block diagram of an apparatus for generating a speech recognition model according to an embodiment of the present disclosure.
  • Fig. 12 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
  • Figure 13 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
  • the present disclosure provides voice listening technology.
  • the user terminal obtains a voice recognition model for identifying personalized keywords, and uses the voice recognition model to monitor keywords of travel information in environmental sounds.
  • The user is prompted when the speech recognition model recognizes the target keyword. That is to say, the speech recognition model takes over the user's monitoring of environmental sounds, provides the user with prompts about travel information, and achieves a better intelligent experience.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments according to the present disclosure can be implemented.
  • The application scenario according to the embodiments of the present disclosure is that a user terminal (for example, a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a notebook computer, etc.) in an environment with high noise and strong reverberation can recognize personalized content in non-human-voice broadcasts, such as flight numbers and train numbers, helping users monitor voice content in the environment. For example, when a user wears noise-canceling headphones and listens to music, the user terminal can identify keywords that the user is interested in within the external broadcast and send a reminder to the user, thereby realizing intelligent listening.
  • the example environment 100 includes a first user terminal 110 and a second user terminal 120 located on the user side and a server 130 located on the cloud side.
  • the first user terminal 110 and the second user terminal 120 as a whole can connect and communicate with the server 130 via various wired or wireless communication technologies, including but not limited to Ethernet, cellular network (4G, 5G, etc.), wireless local area network (eg, WiFi), Internet, Bluetooth, Near Field Communication (NFC), Infrared (IR), etc.
  • the first user terminal 110 may not be directly connected to the server 130 and the second user terminal 120 may be connected to the server 130 .
  • the first user terminal 110 may connect to the server 130 via the second user terminal 120 and communicate with the server 130 .
  • the first user terminal 110 can be connected to the second user terminal 120 through Bluetooth, infrared, NFC and other short-distance communication, and at the same time, the second user terminal communicates and transmits data with the server 130 through a wireless local area network, the Internet, or a cellular network.
  • the first user terminal 110 may be directly connected to the server 130 .
  • the first user terminal 110 can communicate and transmit data with the server 130 through a wireless local area network, the Internet, or a cellular network.
  • the first user terminal 110 and the second user terminal 120 can communicate and transmit data with each other.
  • the second user terminal 120 may transmit target keywords to the server 130 on the cloud side, such as the train number or flight number of the travel information.
  • the first user terminal 110 may receive the target speech recognition model for the target keyword from the server 130 .
  • the server 130 may generate a target speech recognition model, such as a decoding graph, according to the received target keywords.
  • the decoding graph is a lightweight speech recognition model that is easy to deploy on user terminals with limited computing resources.
  • the target speech recognition model is sent to the user side for deployment in the user terminal or to update the local speech recognition model of the user terminal, so as to realize intelligent listening on the user side, that is, to monitor whether there is a speech corresponding to the target keyword in the ambient sound.
  • Although the target keyword is sent from the second user terminal 120 to the server 130 and the target speech recognition model is received by the first user terminal 110, it should be understood that the target keyword can be sent to the server 130 from any user terminal, and the target speech recognition model can be sent to and deployed on any user terminal.
  • the first user terminal 110 is a noise-canceling headset
  • the second user terminal 120 is a smartphone
  • the first user terminal 110 is connected to the second user terminal 120 via Bluetooth.
  • an application may be installed on the second user terminal 120, such as an application related to the user's travel, a short message service application, or any other application that stores the user's future travel information.
  • the personalized information that the user wants to be intelligently listened to can be obtained by accessing the application of the second user terminal 120 .
  • the second user terminal 120 can automatically obtain the personalized information desired by the user from the specified application installed on it, for example, the above-mentioned application related to the user's travel, short message service application, etc., and send it to the server 130 , to be used to generate the target speech recognition model for the personalized information.
  • The intelligent listening agent according to the embodiments of the present disclosure can also be implemented using a single user terminal that sends the personalized information to the server 130 and receives the target speech recognition model from the server 130 for use in listening to voice content in the environment.
  • FIG. 2 shows a schematic block diagram of a speech recognition system 200 according to an embodiment of the present disclosure.
  • the speech recognition system 200 is used to generate, deploy and use a target speech recognition model for personalized target keywords to detect whether the target keywords exist in ambient sounds.
  • the speech recognition system 200 includes a first user terminal 110 and a second user terminal 120 on the user side, and a server 130 on the cloud side.
  • The first user terminal 110 may be an audio playback device (for example, noise-canceling headphones, a smart speaker, etc.) or a wearable device (for example, a smart watch, a wristband, etc.), and it can connect to the second user terminal 120 through short-range communication such as Bluetooth or infrared.
  • the second user terminal 120 may be a smart phone, a smart home appliance, a tablet computer, a notebook computer, etc., and it can be connected to the server 130 via a wired or wireless manner such as a wireless local area network, the Internet, or a cellular network.
  • the server 130 is configured to receive the personalized target keyword fed back from the second user terminal 120, and generate a target speech recognition model for the target keyword. Exemplary functional modules of the first user terminal 110, the second user terminal 120, and the server 130 are described below.
  • the second user terminal 120 includes a transmission communication module 122 , a keyword acquisition module 124 and a storage module 126 .
  • the transmission communication module 122 is used to transmit and receive data to and from the first user terminal 110 and the server 130 .
  • it communicates with the first user terminal 110 through bluetooth, near-field communication, infrared, etc., and communicates with the server 130 through a cellular network, a wireless local area network, and the like.
  • the keyword acquisition module 124 is used to acquire keywords as personalized information.
  • user travel information can be read from text messages or travel applications, and target keywords can be extracted therefrom.
  • the keyword acquisition module 124 is configured to extract keywords in the travel information, such as flight number/train number, etc., through compliance schemes (eg, designated applications authorized by users, such as travel applications or short message services, etc.).
  • The keyword acquisition module 124 may regularly access a designated application to acquire future travel information.
  • the travel information usually includes the traveler's name, flight number or train number, time information, location information, and so on.
  • the flight number or train number usually includes a character string composed of numbers and letters. Therefore, the flight number or train number in the travel information can be determined as a target keyword to be used for speech recognition.
  • the target keyword can be determined by, for example, a regular expression.
  • time and location information, etc. can also be obtained from travel information.
  • The storage module 126 can be used to store the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120 (for example, the identification information, address, etc. of the first user terminal 110), the target speech recognition model received from the server 130, and the request identifier.
  • the request identifier can be used as a unique identifier for a request to the server for the target speech recognition model.
  • the second user terminal 120 can determine whether the target speech recognition model is requested by itself according to the request identifier, thereby determining whether to receive or not.
  • the first user terminal 110 includes a transmission communication module 112 , a speech recognition model 114 and a prompt module 116 .
  • the transmission communication module 112 is used for sending and receiving data to and from the second user terminal 120 .
  • For example, it communicates with the second user terminal 120 through Bluetooth, near-field communication, infrared, and other means.
  • the transmission communication module 112 is also used for communicating with the server 130, for example via a cellular network or Wifi.
  • the speech recognition model 114 is generated based on one or more target keywords, and may be updated according to the target speech recognition model for new target keywords received from the server 130 .
  • The speech recognition model 114 may be configured to recognize a plurality of keywords by listening, at runtime, for whether the target keywords are included in the ambient sound. Updating the speech recognition model enables the updated speech recognition model 114 to monitor whether the ambient sound includes a new target keyword, for example by adding the new target keyword, or by replacing one of the existing target keywords (for example, the target keyword that has existed for the longest time) with the new target keyword.
  • the updated speech recognition model 114 can trigger the prompt module 116 to generate prompt information.
  • The prompt module 116 may cause the first user terminal 110 or the second user terminal 120 to issue audible or visual prompts.
  • the server 130 includes a transmission communication module 132 , a speech recognition model building module 134 , an offline acoustic model training module 136 , and a model library 138 .
  • The transmission communication module 132 is configured to receive the target keyword transmitted by the keyword acquisition module 124 and then forward it to the speech recognition model construction module 134.
  • the speech recognition model construction module 134 is configured to construct a customized target speech recognition model according to the received target keywords and the model library 138, and transmit the constructed target speech recognition model to the first user terminal 110 or the second user terminal 120.
  • The offline acoustic model training module 136 is configured to pre-train the acoustic model offline using a robust acoustic model training method in accordance with the training criteria of the speech recognition acoustic model.
  • The trained acoustic models may be stored in the model library 138. It should be noted that the operation of training the acoustic model can be performed offline, so it can be decoupled from the construction process of the speech recognition model construction module 134.
  • the acoustic model can be designed to be generated for an environment with high noise and strong reverberation, for example, based on fusion features to achieve more accurate speech recognition.
  • The model library 138 is configured to store trained models, including acoustic models trained offline on demand (obtained through the above-mentioned offline acoustic model training module 136), pronunciation dictionaries, language models, and the like. These models can all be trained offline and used by the speech recognition model construction module 134 to construct a target speech recognition model for the target keywords.
  • The speech recognition model construction module 134 can be configured to combine the pre-trained acoustic model, the pronunciation dictionary, and the language model in the model library 138 with the target keywords transmitted by the transmission communication module 132, and generate the target speech recognition model according to a keyword recognition model construction algorithm. It should be noted that the process of building the target speech recognition model has no strong dependence on the training operation of the offline acoustic model and can be executed asynchronously. Therefore, the speech recognition model construction module 134 can obtain a pre-trained acoustic model from the model library 138 to construct the target speech recognition model.
  • Although the first user terminal 110 and the second user terminal 120 are shown as separate devices in FIG. 2, they may also be implemented as the same device (as shown by the dotted line in the figure) to implement the intelligent listening substitution solution according to the embodiments of the present disclosure.
  • target keywords are acquired from a single user terminal, and a speech recognition model for the target keywords is deployed on the same user terminal.
  • Fig. 3 shows a schematic flowchart of a method 300 for voice listening substitution according to an embodiment of the present disclosure.
  • the method 300 may be implemented on the user terminal 110 as shown in FIG. 1 and FIG. 2 .
  • the user terminal 110 may be, for example, a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a notebook computer, etc., which have a sensor capable of receiving sound, such as a microphone.
  • the user terminal 110 obtains the target speech recognition model corresponding to the target keyword, the target speech recognition model is constructed by the server 130 according to the target keyword, and the target keyword is obtained according to the travel information of the user.
  • the user terminal 110 may receive the target speech recognition model from the connected user terminal 120 (such as a smart phone) via a wireless connection such as Bluetooth.
  • the user terminal 110 may directly receive the target speech recognition model from the server 130 .
  • the user's travel information indicates that the user will go to airports, stations and other transportation places to travel by plane or train, or to pick up and drop off other people at airports or stations.
  • Travel information usually includes information such as flight number or train number, location of traffic place, departure time or arrival time.
  • the target keyword in the travel information can be a character string representing flight number or train number, usually composed of letters and numbers.
  • For example, the travel information may include the following information: "June 2, 2021 at 7:45 am, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109", the location is "Beijing South Railway Station", and the time is "7:45 am on June 2, 2021". A minimal extraction sketch is shown below.
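  • As a rough illustration only, the snippet below parses a travel notification like the example above with regular expressions; the patterns and variable names are assumptions made for this sketch and are not the extraction logic actually used by the keyword acquisition module.

```python
import re

message = "June 2, 2021 at 7:45 am, G109, Beijing South to Shanghai Hongqiao"

# A flight or train number is typically a short string of letters and digits,
# e.g. "G109" or "CA1831"; this pattern is an illustrative assumption.
keyword_match = re.search(r"\b([A-Z]{1,2}\d{2,4})\b", message)
time_match = re.search(r"[A-Za-z]+ \d{1,2}, \d{4} at \d{1,2}:\d{2} ?[ap]m", message, re.I)

target_keyword = keyword_match.group(1) if keyword_match else None
travel_time = time_match.group(0) if time_match else None

print(target_keyword)  # G109
print(travel_time)     # June 2, 2021 at 7:45 am
```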
  • the target speech recognition model is constructed by the server 130 based on the received target keywords.
  • For example, the target keyword may be obtained by another user terminal 120 (e.g., a smartphone) connected to the user terminal 110 (e.g., a noise-canceling headset) and sent to the server 130; the travel information of the user is obtained by accessing the travel application, SMS, or other authorized applications on the user terminal 120.
  • the user terminal may send target keywords in the travel information, such as flight number or train number, to the server 130 .
  • the server 130 may construct a target speech recognition model for recognizing the target keyword based on the received target keyword and transmit the constructed target speech recognition model to the user terminal 110 , which will be described below with reference to FIG. 4 .
  • The local speech recognition model is updated according to the target speech recognition model to obtain an updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal.
  • The local speech recognition model 114 of the user terminal 110 can already recognize one or more keywords, but the target keyword can be recognized only after the update.
  • The local speech recognition model and the target speech recognition model may be, for example, decoding graphs.
  • the decoding graph is a collection of decoding paths of grammatical constraint rules determined by keywords to be recognized. The details of the decoding graph will be described in the section "Generation and Deployment of Speech Recognition Model" below, which will not be described in detail for now.
  • The decoding path of the target speech recognition model for the target keyword is added to the local speech recognition model, so that the updated local speech recognition model can recognize the target keyword.
  • Alternatively, an existing decoding path in the local speech recognition model, for example the decoding path of the keyword that has existed for the longest time, can be replaced with the decoding path of the target speech recognition model for the target keyword.
  • the target speech recognition model may be directly deployed as a local speech recognition model.
  • the local speech recognition model is dedicated to recognize the corresponding target keywords and can be updated later.
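  • The decoding-path update itself happens inside the decoding graph, but the terminal also has to track which keywords its local model currently covers. The bookkeeping sketch below assumes a simple fixed-capacity registry in which the keyword that has existed the longest is replaced first; the class, its capacity, and the "model handle" values are illustrative assumptions.

```python
from collections import OrderedDict

class KeywordRegistry:
    """Tracks which target keywords the local speech recognition model covers."""

    def __init__(self, capacity: int = 5):
        self.capacity = capacity
        self._keywords = OrderedDict()  # keyword -> decoding path / model handle

    def update(self, keyword: str, model_handle: object) -> None:
        if keyword in self._keywords:
            self._keywords.move_to_end(keyword)   # refresh an existing keyword
        elif len(self._keywords) >= self.capacity:
            self._keywords.popitem(last=False)    # evict the longest-existing keyword
        self._keywords[keyword] = model_handle

    def keywords(self):
        return list(self._keywords)

registry = KeywordRegistry(capacity=2)
registry.update("G109", "decode_path_1")
registry.update("CA1831", "decode_path_2")
registry.update("D310", "decode_path_3")   # evicts "G109", the oldest entry
print(registry.keywords())                 # ['CA1831', 'D310']
```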
  • The user terminal 110 determines whether the target condition is satisfied. If the target condition is met, then at block 330 the collected environmental sound is recognized according to the updated speech recognition model and a recognition result is obtained. That is to say, the updated speech recognition model is triggered to monitor broadcast sounds in the external environment only under appropriate conditions. Since the speech recognition model 114 may have existed in the user terminal 110 for some time, it is not necessary to start listening to the ambient sound immediately. Triggering the local speech recognition model only when certain target conditions are met satisfies the user's real listening needs and also saves the computing resources and power of the user terminal.
  • the target condition may be that the user's location matches the location information of the travel information.
  • travel information usually includes location information in addition to target keywords.
  • For example, the travel information may include the following information: "June 2, 2021, 7:45 am, G109, Beijing South to Shanghai Hongqiao"; then "Beijing South Railway Station" will be used as the location information.
  • When the user's location matches the location information, the updated speech recognition model is activated to recognize the collected environmental sound. In this way, when the geographical location condition is satisfied, the updated speech recognition model can be automatically used to recognize the keywords in the ambient sound without user interaction, bringing a better user experience.
  • the target condition may also be that when the current time is within a predetermined time period before the time information, the collected environmental sound is recognized according to the updated speech recognition model.
  • the time information is "7:45 AM on June 2, 2021".
  • When the current time falls within the predetermined time period before that time, the collected environmental sound is recognized using the updated speech recognition model.
  • During this period, airports or stations will typically broadcast the target keywords that the user expects the device to monitor on his or her behalf. In this way, when the time condition is met, the updated speech recognition model can be automatically used to recognize the keywords in the ambient sound without user interaction, bringing a better user experience.
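  • A minimal sketch of how the two trigger conditions could be checked is shown below, combining them with a logical OR (the disclosure allows using them alone or in combination); the simple string comparison for the location, the 2-hour window, and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def should_start_listening(current_location: str,
                           current_time: datetime,
                           travel_location: str,
                           departure_time: datetime,
                           window: timedelta = timedelta(hours=2)) -> bool:
    """Return True when either target condition derived from the travel info is met."""
    location_ok = current_location == travel_location                   # location condition
    time_ok = timedelta(0) <= departure_time - current_time <= window   # time condition
    return location_ok or time_ok

departure = datetime(2021, 6, 2, 7, 45)
print(should_start_listening("Beijing South Railway Station",
                             datetime(2021, 6, 2, 6, 30),
                             "Beijing South Railway Station",
                             departure))   # True: both conditions hold
```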
  • the location information and time information of the user may be provided by the user terminal 110 itself, or may be obtained from other devices, such as another user terminal 120 connected to the user terminal 110 .
  • the execution of the speech recognition model may be triggered by the user terminal 110 itself or other terminals, such as the user terminal 120 (for example, a trigger signal is sent through a Bluetooth connection).
  • the above target conditions for triggering the speech recognition model may be used alone or in combination.
  • the execution of the speech recognition model may also be manually triggered by the user, for example through a manual button.
  • the button can be set on the user terminal 110 as an intelligent listening device, or on another user terminal 120, or in the application of the user terminals 110 and 120 .
  • the speech recognition model 114 of the user terminal 110 is capable of recognizing multiple keywords. In this case, the user may select some or all of them for identification, or automatically select and identify the latest updated target keywords.
  • the collected environmental sound is recognized according to the updated speech recognition model, and a recognition result is obtained.
  • the microphone of the user terminal 110 is turned on to collect the external ambient sound.
  • the collected ambient sound is recognized locally in the user terminal 110 through the speech recognition model.
  • the identification can be done in real time or near real time.
  • the collected environmental sound can be directly input to the speech recognition model, and the speech recognition model can judge whether it is the text of the target keyword, for example, through the decoding path of the decoding map.
  • the collected ambient sound can also be buffered at the user terminal 110 and then read into the speech recognition model, and the buffered sound can last for about 10 seconds, 20 seconds, 30 seconds or more. Over time, if no target keyword is identified, the cached ambient sound may be gradually removed or overwritten.
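  • A tiny sketch of such a rolling buffer is given below, assuming 10 ms frames and a 30-second cap; the frame size and buffer duration are illustrative choices, not values mandated by the disclosure.

```python
from collections import deque

FRAME_MS = 10                         # assumed frame hop
BUFFER_SECONDS = 30                   # keep roughly the last 30 s of audio
MAX_FRAMES = BUFFER_SECONDS * 1000 // FRAME_MS

# oldest frames are discarded automatically once the buffer is full
audio_buffer: deque = deque(maxlen=MAX_FRAMES)

def on_new_frame(frame: bytes) -> None:
    """Append the latest microphone frame; older frames roll off the buffer."""
    audio_buffer.append(frame)

for _ in range(4000):                 # simulate 40 s of incoming frames
    on_new_frame(b"\x00" * 320)       # e.g. 160 samples * 2 bytes at 16 kHz
print(len(audio_buffer))              # 3000 frames, i.e. the last 30 s
```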
  • the initial value of the recognition result can be set to "No".
  • ambient sounds may be input to the speech recognition model frame by frame in time order.
  • The speech recognition model determines whether these speech frames correspond to the target keyword. If they match completely, it is determined that the target keyword is recognized; otherwise it is determined that the target keyword is not recognized, and the monitoring is restarted. For example, in the case that the target keyword is "G109", if the speech in the ambient sound includes "G107", then "G", "1", "0", "7" will be recognized in sequence. Before recognizing "7", the speech recognition model sequentially determines that the ambient sound matches the first part of the target keyword (because "G", "1", "0" are consistent with the target keyword).
  • When the next recognized character "7" does not match the target keyword, the speech recognition model immediately restarts listening and clears the recognized content "G", "1", and "0".
  • the associated cached data may be deleted and monitoring may be restarted.
  • When the target keyword is fully matched, the recognition result may be set as "Yes".
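  • The prefix-matching behaviour described above can be sketched as a small state machine. Feeding it one recognized character at a time is an illustrative simplification of consuming decoded units from the decoding graph; the class and its interface are assumptions made for this sketch.

```python
class KeywordMatcher:
    """Incrementally match recognized units against a target keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword
        self.matched = 0  # number of leading characters matched so far

    def feed(self, unit: str) -> bool:
        if unit == self.keyword[self.matched]:
            self.matched += 1
            if self.matched == len(self.keyword):
                self.matched = 0
                return True          # full match: target keyword recognized
        else:
            # mismatch: clear the partial match and restart listening
            self.matched = 1 if unit == self.keyword[0] else 0
        return False

matcher = KeywordMatcher("G109")
for unit in "G107G109":              # ambient speech contains "G107", then "G109"
    if matcher.feed(unit):
        print("target keyword recognized")   # printed once, after "...G109"
```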
  • the user device 110 prompts the user.
  • the form of the prompt may depend on the capabilities of the user terminal and user configuration.
  • prompts may include, but are not limited to, one or more of text, images, audio and video.
  • the prompt may be playing a specified reminder sound, a specific recording, or playing a voice corresponding to the target keyword.
  • the prompt may be a card pop-up window, a banner display, and the like.
  • the notification may be any one or a combination of some of the above.
  • the user terminal 110 may also provide the prompt to other connected user terminals 120 .
  • the reminder is provided via the Bluetooth communication protocol between the user terminal 110 and the user terminal 120 . In this way, notifications can be presented on user terminals deployed with speech recognition models or on other devices to achieve better notification effects.
  • The above describes the user terminal 110 as an intelligent listening device, but it should be understood that the intelligent listening function can also be implemented in other user terminals (for example, the user terminal 120).
  • For example, the user terminal 120 sends the target keyword to the server 130, receives the speech recognition model from the server 130, and uses the speech recognition model to listen to the voice content in the environment without forwarding the speech recognition model to the user terminal 110.
  • the speech recognition model according to the embodiments of the present disclosure is a lightweight model deployed on user terminals with limited computing resources. Moreover, this speech recognition model is customized by the user and is aimed at a specific target keyword. The process of generating and deploying a speech recognition model according to an embodiment of the present disclosure is further described below with reference to FIG. 4 .
  • The speech recognition model for identifying target keywords is constructed by the server 130 and deployed on either of the user terminals 110 and 120, such as a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, or a notebook computer.
  • the user terminals 110 and 120 can use the speech recognition model to recognize whether a speech containing keywords is played in the surrounding environment, especially in a high-noise environment.
  • FIG. 4 shows a schematic diagram of an example process 400 of building and deploying a speech recognition model according to an embodiment of the disclosure.
  • FIG. 4 shows an example of deploying a speech recognition model on a first user terminal 110 as shown in FIGS. 1 and 2 , wherein the first user terminal 110 is connected to a second user terminal 120 via short-range communication such as Bluetooth.
  • the speech recognition model may be deployed on the second user terminal 120, or on the single terminal if there is only one user terminal, without departing from the scope of the embodiments of the present disclosure.
  • the first user terminal 110 may send its own identification information to the second user terminal 120 when establishing a connection with the second user terminal.
  • the second user terminal 120 may store the identification information locally so as to transmit data to the first user terminal 110 later, such as a target speech recognition model or other information.
  • the second user terminal 120 may obtain 410 a target keyword that the user wants to identify.
  • the target keyword text may be a keyword in the user's travel information, such as the flight number or train number that the user will take.
  • the travel information may include the following information: "June 2, 2021 at 7:45 am, G109, from Beijing south to Shanghai Hongqiao", correspondingly, the target keyword is "G109".
  • The keywords in the travel information can be extracted through a compliance scheme (for example, a specified application authorized by the user, such as a travel application or a short message service), and it is also possible to obtain the target keywords by accessing short messages from a specified sender (for example, airlines or train operators).
  • the target keyword may be obtained automatically without manual input by the user.
  • For example, the target keyword can be extracted from the short messages or messages of the specified sender (for example, the transport operator) by accessing the short messages of the smartphone or the messages of the specified application.
  • the short message or message including flight number or train number may also include departure time information.
  • the keyword text can also be obtained according to such time information. For example, the nearest flight number or train number can be obtained as the target keyword. Alternatively, flight numbers or train numbers within a preset time period (for example, one day) from the current moment can also be used as the keyword text.
  • the second user terminal 120 may request 420 the speech recognition model for the target keyword from the server 130 .
  • the second user terminal 120 may send a request including the target keyword to the server 130 through a cellular network or a wireless local area network (such as WiFi).
  • the request may also include an identifier of the second user terminal 120 (e.g., IMSI, IMEI, or other unique identifier) and current connection information of the second user terminal, including but not limited to Bluetooth connection information (e.g., Bluetooth address, device identification, etc.), wireless local area network connection information (for example, wireless access point address, device identification, etc.), etc. Such information may be used to establish a point-to-point connection between the server 130 and the second user terminal 120 or the first user terminal 110 .
  • the request further includes a request identifier that can uniquely identify the request.
  • The request identifier may be generated by the second user terminal in any suitable manner, for example from one or more of the device identifier of the second user terminal (for example, IMSI, IMEI, or another unique identifier), the connection information of the first user terminal 110 connected to the second user terminal 120, a timestamp, and so on.
  • the request identifier can be used for the server 130 to broadcast the constructed speech recognition model.
  • the second user terminal 120 may locally create and maintain a mapping table.
  • The mapping table stores, in association with one another, the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120, and the generated request identifier.
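  • A sketch of how such a request identifier and mapping table could be maintained is shown below; hashing the device identifier, the headset connection information, and a timestamp together is only one plausible construction and is not mandated by the disclosure.

```python
import hashlib
import time

mapping_table = {}  # request_id -> (device_id, headset connection info)

def make_request_id(device_id: str, headset_addr: str) -> str:
    """Derive a request identifier from the device id, connection info, and a timestamp."""
    raw = f"{device_id}|{headset_addr}|{time.time_ns()}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def register_request(device_id: str, headset_addr: str) -> str:
    """Create a request identifier and record it in the local mapping table."""
    request_id = make_request_id(device_id, headset_addr)
    mapping_table[request_id] = (device_id, headset_addr)
    return request_id

req_id = register_request("imei-0123456789", "BT:AA:BB:CC:DD:EE")
print(req_id, mapping_table[req_id])
```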
  • the server 130 receives the request of the second user terminal 120, and builds 430 a speech recognition model for the target keyword based on the target keyword in the request.
  • the constructed speech recognition model is a lightweight decoding graph
  • the decoding graph is a set of decoding paths determined by grammatical constraint rules determined by target keywords.
  • The server 130 generates a decoding graph based on, for example, an HCLG (HMM + Context + Lexicon + Grammar) decoding graph construction process.
  • The server 130 builds a specific lightweight language model for the keyword, namely the target language model (G.fst), based on grammar rules (for example, JSpeech Grammar Format, "JSGF", grammar rules), n-gram statistical rules, and the like.
  • According to the target keyword, the server 130 only constrains the transition probabilities and connection weights between the words of the target keyword, and ignores the transition probabilities and connection weights between other language units.
  • the target language model is customized as a parameter set that only conforms to the grammatical constraints of the target keyword, so as to ensure the ability to recognize the target keyword. For example, the word combination of the target keyword is determined to have a higher occurrence probability, and the combination occurrence probability of other non-target keywords is set to 0.
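  • One simple way to realize such a keyword-only grammar is to give probability 1 to the transitions along the keyword's own unit sequence and probability 0 (implicitly) to everything else. The JSGF-style grammar string and the assumed spoken form of "G109" below are illustrative only; they are not the server's actual construction algorithm.

```python
# JSGF-style grammar that accepts only the target keyword (illustrative).
jsgf_grammar = """
#JSGF V1.0;
grammar keyword;
public <keyword> = G one zero nine;
"""

def keyword_bigrams(units):
    """Bigram table in which only the keyword's own transitions carry probability mass."""
    table = {}
    for prev, nxt in zip(["<s>"] + units, units + ["</s>"]):
        table[(prev, nxt)] = 1.0      # keyword path kept
    return table                      # every other word pair is implicitly 0

units = ["G", "one", "zero", "nine"]  # assumed spoken form of "G109"
print(keyword_bigrams(units))
```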
  • the server 130 selects, according to the target keyword, a specific pronunciation sequence from the pronunciation dictionary stored in the model library 138, and constructs the target pronunciation dictionary model (L.fst) in conjunction with the phoneme description file in the pronunciation dictionary. Because the pronunciation sequence is retrieved based on the target keyword, the number of retrieved words is greatly reduced compared with the original pronunciation dictionary.
  • the server 130 also obtains an acoustic model, such as an HMM model (H.fst), through offline training.
  • the server 130 combines the target language model, the target pronunciation dictionary model and the acoustic model to obtain a speech recognition model.
  • the speech recognition model uses the original acoustic model, a lightweight target language model constructed according to the target keywords, and a lightweight pronunciation dictionary model retrieved from the target keywords, so the constructed speech recognition model has a lightweight structure.
  • Compared with a generalized speech recognition model, it only includes the transition probabilities and connection weights for the target keyword, so the parameter scale is greatly reduced.
  • the speech recognition model may be a decoding graph as described above.
  • the server 130 composes the target language model (G.fst) and the target pronunciation dictionary model (L.fst) constructed above to generate a merged pronunciation dictionary and language model (LG.fst); LG.fst is then composed with the context model (C.fst) to generate CLG.fst; finally, the result is combined with the HMM model (H.fst) constructed above to generate the decoding graph model (HCLG.fst) as the speech recognition model for the target keyword.
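  • A purely illustrative sketch of the composition order is shown below; the compose() helper is a stand-in that only records the order G → LG → CLG → HCLG, whereas a real system would use weighted FST composition (for example with OpenFst), typically with determinization and minimization between steps.

```python
def compose(a: str, b: str) -> str:
    """Stand-in for FST composition; returns a string showing the order only."""
    return f"({b} o {a})"   # read "b composed with a"

G = "G.fst"      # target language model (keyword-constrained grammar)
L = "L.fst"      # target pronunciation dictionary (keyword pronunciations only)
C = "C.fst"      # context-dependency model
H = "H.fst"      # HMM / acoustic state topology

LG = compose(G, L)
CLG = compose(LG, C)
HCLG = compose(CLG, H)
print(HCLG)      # (H.fst o (C.fst o (L.fst o G.fst)))
```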
  • Embodiments of the present disclosure provide an acoustic model, which is suitable for recognizing non-human voice far-field broadcast speech in an environment with high noise and strong reverberation, and can significantly improve the accuracy of speech recognition.
  • the acoustic model will be described below with reference to FIGS. 5 to 9 .
  • the acoustic model can be trained offline or online.
  • the present disclosure is not intended to limit pronunciation dictionaries, types of target language models, or training procedures.
  • the server 130 transmits 440 the constructed target speech recognition model to the second user terminal 120 .
  • the server 130 may transmit the target speech recognition model in a peer-to-peer manner.
  • the server 130 establishes a point-to-point connection with the second user terminal using a cellular or WiFi communication protocol according to the identifier of the second user terminal 120 included in the request 420, and transmits 440 the target speech recognition model to the second user terminal.
  • the second user terminal 120 determines 450 the first user terminal 110 to deploy the speech recognition model according to the local connection information. Then, the second user terminal 120 transmits 460 the speech recognition model to the first user terminal 110 through the connection with the first user terminal 110 .
  • the server can also transmit the target speech recognition model through broadcasting.
  • the server 130 broadcasts the constructed target speech recognition model and the associated request identification.
  • the second user terminal 120 may compare the broadcast request identifier with the local mapping table to determine whether to receive the speech recognition model. If the request identifier cannot be found in the mapping table, the target speech recognition model is not received. If the request identifier is found, the corresponding target speech recognition model is received.
  • the second user terminal 120 may also determine the connected first user terminal 110 according to the request identifier.
  • the second user terminal 120 can use the request identifier to look up, in the mapping table, the connection information of the first user terminal 110 corresponding to the request identifier, such as the identification information of the first user terminal 110, so as to determine 450 the first user terminal 110 that is to receive the target speech recognition model.
  • the second user terminal 120 sends 460 the target speech recognition model to the determined first user terminal 110 .
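  • A minimal sketch of this broadcast-matching step on the second user terminal might look as follows; the dictionary layout and the send_to_peer callable are illustrative assumptions rather than a prescribed interface.

```python
def handle_broadcast(broadcast: dict, mapping_table: dict, send_to_peer) -> bool:
    """Sketch of the second user terminal's handling of a broadcast model.

    broadcast:     {"request_id": ..., "model": ...} as pushed by the server.
    mapping_table: request_id -> {"peer_bt_addr": ...}, stored when the request was sent.
    send_to_peer:  callable that forwards the model over the stored connection.
    Returns True if the broadcast was accepted and forwarded.
    """
    entry = mapping_table.get(broadcast["request_id"])
    if entry is None:
        return False                      # not our request: ignore the broadcast
    send_to_peer(entry["peer_bt_addr"], broadcast["model"])
    return True
```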
  • the first user terminal 110 may deploy the target speech recognition model or update the local speech recognition model based on it, and, when the target condition is satisfied, start executing 470 the updated speech recognition model to monitor whether the target keyword exists in the environmental sound, as in the process 300 described above with reference to FIG. 3.
  • FIG. 4 describes the process of transmitting the target speech recognition model from the server 130 to the first user terminal 110 via the second user terminal 120 .
  • the first user terminal 110 may have the capability of directly communicating with the server 130 . Therefore, it is also possible to directly transmit the target speech recognition model from the server 130 to the first user terminal 110 .
  • the server 130 can locate the first user terminal 110 by using the information about the first user terminal 110 reported by the second user terminal 120 (for example, Bluetooth connection information, wireless local area network connection information, etc.), and directly transmit the target speech recognition model to the first user terminal 110.
  • the second user terminal 120 may also not transmit the received speech recognition model to the first user terminal 110, but execute the speech recognition model by itself to realize the voice listening function.
  • the speech recognition model for target keywords is used to recognize a broadcast speech in ambient sound of an airport or a train station.
  • identifying such ambient sounds is challenging.
  • the airport broadcast source is usually far away from the user's sound pickup device, so the received signal suffers strong reverberation interference.
  • broadcast sounds are basically synthesized according to fixed templates, which are quite different from standard human voices in Mandarin.
  • there are various noises such as the conversations of other passengers in the lobby, and the environment is extremely complicated. Therefore, it is desirable to provide a solution for accurately identifying broadcast voice content in a complex background noise environment by using a user terminal in a noisy environment.
  • FIG. 5 shows a schematic diagram of a method 500 for generating an acoustic model according to an embodiment of the present disclosure.
  • Method 500 includes, at block 510, collecting sound data in a noisy location.
  • sound data is collected from the noisy environment to generate training data for training and building an acoustic model.
  • Sound collection locations may include, but are not limited to, counter halls, security check passages, waiting halls, convenience stores, dining areas, public restrooms, etc., so as to cover the areas that users can reach (for example, within a location coverage radius R).
  • the way of sound collection can be to turn off the front-end gain of the recording device, and record continuously (for example, for 24 hours), so as to ensure that background noise without broadcast sound can be recorded at various locations.
  • static recording can be adopted, and the sound collection device is fixed and continuously and uninterruptedly recorded.
  • dynamic recording can also be used, where a person or machine moves the acquisition device in a noisy place and records continuously and uninterruptedly.
  • the recording format may be, for example, wav format, 16kHz, 16bit, multi-channel, etc., but is not limited thereto.
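  • As a hedged example, one possible way to capture such recordings (16 kHz, 16-bit, multi-channel, long continuous takes with no front-end gain processing) is sketched below; the libraries, chunk length, and channel count are assumptions rather than the tooling used in the disclosure.

```python
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16_000     # 16 kHz
CHANNELS = 2             # multi-channel pickup
CHUNK_SECONDS = 60 * 10  # write ten-minute chunks so a 24 h session stays manageable

def record_chunk(path: str) -> None:
    frames = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                    samplerate=SAMPLE_RATE, channels=CHANNELS, dtype="int16")
    sd.wait()                                   # block until the chunk is captured
    sf.write(path, frames, SAMPLE_RATE, subtype="PCM_16")

# record_chunk("waiting_hall_chunk_000.wav")   # repeat per location / per chunk
```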
  • Acoustic features can be acquired as described above, or by other means, such as accessing existing noisy speech features or various types of existing noise features, without the need for dedicated on-site acquisition
  • the voice data is preprocessed to obtain target voice data and noise data.
  • the collected original sound data includes broadcast voices in a part of time periods, while other time periods do not include broadcast voices.
  • the preprocessing may include manually or by machine dividing the original sound data into audio data including the target speech content and audio data not including the target speech content, and marking them respectively.
  • the target voice data is marked with the location information from which the data comes and the text of the target voice data, for example including the flight number or train number. For noise data, only the location information of the noise data is marked.
  • acoustic features of the speech data and noise data are extracted.
  • Acoustic features can be extracted by performing framing, windowing, FFT and other processing on the marked speech data and noise data.
  • the acoustic features can be represented by, for example, Mel-frequency cepstral coefficients (MFCC), but are not limited thereto. Each 10 ms of audio is taken as a frame, each frame has a corresponding set of parameters, and each parameter has a value between 0 and 1. That is, both the target speech data and the noise data can be represented as a series of frames lasting for a period of time, where each frame is characterized by a set of parameters with values between 0 and 1.
  • the acoustic features extracted from the target speech data after processing such as framing, windowing, and FFT are noisy features.
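  • A minimal sketch of this feature-extraction step (framing, windowing, FFT, and MFCC computation) is given below; the 25 ms window, 10 ms hop, 40 coefficients, and per-utterance min-max scaling into [0, 1] are assumptions consistent with the description above, not a prescribed pipeline.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16_000, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)     # 25 ms window, 10 ms hop
    mfcc = mfcc.T                                              # shape: (frames, n_mfcc)
    lo, hi = mfcc.min(), mfcc.max()
    return (mfcc - lo) / (hi - lo + 1e-8)                      # values in [0, 1]
```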
  • the noisy features can be enhanced to obtain speech acoustic features that are as clean as possible, thereby reducing the adverse effect of noise on recognition.
  • FIG. 6 shows a schematic flowchart of a method 600 for enhancing speech acoustic features according to an embodiment of the present disclosure.
  • LASSO transform is performed on the input noisy speech acoustic features to perform reverberation suppression on the acoustic features.
  • Reverberation refers to the phenomenon that, when sound waves propagating indoors are repeatedly reflected and absorbed by obstacles such as walls, ceilings, and floors, the sound persists for some time after the source stops sounding. Reverberation is not conducive to accurate recognition of the content in speech.
  • LASSO transformation is also known as LASSO regression.
  • By constraining the correlation between the important variables in the acoustic features (that is, variables whose coefficients are not 0) and the other variables, LASSO regression can remove the acoustic feature components related to reverberation, thereby suppressing its adverse effects.
  • bottleneck network processing is performed on the acoustic features of the reverberation-suppressed speech data.
  • a bottleneck network is a neural network model that includes a bottleneck layer.
  • the bottleneck layer has fewer nodes than the preceding layers and can be used to obtain a lower-dimensional representation of the input.
  • the dimensionality of the acoustic features processed by the bottleneck network is reduced, which helps achieve a lower loss during training.
  • the coefficients of the bottleneck network can be precomputed or updated during training.
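  • The two enhancement stages can be sketched as follows; this is an interpretation under stated assumptions, in which LASSO regression models each frame as a sparse linear function of the preceding frames (a rough stand-in for the reverberant tail) and a small bottleneck network compresses the result. Layer sizes, the context length, and the regularization strength are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Lasso

def lasso_dereverb(feats: np.ndarray, context: int = 5, alpha: float = 0.01) -> np.ndarray:
    """feats: (frames, dims) noisy acoustic features, values in [0, 1]."""
    out = feats.copy()
    for d in range(feats.shape[1]):
        # Each frame regressed on its `context` preceding frames in the same dimension.
        X = np.stack([np.roll(feats[:, d], k) for k in range(1, context + 1)], axis=1)
        X[:context] = 0.0
        model = Lasso(alpha=alpha).fit(X, feats[:, d])
        out[:, d] = feats[:, d] - model.predict(X)      # remove the predicted reverberant tail
    return out

class Bottleneck(nn.Module):
    """Small network with a narrow hidden (bottleneck) layer for dimensionality reduction."""
    def __init__(self, dims: int = 40, bottleneck: int = 16):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dims, 64), nn.ReLU(),
                                    nn.Linear(64, bottleneck), nn.ReLU())
        self.decode = nn.Linear(bottleneck, dims)       # trained to reconstruct clean frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```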
  • through the speech enhancement 600 shown in FIG. 6, speech acoustic features with background noise are transformed into speech features that are as clean as possible. These clean speech features can then be fused with noise features from multiple locations to generate fused features.
  • fusion features are generated from the speech acoustic features and the noise acoustic features. Fusion features can reduce the impact, on recognition accuracy, of differences in the type and level of background noise across different places or across different positions within the same place. According to an embodiment of the present disclosure, fusion features are generated by aligning speech features and noise features frame by frame.
  • FIG. 7 shows a schematic conceptual diagram of a method 700 for generating fusion features according to an embodiment of the present disclosure.
  • the target speech data divided from the original data undergoes feature extraction 710 and speech enhancement 720 to generate enhanced speech features.
  • the noise data is uniformly sampled to obtain sampling noise at multiple positions (for example, position 1 to position N).
  • feature extraction 710 is performed on these sampled noises from multiple locations to generate noise features.
  • Feature extraction 710 may be performed as described with reference to block 530, including framing, windowing, FFT, and the like.
  • the acoustic features of the speech data and the acoustic features of the noise data may have the same frame size, for example, both are 10 ms, so that they can be fused frame by frame.
  • the enhanced speech acoustic features and noise features have the same frame size, say 10ms, so the frame-by-frame alignment of speech features and noise features can produce temporally aligned fused features.
  • all sampled noise features (for example, noise features from positions 1 to N) can be superimposed on the enhanced speech features frame by frame to form fusion features.
  • each frame is characterized by a set of parameters with values between 0 and 1, namely vectors, and superposition refers to adding the corresponding parameters of speech acoustic features and noise features through vector addition.
  • a frame in the fusion feature is also represented by a corresponding 40-dimensional vector.
  • the value of the superimposed parameter may exceed the range of 0 to 1.
  • a global normalization process can be performed so that the value of the parameter of the fusion feature is still in the range of 0 to 1.
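  • A minimal sketch of the frame-by-frame superposition and global renormalization might look as follows (assuming the noise features have already been aligned to the speech features, as described next); the helper name and epsilon are illustrative.

```python
import numpy as np

def fuse(speech: np.ndarray, noises: list[np.ndarray]) -> np.ndarray:
    """speech: (L, D) enhanced speech features; noises: list of (L, D) aligned noise features."""
    fused = speech + np.sum(np.stack(noises, axis=0), axis=0)   # per-frame vector addition
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)                      # global normalization back into [0, 1]
```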
  • the duration of speech data may be different from that of noise data, and the duration of noise data may also be different for each location. Therefore, feature fusion also includes the alignment of speech data and noise data.
  • FIG. 8 shows a schematic diagram of a feature fusion process 800 according to an embodiment of the present disclosure.
  • Enhanced speech acoustic features 810 for feature fusion and noise features 820-1, 820-2, ... 820-N (collectively 820) from multiple locations in FIG. 8 are shown in a sequence of frames.
  • the enhanced speech acoustic features 810 in FIG. 8 include L frames. Since the duration of speech acoustic feature 810 and noise acoustic feature 820 may be different, noise feature 820 may include the same or a different number of frames than L.
  • noise signature 820-N may include, for example, R frames.
  • the noise acoustic feature 820 can be adjusted according to the duration of the speech acoustic feature 810, for example by selecting a subset of its frames or by extending its frames, so as to obtain an adjusted noise acoustic feature whose number of frames (or duration) is the same as that of the speech acoustic feature. After the two are aligned, the speech acoustic feature and the adjusted noise acoustic feature are superimposed.
  • the speech acoustic feature 810 and the noise acoustic feature 820 are superimposed frame by frame.
  • if the number of frames of the enhanced speech acoustic feature 810 is less than the number of frames of the noise acoustic feature 820 (L < R), the first L frames of the noise acoustic feature 820 can be selected to be superimposed with the enhanced speech acoustic feature, and the remaining R-L frames are discarded. It should be understood that the last L frames of the noise acoustic feature 820, the middle L frames, or L frames selected in any other way may instead be superimposed on the speech acoustic feature 810.
  • if the number of frames of the enhanced speech acoustic feature 810 is greater than the number of frames of the noise acoustic feature 820 (L > R), the frames of the noise acoustic feature 820 are reused for the remaining L-R speech frames: the first frame of the noise acoustic feature 820 is superimposed again on the (R+1)-th frame of the enhanced speech acoustic feature, the second frame on the (R+2)-th frame, and so on, until all frames of the speech acoustic feature 810 have been superimposed with frames of the noise feature 820.
  • the frame number R of the noise feature 820-N is smaller than the frame number of the speech acoustic feature, so its first frame is again superimposed on the corresponding frame of the speech acoustic feature.
  • FIG. 8 is only schematic, and the frame numbers of speech acoustic features and noise features are not necessarily the same as those shown in FIG. 8 .
  • the first frame of the enhanced speech acoustic feature 810 is superimposed with the first frames of the noise features 820-1, 820-2, ... 820-N to obtain the first frame of the fusion feature 830; the second frame is superimposed with the second frames of the noise features to obtain the second frame of the fusion feature 830; and so on, until the fusion feature 830 with L frames is generated.
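  • The frame-count alignment described above can be sketched as follows; truncation (L < R) and cycling (L > R) are handled in one step here, which is an implementation convenience rather than the disclosure's exact procedure.

```python
import numpy as np

def align_noise(noise: np.ndarray, n_speech_frames: int) -> np.ndarray:
    """noise: (R, D) noise feature; returns an (L, D) feature aligned to the speech frames."""
    reps = int(np.ceil(n_speech_frames / noise.shape[0]))
    return np.tile(noise, (reps, 1))[:n_speech_frames]   # cycle if too short, truncate if too long
```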
  • Fused features 830 are used to train the acoustic model.
  • the target speech data and noise data obtained in block 520 may be superimposed to obtain superimposed audio data; then the fused acoustic features may be obtained based on the superimposed audio data.
  • the superposition of target speech data and noise data can also be performed based on frame number alignment, and the extraction of fused acoustic features can be similarly performed.
  • the acoustic model is trained using the fused features and the text of the speech data.
  • the acoustic model may be based on a deep neural network (DNN) architecture.
  • the text of the voice data is the text marked in step 520, for example, including flight number or train number.
  • the fusion feature is the input of the acoustic model, and the text or the phoneme corresponding to the text is the labeled data corresponding to the fusion feature.
  • the acoustic model uses a multi-task architecture, including a sound source recognition task using sound source labels and a speech recognition task using speech labels.
  • FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
  • the architecture 900 includes a deep neural network 910, and the deep neural network 910 may include a plurality of hidden layers 912, 914, 916 and input and output layers (not shown). The deep neural network 910 may also include more or fewer hidden layers.
  • multi-task training can be performed on the deep neural network 910; specifically, the training targets of the deep neural network 910 are modified so that, in addition to the speech recognition labels, voiceprint (sound source) recognition labels are added as a further training target.
  • the output from the last hidden layer 916 of the deep neural network 910 can be obtained as the sound source feature.
  • the fusion feature and the sound source feature are spliced together as the input of the deep neural network 910.
  • the Y-dimensional sound source feature can be spliced with the X-dimensional fusion feature to form an X+Y-dimensional training feature, which can be used as the input of the deep neural network.
  • each round of iteration updates the input features with the sound source features generated in the previous round until the final training ends.
  • the sound source features input in the first iteration may be all set to 0.
  • the sound source features of broadcast voices can be extracted from the deep neural network as compensation for acoustic model learning, thereby more accurately picking up non-human voice broadcast voices.
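  • A hedged PyTorch sketch of such a multi-task acoustic model is given below: fused features (X dimensions) are concatenated with sound source features (Y dimensions, zeros in the first round and taken from the last hidden layer thereafter), and a phoneme head and a sound source head are trained jointly. All layer sizes, label counts, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, x_dim=40, y_dim=32, n_phones=200, n_sources=10):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(x_dim + y_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, y_dim), nn.ReLU(),        # last hidden layer: sound-source feature
        )
        self.phone_head = nn.Linear(y_dim, n_phones)    # speech recognition task
        self.source_head = nn.Linear(y_dim, n_sources)  # sound-source recognition task

    def forward(self, fused, source_feat):
        h = self.hidden(torch.cat([fused, source_feat], dim=-1))
        return self.phone_head(h), self.source_head(h), h.detach()  # h feeds the next round

# One illustrative training step (source features start as zeros, refreshed each round):
model = MultiTaskAcousticModel()
fused = torch.rand(8, 40)                    # batch of fused acoustic features
source = torch.zeros(8, 32)                  # first-iteration sound-source features
phone_y = torch.randint(0, 200, (8,))
source_y = torch.randint(0, 10, (8,))
phone_logits, source_logits, source = model(fused, source)
loss = nn.functional.cross_entropy(phone_logits, phone_y) + \
       0.3 * nn.functional.cross_entropy(source_logits, source_y)
loss.backward()
```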
  • a speech recognition model is constructed based on the acoustic model, pronunciation dictionary and language model.
  • the process of constructing a speech recognition model may include receiving a target keyword from a user terminal, generating a target language model and a target pronunciation dictionary model for the target keyword, and constructing the speech recognition model by merging the target language model, the target pronunciation dictionary model, and the acoustic model; for details, reference may be made to the above description of FIG. 4.
  • the offline trained acoustic model may be stored in the model library of the server.
  • the acoustic model and other models in the model library can be used to construct a speech recognition model for recognizing the target keyword.
  • This speech recognition model dedicated to specific keywords is lightweight and suitable for deployment to user equipment or smart listening devices.
  • Fig. 10 shows a schematic block diagram of an apparatus 1000 for voice listening substitution according to an embodiment of the present disclosure.
  • the device 1000 may be applied to a user terminal, such as the first user terminal 110 or the second user terminal 120.
  • the apparatus 1000 includes a model acquiring unit 1010, configured to acquire a target speech recognition model corresponding to a target keyword.
  • the target speech recognition model is constructed based on the target keywords, which are obtained based on the user's travel information.
  • the device 1000 also includes an updating unit 1020 .
  • the update unit is used to update the local speech recognition model according to the target speech recognition model to obtain the updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal.
  • the device 1000 also includes a sound recognition unit 1020.
  • the sound recognition unit 1020 is used to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met, and obtain a recognition result.
  • the environmental sound is the sound information collected in the environment where the user terminal is located.
  • the device 1000 also includes a prompt unit 1030 .
  • the prompting unit 1030 is configured to prompt the user when the recognition result indicates that there is a voice corresponding to the target keyword in the ambient sound.
  • the device 1000 further includes a target keyword acquiring unit.
  • the target keyword acquisition unit is used to acquire target keywords in the user's travel information.
  • the device 1000 also includes a sending unit.
  • the sending unit is used to send the target keywords in the travel information to the server, so that the server can construct a target speech recognition model according to the target keywords.
  • the model obtaining unit 1010 is also used to obtain the target speech recognition model from the server.
  • the user terminal is a first user terminal and is connected to a second user terminal
  • the method further includes: sending identification information to the second user terminal, the identification information being used to identify the first user terminal;
  • the acquiring of the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, where the target speech recognition model is obtained by the second user terminal from the server according to the target keyword; and the first user terminal is an audio playback device.
  • the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model.
  • a decoding graph is a collection of decoding paths of grammatically constrained rules determined by target keywords.
  • the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword.
  • the acoustic model is generated by: generating fusion acoustic features based on target speech data and noise data, where the target speech data is audio data including the target speech content and the noise data is audio data not including the target speech content; and training with the fusion features and the text information of the speech data to generate the acoustic model.
  • the travel information has associated location information
  • the sound recognition unit 1020 is further configured to recognize the collected environmental sound based on the updated speech recognition model when the user's location matches the location information of the travel information.
  • the travel information also has associated time information
  • the sound recognition unit 1020 is also used to recognize the collected environmental sound according to the updated speech recognition model when the current time is within a predetermined period of time before the time information .
  • the prompting unit 1030 is further configured to play the voice corresponding to the target keyword on the user terminal.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a notebook computer.
  • Fig. 11 shows a schematic block diagram of an apparatus 1100 for generating a speech recognition model according to an embodiment of the present disclosure.
  • Apparatus 1100 may be used in server 130, for example.
  • the device 1100 includes a fusion unit 1110 , a training unit 1120 and a speech recognition model construction unit 1130 .
  • the fusion unit 1110 is used to generate fusion acoustic features based on the target speech data and noise data.
  • the target speech data is audio data including the target speech content, and the noise data is audio data not including the target speech content.
  • the training unit 1120 is used for generating an acoustic model by performing training through the fusion feature and the text information of the speech data.
  • the speech recognition model construction unit 1130 is configured to construct the speech recognition model according to the acoustic model, pronunciation dictionary and language model.
  • the fusion unit 1110 is further configured to superimpose the target speech data and noise data to obtain superimposed audio data; and obtain fused acoustic features based on the superimposed audio data.
  • the fusion unit 1110 is also used to obtain the first acoustic feature based on the target speech data, and obtain the second acoustic feature based on the noise data; obtain the fusion acoustic feature based on the first acoustic feature and the second acoustic feature.
  • the fusion unit 1110 is further configured to generate noisy acoustic features from the target speech data; and generate first acoustic features by enhancing the noisy acoustic features.
  • the fusion unit 1110 is further configured to perform LASSO transformation on the acoustic features with noise, and perform bottleneck network processing on the acoustic features transformed by LASSO, so as to obtain the first acoustic features.
  • the fusion unit 1110 is further configured to superimpose the first acoustic feature and the second acoustic feature to obtain superimposed acoustic features; and generate a fusion acoustic feature by normalizing the superimposed acoustic features.
  • the fusion unit 1110 is also used to: obtain the number of frames of the first acoustic feature, the number of frames being determined according to the duration of the target voice data; construct a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimpose the first acoustic feature and the third acoustic feature to obtain the fusion acoustic feature.
  • the acoustic model is a neural network model
  • the training unit 1120 is used to extract sound source features from the hidden layer of the acoustic model; and use the sound source features and fusion acoustic features as the input features of the acoustic model to train the acoustic Model.
  • the speech recognition model construction unit 1130 is also used to: receive the target keyword from the user terminal; obtain the target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtain a target language model from the language model according to the target keyword; and construct the speech recognition model by combining the acoustic model, the target pronunciation dictionary model, and the target language model.
  • Fig. 12 shows an apparatus 1200 for voice listening substitution according to another embodiment of the present disclosure.
  • the apparatus 1200 can be applied to the server 130 .
  • the device 1200 includes a target keyword acquisition unit 1210 , a speech recognition model construction unit 1220 and a sending unit 1230 .
  • the target keyword acquisition unit 1210 is configured to acquire target keywords related to the user's travel mode in the user's travel information.
  • the speech recognition model construction unit 1220 is configured to construct a target speech recognition model corresponding to the target keyword.
  • the sending unit 1230 is configured to send the speech recognition model to the user terminal, and the speech recognition model is used to recognize the environmental sound at the user terminal when the target condition is met, so as to determine whether the environmental sound contains the target keyword.
  • FIG. 13 shows a schematic block diagram of an example device 1300 that may be used to implement embodiments of the present disclosure.
  • the device 1300 includes a central processing unit (CPU) 1301, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 1302 or loaded from a storage unit 1308 into a random-access memory (RAM) 1303.
  • in the RAM 1303, various programs and data necessary for the operation of the device 1300 can also be stored.
  • the CPU 1301, ROM 1302, and RAM 1303 are connected to each other through a bus 1304.
  • An input/output (I/O) interface 1305 is also connected to the bus 1304 .
  • a plurality of components in the device 1300 are connected to the I/O interface 1305, including: an input unit 1306, such as a keyboard, a mouse, etc.; an output unit 1307, such as various types of displays, speakers, etc.; a storage unit 1308, such as a magnetic disk, an optical disk, etc.; and a communication unit 1309, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the various procedures and processes described above can be executed by the central processing unit 1301.
  • the various procedures and processes described above may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1308 .
  • part or all of the computer program may be loaded and/or installed on the device 1300 via the ROM 1302 and/or the communication unit 1309.
  • when the computer program is loaded into the RAM 1303 and executed by the CPU 1301, one or more actions of the procedures and processes described above may be performed.
  • the present disclosure may be a method, apparatus, system and/or computer program product.
  • a computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for listening to a speech by using a device instead of ears. The method is applied to a user terminal (110, 120) and comprises: obtaining a speech recognition model corresponding to a target keyword, the speech recognition model being constructed according to the target keyword, and the target keyword being obtained according to travel information of a user (310); updating a local speech recognition model according to a target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being a speech recognition model stored in a user terminal (320); when a target condition is satisfied (330), recognizing collected ambient sound according to the updated speech recognition model to obtain a recognition result (340), the ambient sound being sound information collected in an environment where the user terminal is located; and when the recognition result indicates that the target keyword exists in the ambient sound (350), prompting the user (360). According to the method for listening to a speech by using a device instead of ears, when the user cannot clearly hear the ambient sound, the user is helped to recognize the target keyword in the ambient sound, thereby implementing personal and intelligent listening by using a device instead of ears.

Description

用于语音代听和生成语音识别模型的方法、装置、电子设备和介质Method, device, electronic device and medium for voice listening and generating voice recognition model 技术领域technical field
本发明涉及人工智能技术领域,具体涉及用于语音代听和生成语音识别模型的方法、装置、电子设备和介质。The present invention relates to the technical field of artificial intelligence, in particular to methods, devices, electronic equipment and media for voice listening and generating voice recognition models.
背景技术Background technique
随着近年来深度学习技术和大规模集成电路、数字电路、信号处理、微电子技术的飞速发展,各类搭载语音识别技术的消费电子产品越来越普及。通过语言识别技术,电子产品可以接收语音指令,通过识别语音指令来执行用户想要的操作。With the rapid development of deep learning technology and large-scale integrated circuits, digital circuits, signal processing, and microelectronics technology in recent years, various consumer electronics products equipped with speech recognition technology are becoming more and more popular. Through language recognition technology, electronic products can receive voice commands, and perform operations desired by users by recognizing voice commands.
遗憾的是,现有的电子产品大多是识别厂商提供的语音命令,针对用户的个性化关键词的语音识别较为困难,而且通常需要人工输入待识别的个性化关键词,经机器掌握后才具备识别该关键词的能力。该方案依赖人工主动输入,在使用便利性上有所不足,而且需要较多计算资源。此外,现有语音识别技术在噪声环境下性能较差,例如,当希望在机场、火车站等高噪声、强混响环境下识别广播中的列车车次或航班号时,难以获取令人满意的效果。Unfortunately, most of the existing electronic products recognize the voice commands provided by the manufacturer. It is difficult to recognize the personalized keywords for users, and usually requires manual input of the personalized keywords to be recognized. The ability to identify the keyword. This solution relies on manual active input, which is not convenient for use, and requires more computing resources. In addition, the performance of existing speech recognition technology is poor in noisy environments. For example, when it is desired to recognize train numbers or flight numbers in broadcasts in airports, railway stations and other high noise and strong reverberation environments, it is difficult to obtain satisfactory results. Effect.
发明内容Contents of the invention
本公开的实施例提供了用于语音识别的方案,其实现了针对个性化关键词的语音代听。Embodiments of the present disclosure provide a solution for voice recognition, which realizes voice listening for personalized keywords.
根据本公开的第一方面,提供了一种用于语音代听的方法,所述方法应用于用户终端,包括:获取目标关键词对应的目标语音识别模型,所述目标语音识别模型为根据所述目标关键词构建,所述目标关键词为根据用户的出行信息获取;根据所述目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,所述本地语音识别模型为所述用户终端中存储的语音识别模型;当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,所述环境声音为在所述用户终端所处的环境中采集到的声音信息;以及当所述识别结果指示所述环境声音中存在所述目标关键词时,对所述用户进行提示。以此方式,能够在环境声音中检测出行信息的目标关键词,并检测到环境声音包括目标关键词语音时提醒用户,从而实现设备代替人耳的智能代听功能。According to the first aspect of the present disclosure, a method for voice listening substitution is provided, the method is applied to a user terminal, and includes: acquiring a target voice recognition model corresponding to a target keyword, and the target voice recognition model is based on the The target keyword is constructed, the target keyword is obtained according to the travel information of the user; the local speech recognition model is updated according to the target speech recognition model, and the updated speech recognition model is obtained, and the local speech recognition model is the The speech recognition model stored in the user terminal; when the target condition is satisfied, the collected environmental sound is recognized according to the updated speech recognition model, and the recognition result is obtained. sound information collected in the surrounding environment; and when the recognition result indicates that the target keyword exists in the environmental sound, prompting the user. In this way, it is possible to detect the target keywords of the travel information in the ambient sound, and remind the user when the ambient sound includes the voice of the target keyword, so as to realize the intelligent listening function of the device instead of the human ear.
在一些实施例中,获取目标关键词对应的目标语音识别模型包括:获取所述用户的出行信息;根据所述出行信息提取用户出行方式相关的目标关键词;向服务器发送所述目标关键词,以用于所述服务器根据所述目标关键词构建所述目标语音识别模型;以及从所述服务器获取所述目标语音识别模型。以此方式,能够在不需要用户交互的情况下,生成和部署针对个性化关键词的目标语音识别模型。In some embodiments, obtaining the target speech recognition model corresponding to the target keyword includes: obtaining the travel information of the user; extracting the target keyword related to the travel mode of the user according to the travel information; sending the target keyword to the server, for the server to construct the target speech recognition model according to the target keyword; and obtain the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
在一些实施例中,所述用户终端是第一用户终端并且连接到第二用户终端,该方法还包括向所述第二用户终端发送标识信息,所述标识信息用于标识所述第一用户终端。所述获取目标关键词对应的目标语音识别模型,具体为:基于所述标识信息从所述第二用户终端接收所述目标语音识别模型,所述目标语音识别模型为所述第二用户终端根据所述目标关键词从所述服务器获取;其中所述第一用户终端是音频播放设备。以此方式,能够在用户使用音频 播放设备(例如耳机)的情况下实现智能代听。In some embodiments, the user terminal is a first user terminal and is connected to a second user terminal, and the method further includes sending identification information to the second user terminal, the identification information being used to identify the first user terminal terminal. The acquiring the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being the second user terminal according to The target keyword is obtained from the server; wherein the first user terminal is an audio playback device. In this way, intelligent proxy listening can be realized when the user uses an audio playback device (such as a headset).
在一些实施例中,所述目标语音识别模型是基于声学模型、目标发音字典和目标语言模型而生成的解码图,所述解码图是由所述目标关键词确定的语法约束规则的解码路径集合,所述目标发音字典模型是基于所述目标关键词的发音序列而获取的,并且所述目标语言模型是基于所述目标关键词的字之间的关系而获取的。以此方式,能够生产轻量化的目标语音识别模型,以便于部署到具有较少计算资源的用户终端上。In some embodiments, the target speech recognition model is a decoding map generated based on an acoustic model, a target pronunciation dictionary, and a target language model, and the decoding map is a set of decoding paths of grammatically constrained rules determined by the target keyword , the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword. In this way, a lightweight target speech recognition model can be produced for easy deployment on user terminals with less computing resources.
在一些实施例中,所述声学模型通过融合特征和目标语音数据的文本信息进行训练而生成,所述融合特征基于目标语音数据和噪声数据而生成,所述目标语音数据为包括目标语音内容的音频数据,所述噪声数据为不包括所述目标语音内容的音频数据。以此方式,所生成的目标语音识别模型更能够精确地识别高噪声、强混响环境中的语音,从而实现了个性化智能代听。In some embodiments, the acoustic model is generated by training fusion features and text information of target speech data, the fusion features are generated based on target speech data and noise data, and the target speech data is text information including target speech content audio data, the noise data is audio data that does not include the target speech content. In this way, the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
在一些实施例中,通过融合特征和目标语音数据的文本信息进行训练,目标语音数据的文本信息可以是直接的文本内容也可以是对应于文本内容的其他标注数据,例如音素序列。In some embodiments, training is performed by fusing features and text information of the target speech data. The text information of the target speech data can be direct text content or other labeled data corresponding to the text content, such as phoneme sequences.
在一些实施例中,该方法还包括:根据所述出行信息获取与用户出行方式关联的位置信息;其中当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别包括:当所述用户的位置与所述位置信息匹配时,根据所述更新后的语音识别模型对采集到的环境声音进行识别。以此方式,在满足地理位置的目标条件时,自动使用更新后的语音识别模型来判断环境声音中是否包含关键词,而不需要用户交互,带来更好的使用体验。In some embodiments, the method further includes: acquiring location information associated with the user's travel mode according to the travel information; wherein when the target condition is met, the collected environmental sound is recognized according to the updated speech recognition model The method includes: when the location of the user matches the location information, recognizing the collected environmental sound according to the updated speech recognition model. In this way, when the target condition of the geographic location is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
在一些实施例中,该方法还包括:根据所述出行信息获取与用户出行方式关联的时间信息,其中当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别包括:当满足时间条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,所述时间条件为根据所述时间信息确定。在一些实施例中,时间条件可以是当前时间在所述时间信息之前的预定时间段内。以此方式,在满足时间的目标条件时,自动使用更新后的语音识别模型来判断环境声音中是否包含关键词,而不需要用户交互,带来更好的使用体验。In some embodiments, the method further includes: acquiring time information associated with the user's travel mode according to the travel information, wherein when the target condition is met, the collected environmental sound is recognized according to the updated speech recognition model The method includes: when a time condition is satisfied, the collected environmental sound is recognized according to the updated speech recognition model, and the time condition is determined according to the time information. In some embodiments, the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
在一些实施例中,对所述用户进行提示包括在所述用户终端上播放与所述目标关键词对应的语音。以此方式,用户能够针对感兴趣的个性化关键词收听到对应的提示。In some embodiments, prompting the user includes playing a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
在一些实施例中,所述目标关键词是列车车次或航班号。In some embodiments, the target keyword is train number or flight number.
在一些实施例中,所述用户终端是智能手机、智能家电、可穿戴设备、音频播放设备、平板电脑和笔记本电脑之一。In some embodiments, the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
根据本公开的第二方面,提供了一种用于代听的装置,包括:模型获取单元,用于获取目标关键词对应的目标语音识别模型,所述目标语音识别模型为根据所述目标关键词构建,所述目标关键词为根据用户的出行信息获取;更新单元,用于根据所述目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,所述本地语音识别模型为所述用户终端中存储的语音识别模型;声音识别单元,用于当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,所述环境声音为在所述用户终端所处的环境中采集到的声音信息;以及提示单元,用于当所述识别结果指示所述环境声音中存在所述目标关键词时,对所述用户进行提示。以此方式,能够在环境声音中检测出行信息的目标关键词,并检测到环境声音包括目标关键词语音时提醒用户,从而实现设备代替人耳的智能代听功能。According to a second aspect of the present disclosure, there is provided a device for listening on behalf of others, including: a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, and the target speech recognition model is based on the target key Word construction, the target keyword is obtained according to the travel information of the user; the update unit is used to update the local speech recognition model according to the target speech recognition model, and obtain the updated speech recognition model, and the local speech recognition model It is the speech recognition model stored in the user terminal; the sound recognition unit is used to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met, and obtain the recognition result, and the environmental sound is the sound information collected in the environment where the user terminal is located; and a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword exists in the environmental sound. In this way, it is possible to detect the target keywords of the travel information in the ambient sound, and remind the user when the ambient sound includes the voice of the target keyword, so as to realize the intelligent listening function of the device instead of the human ear.
在一些实施例中,所述装置还包括:目标关键词获取单元,用于获取所述用户的所述出行信息;目标关键词获取单元,用于根据所述出行信息提取用户出行方式相关的目标关键词;和发送单元,用于向服务器发送所述出行信息中的所述目标关键词,以用于所述服务器根据所述目标关键词构建。模型获取单元还用于从所述服务器获取所述目标语音识别模型。以此方式,能够在不需要用户交互的情况下,生成和部署针对个性化关键词的目标语音识别模型。In some embodiments, the device further includes: a target keyword acquisition unit, configured to acquire the travel information of the user; a target keyword acquisition unit, configured to extract a target related to the travel mode of the user according to the travel information keyword; and a sending unit, configured to send the target keyword in the travel information to a server, so that the server can construct it according to the target keyword. The model acquiring unit is further configured to acquire the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
In some embodiments, the user terminal is a first user terminal connected to a second user terminal. The apparatus further includes a sending unit configured to send identification information to the second user terminal, the identification information identifying the first user terminal. The model obtaining unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model having been obtained by the second user terminal from the server according to the target keyword. The first user terminal is an audio playback device. In this way, intelligent listening substitution can be realized while the user is using an audio playback device such as earphones.
In some embodiments, the target speech recognition model is a decoding graph generated from an acoustic model, a target pronunciation dictionary and a target language model. The decoding graph is a set of decoding paths for the grammar constraint rules determined by the target keyword; the target pronunciation dictionary model is obtained from the pronunciation sequence of the target keyword, and the target language model is obtained from the relationships between the characters of the target keyword. In this way, a lightweight speech recognition model can be generated, which is easy to deploy on user terminals with limited computing resources.
In some embodiments, the acoustic model is generated by training on fused features together with the text information of target speech data. The fused features are generated from the target speech data and noise data, where the target speech data is audio data that contains the target speech content and the noise data is audio data that does not contain the target speech content. In this way, the generated target speech recognition model can recognize speech more accurately in high-noise, strongly reverberant environments, thereby realizing personalized intelligent listening substitution.
In some embodiments, the apparatus further includes a travel location information obtaining unit configured to obtain, from the travel information, location information associated with the user's travel mode. The sound recognition unit is further configured to recognize the collected ambient sound according to the updated speech recognition model when the user's location matches the location information. In this way, when the geographic target condition is met, the updated speech recognition model is used automatically to determine whether the ambient sound contains the keyword, without user interaction, which provides a better user experience.
In some embodiments, the apparatus further includes a travel time information obtaining unit configured to obtain, from the travel information, time information associated with the user's travel mode. The sound recognition unit is further configured to recognize the collected ambient sound according to the updated speech recognition model when a time condition is satisfied, the time condition being determined from the time information. In some embodiments, the time condition may be that the current time is within a predetermined period before the time information. In this way, when the time-related target condition is met, the updated speech recognition model is used automatically to determine whether the ambient sound contains the keyword, without user interaction, which provides a better user experience.
In some embodiments, the prompting unit is further configured to play, on the user terminal, a voice corresponding to the target keyword. In this way, the user can hear a corresponding prompt for the personalized keyword of interest.
In some embodiments, the target keyword is a train number or a flight number.
In some embodiments, the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a laptop computer.
According to a third aspect of the present disclosure, a method for generating a speech recognition model is provided, including: generating fused acoustic features based on target speech data and noise data, the target speech data being audio data that contains the target speech content and the noise data being audio data that does not contain the target speech content; training on the fused features and the text information of the speech data to generate the acoustic model; and constructing the speech recognition model from the acoustic model, a pronunciation dictionary and a language model. In this way, an acoustic model trained with fused features can accurately recognize speech in high-noise, strongly reverberant environments, thereby realizing personalized intelligent listening substitution.
In some embodiments, generating the fused acoustic features includes: superimposing the target speech data and the noise data to obtain superimposed audio data; and obtaining the fused acoustic features from the superimposed audio data.
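By way of illustration only, the following sketch shows one way such superposition could be carried out: the noise waveform is mixed into the target speech waveform at a chosen signal-to-noise ratio, and simple log-spectral features are computed from the mixture. The sample rate, frame sizes and the use of log power spectra are assumptions made for the example and are not prescribed by this disclosure.

```python
# Illustrative sketch (not from the patent text): mixing target speech with noise
# at a chosen SNR and extracting simple log-spectral features from the mixture.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose noise onto speech so the result has roughly the given SNR."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def log_spectral_features(wave: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Frame the waveform and compute log power spectra (one feature vector per frame)."""
    n_frames = 1 + (len(wave) - frame_len) // hop   # assumes wave is longer than one frame
    frames = np.stack([wave[i * hop: i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    return np.log(spec + 1e-10)

# Usage: the features of the superimposed audio serve as the "fused acoustic features".
speech = np.random.randn(16000)   # placeholder for 1 s of target speech at 16 kHz
noise = np.random.randn(8000)     # placeholder for station/airport noise
fused = log_spectral_features(mix_at_snr(speech, noise, snr_db=5.0))
```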
In some embodiments, generating the fused acoustic features includes: obtaining a first acoustic feature from the target speech data; obtaining a second acoustic feature from the noise data; and obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature.
In some embodiments, obtaining the first acoustic feature from the target speech data includes: generating a noisy acoustic feature from the target speech data; and generating the first acoustic feature by enhancing the noisy acoustic feature.
In some embodiments, enhancing the noisy acoustic feature includes: applying a LASSO transform to the noisy acoustic feature; and processing the LASSO-transformed acoustic feature with a bottleneck network to obtain the first acoustic feature.
In some embodiments, obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature includes: superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and generating the fused acoustic features by normalizing the superimposed acoustic feature.
In some embodiments, obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature includes: obtaining the number of frames of the first acoustic feature, the number of frames being determined by the duration of the target speech data; constructing a third acoustic feature from the second acoustic feature according to the number of frames of the first acoustic feature; and superimposing the first acoustic feature and the third acoustic feature to obtain the fused acoustic features.
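A minimal sketch combining the two embodiments above is given below, under assumed feature shapes (frames × dimensions): the noise feature matrix is tiled or trimmed to the speech feature's frame count to form the third acoustic feature, the two are superimposed frame by frame, and the result is mean/variance normalized. The concrete shapes are illustrative only.

```python
# Illustrative sketch (assumed shapes, not from the patent text): fusing a speech
# feature matrix with a noise feature matrix at the feature level.
import numpy as np

def fuse_features(speech_feat: np.ndarray, noise_feat: np.ndarray) -> np.ndarray:
    n_frames = speech_feat.shape[0]                          # frames follow speech duration
    reps = int(np.ceil(n_frames / noise_feat.shape[0]))
    third_feat = np.tile(noise_feat, (reps, 1))[:n_frames]   # align noise to n_frames
    stacked = speech_feat + third_feat                        # superimpose frame by frame
    mean = stacked.mean(axis=0, keepdims=True)
    std = stacked.std(axis=0, keepdims=True) + 1e-8
    return (stacked - mean) / std                             # per-dimension normalization

fused = fuse_features(np.random.randn(98, 40), np.random.randn(60, 40))
```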
In some embodiments, the acoustic model is a neural network model, and the training includes: extracting sound source features from a hidden layer of the acoustic model; and training the acoustic model using the sound source features and the fused acoustic features as input features of the acoustic model.
In some embodiments, constructing the speech recognition model from the acoustic model, the pronunciation dictionary and the language model specifically includes: receiving a target keyword from a user terminal; obtaining a target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtaining a target language model from the language model according to the relationships between the characters of the target keyword; and constructing the speech recognition model by merging the acoustic model, the target pronunciation dictionary model and the target language model. In this way, a lightweight speech recognition model targeted at a specific keyword can be generated, which is suitable for user terminals with limited computing resources.
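The following conceptual sketch, using hypothetical data structures rather than the actual dictionary format of this disclosure, illustrates how the target pronunciation dictionary could be restricted to the entries needed for a keyword such as "G109"; the result would then be merged with the pre-trained acoustic model and the target language model into the target speech recognition model.

```python
# Conceptual sketch (assumed data structures, not the patent's actual pipeline):
# deriving the target pronunciation dictionary for the keyword "G109" from a full
# pronunciation dictionary.
full_lexicon = {            # hypothetical entries; real dictionaries are per-language
    "G": ["JH", "IY"],
    "1": ["W", "AH", "N"],
    "0": ["Z", "IH", "R", "OW"],
    "9": ["N", "AY", "N"],
}

def build_target_lexicon(keyword: str) -> dict:
    """Keep only the pronunciation sequences needed to spell out the keyword."""
    return {ch: full_lexicon[ch] for ch in keyword if ch in full_lexicon}

target_lexicon = build_target_lexicon("G109")
print(target_lexicon)
# A real system would merge this subset with the acoustic model and the target
# language model (e.g. in an HCLG-style construction) into the target model.
```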
According to a fourth aspect of the present disclosure, an apparatus for generating a speech recognition model is provided, including: a fusion unit configured to generate fused acoustic features based on target speech data and noise data, the target speech data being audio data that contains the target speech content and the noise data being audio data that does not contain the target speech content; a training unit configured to generate the acoustic model by training on the fused features and the text information of the speech data; and a speech recognition model construction unit configured to construct the speech recognition model from the acoustic model, a pronunciation dictionary and a language model.
According to a fifth aspect of the present disclosure, a method for voice listening substitution is provided, including: obtaining, from a user's travel information, a target keyword related to the user's travel mode; constructing a target speech recognition model corresponding to the target keyword; and sending the target speech recognition model to a user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when a target condition is met, so as to determine whether the target keyword is present in the ambient sound. In this way, a target speech recognition model for a specific keyword can be generated and deployed, realizing intelligent voice listening substitution for that keyword.
According to a sixth aspect of the present disclosure, an apparatus for voice listening substitution is provided, including: a target keyword obtaining unit configured to obtain, from a user's travel information, a target keyword related to the user's travel mode; a speech recognition model construction unit configured to construct a target speech recognition model corresponding to the target keyword; and a sending unit configured to send the target speech recognition model to a user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when a target condition is met, so as to determine whether the target keyword is present in the ambient sound. In this way, a target speech recognition model for a specific keyword can be generated and deployed, realizing intelligent voice listening substitution for that keyword.
According to a seventh aspect of the present disclosure, an electronic device is provided, including: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions to be executed by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the electronic device to perform the method according to the first, third or fifth aspect of the present disclosure.
According to an eighth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the program, when executed by a processor, implementing the method according to the first, third or fifth aspect of the present disclosure.
According to a ninth aspect of the present disclosure, a computer program product is provided, including computer-executable instructions which, when executed by a processor, implement the method according to the first, third or fifth aspect of the present disclosure.
Description of Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements, in which:
Fig. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented.
Fig. 2 shows a schematic block diagram of a speech recognition system according to an embodiment of the present disclosure.
Fig. 3 shows a schematic flowchart of a method for voice listening substitution according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an example process of building and deploying a speech recognition model according to an embodiment of the present disclosure.
Fig. 5 shows a schematic flowchart of a method for generating an acoustic model according to an embodiment of the present disclosure.
Fig. 6 shows a schematic flowchart of a method for enhancing speech acoustic features according to an embodiment of the present disclosure.
Fig. 7 shows a schematic conceptual diagram of a method for generating fused features according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a feature fusion process according to an embodiment of the present disclosure.
Fig. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
Fig. 10 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
Fig. 11 shows a schematic block diagram of an apparatus for generating a speech recognition model according to an embodiment of the present disclosure.
Fig. 12 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
Fig. 13 shows a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "include" and its variants are to be read as open-ended, that is, "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "one embodiment" or "the embodiment" are to be read as "at least one embodiment". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
With the popularity of user terminals such as smartphones, earphones, smart watches and wristbands, it is often difficult for a user wearing earphones or another user terminal to hear sounds in the surrounding environment clearly, which causes inconvenience in some scenarios. For example, when a user is waiting for a flight or a train at an airport or a railway station while wearing earphones to listen to music or watch a video, the user may not clearly hear the announcements broadcast at these places and may miss the flight or train.
As noted above, although some electronic products can already recognize speech, most of them recognize voice commands provided by the manufacturer, and speech recognition for a user's personalized keywords is relatively difficult; such products therefore cannot listen for flight numbers or train numbers in announcements. In addition, some personalized speech recognition techniques require the keyword to be recognized to be entered manually, and the keyword can only be recognized after the machine has learned it, which is inconvenient and requires considerable computing resources. In view of this, the present disclosure provides a voice listening substitution technique: the user terminal obtains a speech recognition model for recognizing personalized keywords, uses the model to listen for the travel-information keywords in the ambient sound, and prompts the user when the target keyword is recognized. In other words, the speech recognition model listens to the ambient sound on behalf of the user and prompts the user about travel information, providing a better intelligent experience.
Example Environment and System
Fig. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The application scenario of embodiments of the present disclosure is that a user terminal (for example, a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer) in a high-noise, strongly reverberant environment can recognize personalized content, such as a flight number or a train number, in broadcast announcements, helping the user monitor the speech content in the environment. For example, while the user is wearing noise-canceling earphones and listening to music, keywords of interest to the user in an external announcement can be recognized and the user can be reminded, thereby realizing intelligent listening substitution.
As shown in Fig. 1, the example environment 100 includes a first user terminal 110 and a second user terminal 120 on the user side and a server 130 on the cloud side. The first user terminal 110 and the second user terminal 120, as a whole, can connect to and communicate with the server 130 via various wired or wireless communication technologies, including but not limited to Ethernet, cellular networks (4G, 5G, etc.), wireless local area networks (for example, WiFi), the Internet, Bluetooth, near field communication (NFC) and infrared (IR).
According to embodiments of the present disclosure, the server 130 may be a distributed or centralized computing device or cluster of computing devices implemented in a cloud computing environment. The first user terminal 110 and the second user terminal 120 may each include any one or more of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a laptop computer, etc., and the two may be of the same or different types.
In some embodiments, the first user terminal 110 is not directly connected to the server 130, while the second user terminal 120 is connected to the server 130. In this case, the first user terminal 110 can connect to and communicate with the server 130 via the second user terminal 120. For example, the first user terminal 110 may connect to the second user terminal 120 through short-range communication such as Bluetooth, infrared or NFC, while the second user terminal communicates and exchanges data with the server 130 through a wireless local area network, the Internet or a cellular network.
In some embodiments, the first user terminal 110 may be directly connected to the server 130. For example, the first user terminal 110 may communicate and exchange data with the server 130 through a wireless local area network, the Internet or a cellular network. In addition, when the first user terminal 110 and the second user terminal 120 are connected to the same wireless local area network, they can communicate and exchange data with each other.
As shown in the figure, the second user terminal 120 can transmit a target keyword, such as the train number or flight number from the travel information, to the server 130 on the cloud side, and the first user terminal 110 can receive a target speech recognition model for that target keyword from the server 130. The server 130 can generate the target speech recognition model, for example a decoding graph, from the received target keyword. The decoding graph is a lightweight speech recognition model that is easy to deploy on user terminals with limited computing resources. The target speech recognition model is sent to the user side to be deployed on a user terminal or to update the local speech recognition model of a user terminal, thereby realizing intelligent listening substitution on the user side, that is, monitoring whether speech corresponding to the target keyword is present in the ambient sound. Although Fig. 1 shows the target keyword being sent from the second user terminal 120 to the server 130 and the target speech recognition model being received by the first user terminal 110, it should be understood that the target keyword may be sent to the server 130 from either user terminal and the target speech recognition model may be sent to and deployed on either user terminal.
By way of example and not limitation, the first user terminal 110 is a noise-canceling headset, the second user terminal 120 is a smartphone, and the first user terminal 110 is connected to the second user terminal 120 via Bluetooth. In this case, applications may be installed on the second user terminal 120, such as a travel-related application, a short message service application, or any other application that stores the user's future itinerary information. The personalized information for which the user wants intelligent listening substitution can be obtained by accessing the applications on the second user terminal 120. According to embodiments of the present disclosure, the second user terminal 120 can automatically obtain the personalized information desired by the user from designated applications installed on it, for example the above-mentioned travel-related application or short message service application, and send it to the server 130 for generating a target speech recognition model for that personalized information.
Although the first user terminal 110 and the second user terminal 120 are shown in Fig. 1 as separate devices, they may also be implemented as the same device (as indicated by the dashed line in the figure). In other words, intelligent listening substitution according to embodiments of the present disclosure can be realized with a single user terminal that sends the personalized information to the server 130 and receives the target speech recognition model from the server 130 for monitoring the speech content in the environment.
Fig. 2 shows a schematic block diagram of a speech recognition system 200 according to an embodiment of the present disclosure. The speech recognition system 200 is used to generate, deploy and use a target speech recognition model for a personalized target keyword, so as to detect whether the target keyword is present in the ambient sound. As shown in Fig. 2, the speech recognition system 200 includes the first user terminal 110 and the second user terminal 120 on the user side, and the server 130 on the cloud side. By way of example and not limitation, the first user terminal 110 may be an audio playback device (for example, a noise-canceling headset or a smart speaker) or a wearable device (for example, a smart watch or a wristband), which connects to the second user terminal 120 via Bluetooth, near field communication, infrared or the like. The second user terminal 120 may be a smartphone, a smart home appliance, a tablet computer, a laptop computer, etc., which can connect to the server 130 in a wired or wireless manner via a wireless local area network, the Internet, a cellular network or the like. The server 130 receives the personalized target keyword fed back from the second user terminal 120 and generates a target speech recognition model for that target keyword. Exemplary functional modules of the first user terminal 110, the second user terminal 120 and the server 130 are described below.
The second user terminal 120 includes a transmission communication module 122, a keyword acquisition module 124 and a storage module 126. The transmission communication module 122 is used to send data to and receive data from the first user terminal 110 and the server 130, for example communicating with the first user terminal 110 through Bluetooth, near field communication or infrared, and communicating with the server 130 through a cellular network or a wireless local area network.
The keyword acquisition module 124 is used to obtain keywords as personalized information. For example, it can read the user's travel information from short messages or a travel application and extract the target keyword from it. The keyword acquisition module 124 is configured to extract keywords, such as a flight number or train number, from the travel information through a compliant scheme (for example, a designated application authorized by the user, such as a travel application or the short message service). For example, the keyword acquisition module 124 may periodically access the designated application to obtain information about upcoming trips. Travel information usually includes the traveler's name, the flight number or train number, time information, location information and so on. Since a flight number or train number is usually a character string of digits and letters, the flight number or train number in the travel information can be determined as the target keyword to be used for speech recognition, for example by means of a regular expression. In addition, time and location information can also be obtained from the travel information.
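As an illustration of such extraction, the short sketch below pulls a train or flight number out of a travel-information text with a regular expression; the message format and the exact pattern are assumptions for the example, not part of this disclosure.

```python
# Illustrative sketch (hypothetical message format): extracting a train/flight
# number from a travel-information text with a regular expression, as the keyword
# acquisition module might do.
import re

message = "2021年6月2日上午7点45分, G109, 北京南 至 上海虹桥"

# Train numbers like "G109" or flight numbers like "CA1234": letters then digits.
keyword_match = re.search(r"\b[A-Z]{1,2}\d{2,4}\b", message)
target_keyword = keyword_match.group(0) if keyword_match else None

print(target_keyword)   # -> "G109"
# Departure time and station would be parsed with patterns matched to the sender's
# known message template; the pattern above is only an example.
```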
The storage module 126 can be used to store the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120 (for example, the identification information and address of the first user terminal 110), the target speech recognition model received from the server 130, and a request identifier. The request identifier can serve as a unique identifier of a request that asks the server for a target speech recognition model. When the server 130 sends the target speech recognition model by broadcast, the second user terminal 120 can use the request identifier to determine whether the target speech recognition model was requested by itself and thus whether to receive it.
The first user terminal 110 includes a transmission communication module 112, a speech recognition model 114 and a prompt module 116. The transmission communication module 112 is used to send data to and receive data from the second user terminal 120, for example communicating with the second user terminal 120 through Bluetooth, near field communication or infrared. When the first user terminal is capable of communicating directly with the server 130, the transmission communication module 112 is also used to communicate with the server 130, for example through a cellular network or WiFi.
The speech recognition model 114 is generated based on one or more target keywords and can be updated according to a target speech recognition model for a new target keyword received from the server 130. For example, the speech recognition model 114 may be configured to recognize multiple keywords, monitoring at run time whether the ambient sound contains these target keywords. Updating the speech recognition model enables the updated speech recognition model 114 to monitor whether the ambient sound contains the new target keyword, for example by adding the new target keyword, or by replacing one of the existing target keywords, for example the one that has been present the longest, with the new target keyword. When the target keyword is detected, the updated speech recognition model 114 can trigger the prompt module 116 to generate prompt information. The prompt module 116 can cause the first user terminal 110 or the second user terminal to issue an audible or visual prompt.
The server 130 includes a transmission communication module 132, a speech recognition model construction module 134, an offline acoustic model training module 136 and a model library 138. In the server 130, the transmission communication module 132 is configured to receive the target keyword transmitted by the keyword acquisition module 124 and forward it to the speech recognition model construction module 134. The speech recognition model construction module 134 is configured to construct a customized target speech recognition model from the received target keyword and the model library 138, and to transmit the constructed target speech recognition model to the first user terminal 110 or the second user terminal 120.
The offline acoustic model training module 136 is configured to pre-train the acoustic model offline, following the training criteria for speech recognition acoustic models and a robust acoustic model training method. The trained acoustic model can be stored in the model library 138. Note that training the acoustic model can be performed offline and is therefore decoupled from the construction process of the speech recognition model construction module 134. According to embodiments of the present disclosure, the acoustic model can be designed for high-noise, strongly reverberant environments, for example based on fused features, to achieve more accurate speech recognition.
The model library 138 is configured to store trained models, including acoustic models trained offline on demand (obtained through the offline acoustic model training module 136 described above), pronunciation dictionaries, language models and the like. These models can all be trained offline and are used by the speech recognition model construction module 134 to construct the target speech recognition model for the target keyword.
The speech recognition model construction module 134 can be configured to combine the pre-trained acoustic model, pronunciation dictionary and language model in the model library 138 with the target keyword transmitted by the transmission communication module 132, and to generate the target speech recognition model according to a keyword recognition model construction algorithm. Note that the process of constructing the target speech recognition model has no strong dependence on the offline acoustic model training operation and can be performed asynchronously. Therefore, the speech recognition model construction module 134 can obtain a pre-trained acoustic model from the model library 138 to construct the target speech recognition model.
Although the first user terminal 110 and the second user terminal 120 are shown in Fig. 2 as separate devices, they may also be implemented as the same device (as indicated by the dashed line in the figure) to realize the intelligent listening substitution solution according to embodiments of the present disclosure. In this case, the target keyword is obtained from a single user terminal, and the speech recognition model for the target keyword is deployed on that same user terminal.
Intelligent Voice Listening Substitution
Fig. 3 shows a schematic flowchart of a method 300 for voice listening substitution according to an embodiment of the present disclosure. The method 300 can be implemented on the user terminal 110 shown in Figs. 1 and 2. The user terminal 110 may be, for example, a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer that has a sensor capable of receiving sound, such as a microphone.
At block 310, the user terminal 110 obtains a target speech recognition model corresponding to a target keyword; the target speech recognition model is constructed by the server 130 according to the target keyword, and the target keyword is obtained according to the user's travel information. According to embodiments of the present disclosure, as described above, the user terminal 110 can receive the target speech recognition model from the connected user terminal 120 (for example, a smartphone) via a wireless connection such as Bluetooth. Alternatively, if the user terminal 110 is capable of communicating directly with the server 130, the user terminal 110 can receive the target speech recognition model directly from the server 130.
As mentioned above, the user's travel information indicates that the user will go to a transportation venue such as an airport or a railway station to travel by plane or train, or to pick up or see off someone there. Travel information usually includes the flight number or train number, the location of the transportation venue, and the departure or arrival time. The target keyword in the travel information may be a character string representing the flight number or train number, usually composed of letters and digits. For example, the travel information may contain: "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109", the location is "Beijing South Railway Station", and the time is "7:45 a.m., June 2, 2021".
The target speech recognition model is constructed by the server 130 based on the received target keyword. In some embodiments, the target keyword can be obtained from another user terminal 120 connected to the user terminal 110 and sent to the server 130. For example, the user terminal 110 (for example, a noise-canceling headset) is connected to another user terminal 120 (for example, a smartphone) through Bluetooth or another short-range communication method. The user's travel information is obtained by accessing a travel application, short messages or another authorized application on the user terminal 120. The user terminal can send the target keyword in the travel information, such as the flight number or train number, to the server 130. The server 130 can then construct a target speech recognition model for recognizing the target keyword based on the received keyword and transmit the constructed model to the user terminal 110, as described below with reference to Fig. 4.
After the user terminal 110 receives the target speech recognition model from the server 130, at block 320 the local speech recognition model is updated according to the target speech recognition model to obtain an updated speech recognition model; the local speech recognition model is the speech recognition model stored in the user terminal. Before the update, the local speech recognition model 114 of the user terminal 110 can recognize one or more keywords, and only after the update can it recognize the target keyword. In some embodiments, the local speech recognition model and the target speech recognition model may be, for example, decoding graphs. A decoding graph is a set of decoding paths for the grammar constraint rules determined by the keywords to be recognized; the details of decoding graphs are described below in the section "Generation and Deployment of the Speech Recognition Model" and are not elaborated here. The decoding path of the target speech recognition model for the target keyword is added to the local speech recognition model, so that the local speech recognition model is updated and can then recognize the target keyword. Alternatively, in view of model size constraints, an existing decoding path in the local speech recognition model, for example the decoding path of the keyword that has been present in the local model the longest, can be replaced with the decoding path of the target speech recognition model for the target keyword.
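A minimal sketch of such an update policy, using hypothetical data structures rather than an actual decoding graph, is shown below: the keyword's decoding path is added to the local set, and when a size limit is reached the keyword that has been present the longest is evicted. The limit of five keywords is an assumption for the example.

```python
# Minimal sketch (hypothetical data structures): maintaining the keyword decoding
# paths of the local model, adding a new keyword's path and evicting the oldest
# keyword when a size limit is reached.
from collections import OrderedDict

MAX_KEYWORDS = 5   # assumed limit, not prescribed by the disclosure

class LocalRecognitionModel:
    def __init__(self):
        # keyword -> its decoding path (kept in insertion order, oldest first)
        self.paths = OrderedDict()

    def update(self, keyword: str, decoding_path: list) -> None:
        if keyword in self.paths:
            self.paths.move_to_end(keyword)
        elif len(self.paths) >= MAX_KEYWORDS:
            self.paths.popitem(last=False)   # drop the keyword present the longest
        self.paths[keyword] = decoding_path

model = LocalRecognitionModel()
model.update("G109", ["G", "1", "0", "9"])
```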
It should be understood that if the user terminal 110 has no local speech recognition model, the target speech recognition model can be deployed directly as the local speech recognition model. In this case, the local speech recognition model is dedicated to recognizing the corresponding target keyword and can be updated later.
At block 330, it is determined whether the user terminal 110 satisfies a target condition. If the target condition is satisfied, then at block 340 the collected ambient sound is recognized according to the updated speech recognition model to obtain a recognition result. In other words, the updated speech recognition model is triggered to monitor the announcements in the external environment only under appropriate conditions. Since the speech recognition model 114 may have existed on the user terminal 110 for some time, there is no need to start monitoring the ambient sound immediately. Allowing the local speech recognition model to run only when certain target conditions are satisfied matches the user's real listening needs and also saves the user terminal's computing resources and power.
In some embodiments, the target condition may be that the user's location matches the location information in the travel information. As described above, besides the target keyword, travel information usually also includes location information. For example, if the travel information contains "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao", then "Beijing South Railway Station" is taken as the location information. When the user's location matches "Beijing South Railway Station", for example when the user is determined to be in or near Beijing South Railway Station according to the GPS information or other positioning information of the user terminal, the updated speech recognition model is enabled to recognize the collected ambient sound. In this way, when the geographic condition is satisfied, the updated speech recognition model can be used automatically to recognize keywords in the ambient sound without user interaction, providing a better user experience.
In some embodiments, the target condition may also be that the current time is within a predetermined period before the time information; when it is, the collected ambient sound is recognized according to the updated speech recognition model. Taking the travel information of the above example again, the time information is "7:45 a.m., June 2, 2021". For example, when the current time is within half an hour, one hour or some other period before "7:45 a.m., June 2, 2021", the updated speech recognition model is used to recognize the collected ambient sound, since the target keyword the user wants to listen for is usually broadcast in the airport or station during such periods. In this way, when the time condition is satisfied, the updated speech recognition model can be used automatically to recognize keywords in the ambient sound without user interaction, providing a better user experience.
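Purely as an illustration, the sketch below checks the two target conditions described above: whether the user is within an assumed distance of the departure station and whether the current time falls within an assumed window before departure. The distance threshold, the one-hour window and the coordinates are example values, not values prescribed by this disclosure.

```python
# Illustrative sketch (assumed coordinates and thresholds): deciding whether to
# start listening based on the location and time target conditions.
from datetime import datetime, timedelta
import math

def within_distance(user, station, max_km=1.0):
    """Rough great-circle distance check between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*user, *station))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a)) <= max_km

def should_listen(now, departure, user_pos, station_pos, window=timedelta(hours=1)):
    time_ok = departure - window <= now <= departure
    place_ok = within_distance(user_pos, station_pos)
    return time_ok or place_ok      # the conditions may be used alone or combined

departure = datetime(2021, 6, 2, 7, 45)
print(should_listen(datetime(2021, 6, 2, 7, 0), departure,
                    (39.865, 116.378), (39.865, 116.378)))   # -> True
```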
The user's location information and time information may be provided by the user terminal 110 itself, or obtained from another device, for example from another user terminal 120 connected to the user terminal 110. In addition, execution of the speech recognition model may be triggered by the user terminal 110 itself or by another terminal, for example the user terminal 120 (for example, by sending a trigger signal over the Bluetooth connection). In some embodiments, the above target conditions for triggering the speech recognition model can be used individually or in combination.
Alternatively, execution of the speech recognition model may be triggered manually by the user, for example via a button. In particular, when monitoring is triggered manually, the button may be provided on the user terminal 110 acting as the intelligent listening device, on the other user terminal 120, or in an application on the user terminal 110 or 120.
In some embodiments, the speech recognition model 114 of the user terminal 110 can recognize multiple keywords. In this case, the user may select some or all of them for recognition, or the most recently updated target keyword may be selected automatically.
At block 340, the collected ambient sound is recognized according to the updated speech recognition model to obtain a recognition result. To recognize the collected ambient sound, the microphone of the user terminal 110 is first turned on to start collecting the external ambient sound. The collected ambient sound is then recognized locally on the user terminal 110 by the speech recognition model, in real time or near real time. The collected ambient sound can be fed directly into the speech recognition model, which determines whether it corresponds to the text of the target keyword, for example via the decoding paths of the decoding graph. The collected ambient sound can also be buffered on the user terminal 110 and then read by the speech recognition model; the buffered sound may cover, for example, about 10, 20, 30 or more seconds. Over time, if the target keyword is not recognized, the buffered ambient sound can be gradually removed or overwritten.
The initial value of the recognition result can be set to "no". According to embodiments of the present disclosure, the ambient sound can be fed into the speech recognition model frame by frame in chronological order. The speech recognition model determines whether these speech frames correspond to the target keyword: if they match completely, it determines that the target keyword has been recognized; otherwise it determines that the target keyword has not been recognized and restarts monitoring. For example, when the target keyword is "G109", if the speech in the ambient sound contains "G107", then "G", "1", "0", "7" will be recognized in sequence. Before "7" is recognized, the speech recognition model determines, step by step, that the ambient sound matches the leading part of the target keyword (because "G", "1", "0" agree with the target keyword). However, as soon as "7", which does not match the "9" in the target keyword, is recognized, the speech recognition model immediately restarts monitoring and discards the already recognized content "G", "1", "0". In some embodiments, once speech that does not match the keyword is recognized, the associated buffered data can be deleted and monitoring restarted. In fact, whenever the first character of the speech in the ambient sound is not the first character of the target keyword, monitoring is restarted. According to embodiments of the present disclosure, when the complete target keyword is detected, the recognition result can be set to "yes".
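The matching behaviour described above can be sketched as follows, under the simplifying assumption that the model emits one recognized character at a time: progress is kept while the decoded characters match a prefix of the target keyword, the state is reset on any mismatch, and the result becomes "yes" only when the complete keyword has been matched.

```python
# Minimal sketch (simplified: one recognized character per step) of the matching
# behaviour described above.
def listen_for_keyword(decoded_chars, keyword="G109"):
    matched = 0                           # how many keyword characters matched so far
    for ch in decoded_chars:
        if ch == keyword[matched]:
            matched += 1
            if matched == len(keyword):
                return True               # complete keyword detected -> result "yes"
        else:
            # restart; the current character may itself start a new match
            matched = 1 if ch == keyword[0] else 0
    return False

print(listen_for_keyword(list("xxG107xxG109")))   # -> True once "G109" appears
```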
At block 350, it is determined whether speech of the target keyword is present. If the recognition result is "no", it is determined that no speech of the target keyword is present, and monitoring of the ambient sound continues. If the recognition result is "yes", the method proceeds to block 360.
At block 360, the user device 110 prompts the user. The form of the prompt may depend on the capabilities of the user terminal and the user configuration. In some embodiments, the prompt may include, but is not limited to, one or more of text, image, audio and video. For example, when the user terminal 110 is a device with a speaker, in response to detecting that the ambient sound contains the target keyword, the prompt may be playing a designated alert tone, a specific recording, or speech corresponding to the target keyword. When the user terminal is a device with a screen, the prompt may be a pop-up card, a banner, or the like. When the user terminal 110 has both a speaker and a screen, the notification may be any one or a combination of the above. Through these various types of reminders, intelligent listening substitution on the user terminal is realized.
In some embodiments, the user terminal 110 can also provide the prompt to another connected user terminal 120, for example via the Bluetooth communication protocol between the user terminal 110 and the user terminal 120. In this way, the notification can be presented on the user terminal where the speech recognition model is deployed or on another device, achieving a better notification effect.
The above describes the user terminal 110 as the intelligent listening device, but it should be understood that the intelligent listening function can also be implemented on another user terminal (for example, the user terminal 120). In this case, the user terminal 120 sends the target keyword to the server 130, receives the speech recognition model from the server 130, and uses the speech recognition model to monitor the speech content in the environment, without forwarding the speech recognition model to the user terminal 110.
Through the embodiments described above, the target keyword of the travel information can be detected in the ambient sound of a public transportation venue and the user can be reminded, thereby realizing an intelligent listening function in which the device takes the place of the human ear.
Generation and Deployment of the Speech Recognition Model
As described above, the speech recognition model according to embodiments of the present disclosure is a lightweight model deployed on user terminals with limited computing resources. Moreover, this speech recognition model is customized by the user and targeted at a specific target keyword. The process of generating and deploying the speech recognition model according to embodiments of the present disclosure is further described below with reference to Fig. 4.
According to embodiments of the present disclosure, the server 130 constructs the speech recognition model for recognizing the target keyword and deploys it on either of the user terminals 110 and 120, such as a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer. The user terminals 110 and 120 can use the speech recognition model to recognize whether speech containing the keyword is played in the surrounding environment, especially in a high-noise environment.
Fig. 4 shows a schematic diagram of an example process 400 of building and deploying a speech recognition model according to an embodiment of the present disclosure. Fig. 4 shows an example of deploying the speech recognition model on the first user terminal 110 shown in Figs. 1 and 2, where the first user terminal 110 is connected to the second user terminal 120 via short-range communication such as Bluetooth. It should be understood that the speech recognition model may be deployed on the second user terminal 120, or, when there is only one user terminal, on that single terminal, without departing from the scope of the embodiments of the present disclosure.
When establishing a connection with the second user terminal, the first user terminal 110 can send its own identification information to the second user terminal 120. The second user terminal 120 can store this identification information locally so as to subsequently transmit data to the first user terminal 110, such as the target speech recognition model or other information.
As shown in the figure, the second user terminal 120 can obtain (410) the target keyword that the user wants to recognize. The target keyword text can be a keyword in the user's travel information, such as the flight number or train number of the flight or train the user is going to take. For example, the travel information may contain: "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109". In some embodiments, the keyword in the travel information can be extracted through a compliant scheme (for example, a designated application authorized by the user, such as a travel application or the short message service), or the target keyword can be obtained by accessing short messages from a designated sender (for example, an airline or a train operator).
According to embodiments of the present disclosure, the target keyword can be obtained automatically without being entered manually by the user. For example, if the second user terminal 120 is a smartphone, then after authorization, the target keyword can be extracted from short messages or messages of a designated sender (for example, a transport operator) by accessing the smartphone's short messages or the messages of a designated application. It should be understood that a short message or message containing a flight number or train number may also contain departure time information. In some embodiments, the keyword text can also be obtained on the basis of such time information. For example, the nearest upcoming flight number or train number can be taken as the target keyword. Alternatively, the flight numbers or train numbers within a preset time period from the current moment (for example, one day) can be taken as the keyword text.
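As a small illustration with hypothetical trip records, the sketch below selects either the nearest upcoming flight or train number, or all numbers departing within a preset window such as one day, as candidate target keywords.

```python
# Small sketch (hypothetical trip records): choosing which trip's number becomes
# the target keyword.
from datetime import datetime, timedelta

trips = [  # (keyword, departure time) parsed earlier from messages
    ("G109", datetime(2021, 6, 2, 7, 45)),
    ("CA1833", datetime(2021, 6, 3, 13, 20)),
]

def nearest_keyword(trips, now):
    upcoming = [t for t in trips if t[1] >= now]
    return min(upcoming, key=lambda t: t[1])[0] if upcoming else None

def keywords_within(trips, now, window=timedelta(days=1)):
    return [k for k, dep in trips if now <= dep <= now + window]

now = datetime(2021, 6, 2, 6, 0)
print(nearest_keyword(trips, now))    # -> "G109"
print(keywords_within(trips, now))    # -> ["G109"]
```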
然后,第二用户终端120可以向服务器130请求420针对目标关键词的语音识别模型。第二用户终端120可以通过蜂窝网络或无线局域网(例如WiFi)等向服务器130发送包括目标关键词的请求。Then, the second user terminal 120 may request 420 the speech recognition model for the target keyword from the server 130 . The second user terminal 120 may send a request including the target keyword to the server 130 through a cellular network or a wireless local area network (such as WiFi).
在一些实施例中,请求还可以包括第二用户终端120的标识符(例如,IMSI、IMEI或其他唯一标识符)以及第二用户终端的当前连接信息,包括但不限于蓝牙连接信息(例如蓝牙地址、设备标识等)、无线局域网连接信息(例如,无线接入点地址、设备标识等)等。这些信息可以用于建立服务器130与第二用户终端120或第一用户终端110之间的点对点连接。In some embodiments, the request may also include an identifier of the second user terminal 120 (e.g., IMSI, IMEI, or other unique identifier) and current connection information of the second user terminal, including but not limited to Bluetooth connection information (e.g., Bluetooth address, device identification, etc.), wireless local area network connection information (for example, wireless access point address, device identification, etc.), etc. Such information may be used to establish a point-to-point connection between the server 130 and the second user terminal 120 or the first user terminal 110 .
备选地,请求还包括可以唯一标识该请求的请求标识。请求标识可以由第二用户终端使用任何合适的方式来生成,例如,可以根据第二设备的设备标识(例如,IMSI、IMEI等)或其他唯一性标识、与第二用户终端120连接的第一用户终端110的连接信息、时间戳等中一项或多项来生成该请求标识。请求标识可以用于服务器130以广播方式传输构建的语音识别模型。为此,第二用户终端120可以在本地创建和维护一个映射表。映射表中包括相关联存储的第二用户终端120的设备标识、与第二用户终端120连接的第一用户终端110的连接信息、以及所生成的请求标识。Alternatively, the request further includes a request identifier that can uniquely identify the request. The request identifier may be generated by the second user terminal in any suitable manner, for example, based on one or more of the device identifier of the second user terminal 120 (for example, IMSI, IMEI, etc.) or another unique identifier, the connection information of the first user terminal 110 connected to the second user terminal 120, a timestamp, and the like. The request identifier can be used by the server 130 to transmit the constructed speech recognition model in a broadcast manner. To this end, the second user terminal 120 may locally create and maintain a mapping table. The mapping table stores, in association, the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120, and the generated request identifier.
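A minimal sketch of how such a request identifier and the local mapping table might be realized is given below; the hashing scheme, field names and example values are illustrative assumptions rather than the specific implementation of the embodiments.

```python
import hashlib
import time

def make_request_id(device_id: str, peer_connection_info: str) -> str:
    """Hash the device identifier, peer connection info and a timestamp into a request id."""
    raw = f"{device_id}|{peer_connection_info}|{time.time_ns()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Local mapping table kept on the second user terminal:
# request id -> (own device identifier, connection info of the first user terminal).
mapping_table: dict = {}

request_id = make_request_id("866123456789012", "AA:BB:CC:DD:EE:FF")
mapping_table[request_id] = ("866123456789012", "AA:BB:CC:DD:EE:FF")
```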
服务器130接收第二用户终端120的请求,并且基于请求中的目标关键词来构建430针对该目标关键词的语音识别模型。根据本公开的实施例,所构建的语音识别模型是轻量级解码图,解码图是目标关键词确定的语法约束规则的解码路径集合。服务器130例如基于HCLG(HMM+Context+Lexicon+Grammar)解码图构建过程来生成解码图。The server 130 receives the request of the second user terminal 120, and builds 430 a speech recognition model for the target keyword based on the target keyword in the request. According to the embodiments of the present disclosure, the constructed speech recognition model is a lightweight decoding graph, and the decoding graph is a set of decoding paths under the grammar constraint rules determined by the target keyword. The server 130 generates the decoding graph based on, for example, an HCLG (HMM+Context+Lexicon+Grammar) decoding graph construction process.
在一些实施例中,服务器130基于语法规则和文法规则(例如,JSpeech Grammar Format,简称为"JSGF"文法规则)、n-gram统计规则等构建针对该关键词的特定轻量级语言模型,即目标语言模型(G.fst)。区别于传统语言模型构建,依赖大规模海量数据的训练文本,让机器尽可能充分地学习所有满足自然语言逻辑的字、词、句子、段落间的关系,从而让语言模型近乎全覆盖的包含所有学习单元(字、词、句子、段落)间的转移概率和连接权重,服务器130仅根据目标关键词来约束目标关键词的字与字之间的转移概率和连接权重,而忽略其他学习单元间的关系和连接,进而将目标语言模型定制为只符合该目标关键词文法约束规范的参数集合,以保证对该目标关键词有识别能力。例如,将目标关键词的字组合确定为具有更高的出现概率,而将其他非目标关键词的组合出现概率置为0。In some embodiments, the server 130 builds a specific lightweight language model for the keyword, namely the target language model (G.fst), based on syntax rules and grammar rules (for example, the JSpeech Grammar Format, or "JSGF", grammar rules), n-gram statistical rules, and so on. Unlike conventional language model construction, which relies on training text from large-scale massive data so that the machine learns, as fully as possible, the relationships among all characters, words, sentences and paragraphs that satisfy natural-language logic, and thus makes the language model cover nearly all transition probabilities and connection weights among the learning units (characters, words, sentences, paragraphs), the server 130 only constrains the character-to-character transition probabilities and connection weights of the target keyword according to the target keyword, and ignores the relationships and connections among other learning units, thereby customizing the target language model into a parameter set that conforms only to the grammar constraint specification of the target keyword, so as to guarantee the ability to recognize the target keyword. For example, the character combinations of the target keyword are determined to have a higher occurrence probability, while the occurrence probability of other, non-target-keyword combinations is set to 0.
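To make the constraint concrete, a keyword-only grammar in JSGF syntax might be generated roughly as follows; this is only a sketch of the idea of restricting the language model to the target keyword's character sequence, and the file name and rule layout are assumptions for illustration.

```python
# Build a keyword-only grammar in JSGF syntax; any word sequence outside this
# single rule is simply absent from the grammar and therefore carries no weight.
target_keyword = "G109"

jsgf_grammar = (
    "#JSGF V1.0;\n"
    "grammar keyword;\n"
    f"public <keyword> = {' '.join(target_keyword)};\n"
)
# For "G109" the rule expands to the character sequence "G 1 0 9",
# mirroring the character-level transition constraint described above.
with open("keyword.gram", "w", encoding="utf-8") as f:
    f.write(jsgf_grammar)
```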
然后,根据该目标关键词从存储于模型库138的发音词典里选择出特定发音序列,结合发音词典里的音素描述文件构建目标发音词典模型(L.fst),由于该发音序列是根据目标关键词检索得到,相比于原始发音词典,规模也大大减小。另外,服务器130还通过离线训练得到声学模型,例如HMM模型(H.fst)。Then, a specific pronunciation sequence is selected, according to the target keyword, from the pronunciation dictionary stored in the model library 138, and the target pronunciation dictionary model (L.fst) is constructed in combination with the phoneme description file in the pronunciation dictionary. Since the pronunciation sequence is retrieved according to the target keyword, its scale is also greatly reduced compared with the original pronunciation dictionary. In addition, the server 130 also obtains an acoustic model, such as an HMM model (H.fst), through offline training.
服务器130对目标语言模型、目标发音字典模型和声学模型进行模型合并,以获得语音识别模型。该语音识别模型使用了原始声学模型,根据目标关键词构建的轻量级目标语言模型以及由目标关键词检索到的轻量级发音字典模型,故构建所得语音识别模型具有轻量化结构,相比广义语音识别模型,该模型仅包含针对目标关键词的转移概率和连接权重,参数规模得到了极大的缩减。该语音识别模型可以为如上所述的解码图。具体地,服务器130合并上述构建的目标语言模型(G.fst)和发音字典模型(L.fst),生成合并后的发音词典和语言模型(LG.fst),接着合并由发音词典模型生成的上下文模型(C.fst)以生成CLG.fst,最后合并上述构建的HMM模型(H.fst)即可生成解码图模型(HCLG.fst),作为针对目标关键词的语音识别模型。The server 130 merges the target language model, the target pronunciation dictionary model and the acoustic model to obtain the speech recognition model. The speech recognition model uses the original acoustic model, the lightweight target language model constructed according to the target keyword, and the lightweight pronunciation dictionary model retrieved from the target keyword, so the constructed speech recognition model has a lightweight structure. Compared with a generalized speech recognition model, this model contains only the transition probabilities and connection weights for the target keyword, and the parameter scale is greatly reduced. The speech recognition model may be the decoding graph described above. Specifically, the server 130 merges the target language model (G.fst) and the pronunciation dictionary model (L.fst) constructed above to generate a merged pronunciation dictionary and language model (LG.fst), then merges the context model (C.fst) generated from the pronunciation dictionary model to generate CLG.fst, and finally merges the HMM model (H.fst) constructed above to generate the decoding graph model (HCLG.fst) as the speech recognition model for the target keyword.
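The composition chain itself follows the usual OpenFst-style recipe. The sketch below uses the pywrapfst bindings and assumes that H.fst, C.fst, L.fst and G.fst have already been produced as described above; a production decoding-graph build (for example, a Kaldi-style pipeline) involves additional steps such as disambiguation symbols and weight pushing that are omitted here.

```python
import pywrapfst as fst

# Load the component transducers built for the target keyword.
H = fst.Fst.read("H.fst")   # HMM topology
C = fst.Fst.read("C.fst")   # context dependency
L = fst.Fst.read("L.fst")   # target pronunciation dictionary
G = fst.Fst.read("G.fst")   # keyword-only language model

def compose_det_min(a, b):
    """Compose two transducers, then determinize and minimize the result."""
    a.arcsort(sort_type="olabel")
    b.arcsort(sort_type="ilabel")
    out = fst.determinize(fst.compose(a, b))
    out.minimize()
    return out

LG = compose_det_min(L, G)
CLG = compose_det_min(C, LG)
HCLG = compose_det_min(H, CLG)
HCLG.write("HCLG.fst")      # lightweight decoding graph for the target keyword
```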
本公开的实施例提供了声学模型,其适用于高噪声、强混响的环境下对非人声的远场广播语音进行识别,能够显著提高语音识别的准确度。该声学模型下文将参照图5至图9描述。在一些实施例中,声学模型可以采用离线训练或在线的训练方式。此外,本公开不旨在对于发音字典、目标语言模型的类型或训练过程进行限定。Embodiments of the present disclosure provide an acoustic model, which is suitable for recognizing non-human voice far-field broadcast speech in an environment with high noise and strong reverberation, and can significantly improve the accuracy of speech recognition. The acoustic model will be described below with reference to FIGS. 5 to 9 . In some embodiments, the acoustic model can be trained offline or online. In addition, the present disclosure is not intended to limit pronunciation dictionaries, types of target language models, or training procedures.
然后,服务器130将构建好的目标语音识别模型传输440给第二用户终端120。Then, the server 130 transmits 440 the constructed target speech recognition model to the second user terminal 120 .
如上所述,服务器130可以通过点对点方式传输目标语音识别模型。在一些实施例中,服务器130根据请求420中包括的第二用户终端120的标识符,使用蜂窝或WiFi通信协议建立与第二用户终端之间的点对点连接,并且将目标语音识别模型传输440给第二用户终端。As mentioned above, the server 130 may transmit the target speech recognition model in a point-to-point manner. In some embodiments, the server 130 establishes a point-to-point connection with the second user terminal using a cellular or WiFi communication protocol according to the identifier of the second user terminal 120 included in the request 420, and transmits 440 the target speech recognition model to the second user terminal.
接下来,第二用户终端120根据本地的连接信息来确定450将用于部署语音识别模型的第一用户终端110。然后,第二用户终端120通过与第一用户终端110之间的连接将语音识别模型传输460至第一用户终端110。Next, the second user terminal 120 determines 450 the first user terminal 110 to deploy the speech recognition model according to the local connection information. Then, the second user terminal 120 transmits 460 the speech recognition model to the first user terminal 110 through the connection with the first user terminal 110 .
另外,服务器还可以通过广播方式传输目标语音识别模型。服务器130广播所构建的目标语音识别模型和相关联的请求标识。第二用户终端120可以将广播的请求标识与本地的映射表进行对比,来确定是否要接收语音识别模型。如果在映射表中找不到该请求标识,则不接收目标语音识别模型。如果找到该请求标识,则接收对应的目标语音识别模型。In addition, the server can also transmit the target speech recognition model through broadcasting. The server 130 broadcasts the constructed target speech recognition model and the associated request identifier. The second user terminal 120 may compare the broadcast request identifier with the local mapping table to determine whether to receive the speech recognition model. If the request identifier cannot be found in the mapping table, the target speech recognition model is not received. If the request identifier is found, the corresponding target speech recognition model is received.
第二用户终端120还可以根据请求标识来确定连接的第一用户终端110。第二用户终端120可以使用请求标识在映射表中查找与该请求标识对应的第一用户终端110的连接信息,例如第一用户终端110的标识信息等,从而确定450要接收目标语音识别模型的第一用户终端110。然后,第二用户终端120向所确定的第一用户终端110发送460目标语音识别模型。The second user terminal 120 may also determine the connected first user terminal 110 according to the request identifier. The second user terminal 120 can use the request identifier to look up, in the mapping table, the connection information of the first user terminal 110 corresponding to the request identifier, such as the identification information of the first user terminal 110, so as to determine 450 the first user terminal 110 that is to receive the target speech recognition model. Then, the second user terminal 120 sends 460 the target speech recognition model to the determined first user terminal 110.
在接收到语音识别模型后,第一用户终端110可以部署目标语音识别模型或者基于目标语音识别模型来更新本地的语音识别模型,在满足目标条件时开始执行470更新后的语音识别模型以监听环境声音中是否存在目标关键词,如上文参照图3描述的过程300。After receiving the speech recognition model, the first user terminal 110 may deploy the target speech recognition model or update the local speech recognition model based on the target speech recognition model, and, when the target condition is satisfied, start running 470 the updated speech recognition model to monitor whether the target keyword exists in the ambient sound, as in the process 300 described above with reference to FIG. 3.
图4描述了从服务器130经由第二用户终端120向第一用户终端110传输目标语音识别模型的过程。在一些实施例中,第一用户终端110可以具有与服务器130直接通信的能力。因此,还可以从服务器130直接向第一用户终端110传输目标语音识别模型。服务器130可以使用第二用户终端120上报的第一用户终端110的信息(例如,蓝牙连接信息、无线局域网连接信息等)来定位第一用户终端110,直接将目标语音识别模型传输到第一用户终端110。FIG. 4 describes the process of transmitting the target speech recognition model from the server 130 to the first user terminal 110 via the second user terminal 120. In some embodiments, the first user terminal 110 may have the capability of communicating directly with the server 130. Therefore, the target speech recognition model may also be transmitted directly from the server 130 to the first user terminal 110. The server 130 can locate the first user terminal 110 by using the information of the first user terminal 110 reported by the second user terminal 120 (for example, Bluetooth connection information, wireless local area network connection information, etc.), and transmit the target speech recognition model directly to the first user terminal 110.
此外,第二用户终端120也可以不向第一用户终端110传输接收到的语音识别模型,而是由自己来执行语音识别模型以实现语音代听功能。In addition, the second user terminal 120 may also not transmit the received speech recognition model to the first user terminal 110, but execute the speech recognition model by itself to realize the voice listening function.
声学模型 Acoustic model
根据本公开的实施例的针对目标关键词的语音识别模型被用于识别机场或火车站的环境声音中的广播语音。然而,识别这种环境声音是有挑战的。首先,机场广播通常距离用户的拾音设备过远,有较强混响干扰。其次,广播音基本都是根据固定模板合成的,与标准人声普通话有较大区别。最后,大厅中有其他旅客的交谈声等各式噪声,环境异常复杂。因此,希望提供一种利用用户终端在噪声环境下准确识别复杂背景噪声环境中的广播语音内容的方案。The speech recognition model for target keywords according to an embodiment of the present disclosure is used to recognize broadcast speech in the ambient sound of an airport or a train station. However, recognizing such ambient sound is challenging. First, the airport broadcast is usually too far away from the user's sound pickup device, resulting in strong reverberation interference. Second, broadcast sounds are basically synthesized according to fixed templates and differ considerably from standard spoken Mandarin. Finally, there are various noises such as the conversations of other passengers in the hall, and the environment is extremely complicated. Therefore, it is desirable to provide a solution in which a user terminal accurately recognizes broadcast speech content in a complex background noise environment.
本公开利用深度学习技术,通过离线训练来获得能够在诸如机场、火车站等高噪声、强混响环境下识别广播内容的声学模型。图5示出了根据本公开的实施例的用于生成声学模型的方法500的示意流程图。The present disclosure utilizes deep learning technology to obtain, through offline training, an acoustic model capable of recognizing broadcast content in environments with high noise and strong reverberation, such as airports and railway stations. FIG. 5 shows a schematic flowchart of a method 500 for generating an acoustic model according to an embodiment of the present disclosure.
方法500包括,在框510,在噪声场所采集声音数据。为了使声音适于检测噪声环境下的语音,从噪声环境采集声音数据以产生用于训练和构建声学模型的训练数据。 Method 500 includes, at block 510, collecting sound data in a noisy location. In order to adapt the sound to detect speech in a noisy environment, sound data is collected from the noisy environment to generate training data for training and building an acoustic model.
例如,可以使用各种类型的手机、具有录音功能的耳机、录音笔等设备在机场、火车站的多个位置处采集环境声音。声音采集地点可以包括但不限于柜台大厅、安检通道、候机厅、便利店、餐饮区域、公共卫生间等位置,以便覆盖用户能够到达的区域。具体地,可以根据采集位置所在区域(如航站楼)的大小,以一个位置覆盖半径为R(R>0)米的圆形面积为标准,设置若干个采集位置。声音采集方式可以是关闭录音设备的前端增益,连续不间断录音(例如,持续二十四小时),确保能在各个位置将不含广播音的背景噪声录制。在一些实施例中,可以采用静态录音,将声音采集设备固定并连续不间断录音。备选地,还可以采用动态录音,由人或机器持采集设备在噪声场所内移动,并连续不间断录音。此外,录音格式可以是例如wav格式、16kHz、16bit、多通道等,但不限于此。For example, various types of mobile phones, earphones with recording functions, recording pens and other equipment can be used to collect ambient sounds at multiple locations in airports and train stations. Sound collection locations may include, but are not limited to, counter halls, security check passages, waiting halls, convenience stores, dining areas, public restrooms, etc., so as to cover areas that users can reach. Specifically, according to the size of the area where the collection location is located (such as a terminal building), several collection locations can be set based on a circular area with a location coverage radius of R (R>0) meters as a standard. The way of sound collection can be to turn off the front-end gain of the recording device, and record continuously (for example, for 24 hours), so as to ensure that background noise without broadcast sound can be recorded at various locations. In some embodiments, static recording can be adopted, and the sound collection device is fixed and continuously and uninterruptedly recorded. Alternatively, dynamic recording can also be used, where a person or machine moves the acquisition device in a noisy place and records continuously and uninterruptedly. In addition, the recording format may be, for example, wav format, 16kHz, 16bit, multi-channel, etc., but is not limited thereto.
以上描述了获取语音数据和噪声数据的声学特征的示例性过程。声学特征可以按照上述方式来获取,也可以通过其他方式来获取,例如,访问已有的带噪语音特征或各种类型的已有噪声特征,而不需要专门现场采集。An exemplary process of acquiring acoustic features of voice data and noise data is described above. Acoustic features can be acquired in the manner described above, or in other ways, for example, by accessing existing noisy speech features or various types of existing noise features, without the need for dedicated on-site collection.
在框520,预处理声音数据,得到目标语音数据和噪声数据。根据本公开的实施例,由于连续不间断录音,采集到的原始声音数据在一部分时间段内包括广播语音,而其他时间段不包括广播语音。预处理可以包括手动或通过机器将原始声音数据划分为包括目标语音内容的音频数据和不包括目标语音内容的音频数据,并分别进行标注。在一些实施例中,目标语音数据被标注了该数据来自的位置信息以及该目标语音数据的文本,例如,包括航班号或车次。对于噪声数据,仅标注噪声数据的位置信息。In block 520, the voice data is preprocessed to obtain target voice data and noise data. According to the embodiments of the present disclosure, due to the continuous and uninterrupted recording, the collected original sound data includes broadcast voices in a part of time periods, while other time periods do not include broadcast voices. The preprocessing may include manually or by machine dividing the original sound data into audio data including the target speech content and audio data not including the target speech content, and marking them respectively. In some embodiments, the target voice data is marked with the location information from which the data comes and the text of the target voice data, for example, including flight number or train number. For noisy data, only the location information of the noisy data is marked.
在框530,提取语音数据和噪声数据的声学特征。可以通过对标注后的语音数据和噪声数据进行分帧、加窗、FFT等处理,来提取声学特征。在一些实施例中,声学特征可以通过例如梅尔频率倒谱系数(MFCC)来表示,但不限于此,其以10ms为一帧,每一帧具有对应的一组参数,每个参数具有0至1之间的值。也就是说,目标语音数据和噪声数据均可以被表示为持续一段时间的一系列帧,每一帧由一组值在0至1之间的参数来表征。At block 530, acoustic features of the speech data and the noise data are extracted. The acoustic features can be extracted by performing framing, windowing, FFT and other processing on the labeled speech data and noise data. In some embodiments, the acoustic features can be represented by, for example, Mel-frequency cepstral coefficients (MFCC), but are not limited thereto; each frame is 10 ms long, each frame has a corresponding set of parameters, and each parameter has a value between 0 and 1. That is, both the target speech data and the noise data can be represented as a series of frames lasting for a period of time, and each frame is characterized by a set of parameters with values between 0 and 1.
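As one possible illustration of this feature extraction step, the sketch below computes MFCC frame features with librosa and scales them into the 0 to 1 range; the 10 ms hop, the 40-dimensional feature size and the min-max normalization are assumptions chosen to match the description above rather than a prescribed configuration.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of MFCC features scaled into [0, 1]."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis window with a 10 ms hop (160 samples at 16 kHz).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160).T
    # Min-max normalize each coefficient into [0, 1], as assumed by the fusion step.
    mn, mx = mfcc.min(axis=0), mfcc.max(axis=0)
    return (mfcc - mn) / (mx - mn + 1e-8)
```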
目标语音数据经过分帧、加窗、FFT等处理提取到的声学特征是带噪声学特征。带噪声学特征可以被增强,得到尽可能纯净的语音声学特征,从而减少噪声给识别带来的不利影响。参照图6,其示出了根据本公开的实施例的用于增强语音声学特征的方法600的示意流程图。The acoustic features extracted from the target speech data through framing, windowing, FFT and other processing are noisy acoustic features. The noisy acoustic features can be enhanced to obtain speech acoustic features that are as clean as possible, thereby reducing the adverse effect of noise on recognition. Referring to FIG. 6, it shows a schematic flowchart of a method 600 for enhancing speech acoustic features according to an embodiment of the present disclosure.
在框610,对输入的带噪语音声学特征进行LASSO变换,以对声学特征进行混响抑制。混响是指,当声波在室内传播时被墙壁、天花板、地板等障碍物反射和吸收,在声源停止发射声波后,声波在室内经过多次反射和吸收,最后才会消失,这种声源停止发声后的声音仍然存在的现象称为混响。混响不利于准确识别语音中的内容。At block 610, a LASSO transform is performed on the input noisy speech acoustic features to suppress reverberation in the acoustic features. Reverberation refers to the phenomenon that sound waves propagating indoors are reflected and absorbed by obstacles such as walls, ceilings and floors, so that after the sound source stops emitting sound waves, the waves undergo multiple reflections and absorptions in the room before finally dying out, and the sound therefore persists after the source stops sounding. Reverberation is not conducive to accurately recognizing the content of speech.
LASSO变换也称为LASSO回归。通过限制声学特征中的重要变量(也就是系数不为0的变量)与其他变量的相关关系的条件,可以去除与混响有关的声学特征,从而抑制混响带来的不利影响。LASSO transformation is also known as LASSO regression. By limiting the conditions of the correlation between important variables in the acoustic features (that is, variables whose coefficients are not 0) and other variables, the acoustic features related to reverberation can be removed, thereby suppressing the adverse effects of reverberation.
在框620,针对混响抑制后的语音数据的声学特征进行bottleneck网络处理。bottleneck网络是一种神经网络模型,包括bottleneck层。bottleneck层相比于前面的层具有更少的节点数,其可以用于获取维度更少的输入表示。在一些实施例中,经过bottleneck网络处理的声学特征的维度可以减少,从而在训练期间获得更好的损失。bottleneck网络的系数可以是预先计算的,也可以在训练过程中更新。At block 620, bottleneck network processing is performed on the acoustic features of the reverberation-suppressed speech data. A bottleneck network is a neural network model that includes a bottleneck layer. The bottleneck layer has fewer nodes than the preceding layers and can be used to obtain a lower-dimensional representation of the input. In some embodiments, the dimensionality of the acoustic features processed by the bottleneck network can be reduced, resulting in a better loss during training. The coefficients of the bottleneck network can be precomputed or updated during training.
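A minimal sketch of such a bottleneck network, written in PyTorch, is shown below; the layer sizes are arbitrary placeholders, and only the general idea of squeezing the features through a narrower layer to obtain a lower-dimensional representation is taken from the description above.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Squeezes enhanced features through a narrow layer to reduce their dimensionality."""

    def __init__(self, in_dim: int = 40, hidden: int = 256, bottleneck: int = 24):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),  # the narrow bottleneck layer
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, in_dim) -> (num_frames, bottleneck)
        return self.encoder(frames)
```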
通过如图6所示的语音增强600,带有背景噪声的语音声学特征被转换为尽可能纯净的语音特征。进一步地,纯净语音特征可以与来源于多个位置的噪声特征融合以生成融合特征。Through speech enhancement 600 as shown in FIG. 6 , speech acoustic features with background noise are transformed into speech features that are as pure as possible. Furthermore, clean speech features can be fused with noise features from multiple locations to generate fused features.
返回图5,在框540,根据语音声学特征和噪声声学特征来生成融合特征。融合特征能够减少在不同场所或同一场所的不同位置处的背景噪声的类型差别、大小差别等对识别准确率的影响。根据本公开的实施例,通过将语音特征和噪声特征逐帧对齐来生成融合特征。Returning to FIG. 5 , at block 540 , fusion features are generated from the speech acoustic features and the noise acoustic features. Fusion features can reduce the impact of the type and size differences of background noise in different places or different positions in the same place on the recognition accuracy. According to an embodiment of the present disclosure, fusion features are generated by aligning speech features and noise features frame by frame.
图7示出了根据本公开的实施例的用于生成融合特征的方法700的示意概念图。如图所示,从原始数据划分得到的目标语音数据经过特征提取710、语音增强720之后产生增强语音特征。并且,噪声数据经过均匀采样后得到在多个位置(例如位置1至位置N)的采样噪声。类似地,对这些来自多个位置处的采样噪声进行特征提取710,以产生噪声特征。特征提取710可以按照参照框530描述的处理来执行,包括分帧、加窗、FFT等处理。根据本公开的实施例,语音数据的声学特征和噪声数据的声学特征可以具有相同的帧大小,例如均为10ms,以便可以逐帧融合。FIG. 7 shows a schematic conceptual diagram of a method 700 for generating fusion features according to an embodiment of the present disclosure. As shown in the figure, the target speech data divided from the original data undergoes feature extraction 710 and speech enhancement 720 to generate enhanced speech features. Moreover, the noise data is uniformly sampled to obtain sampling noise at multiple positions (for example, position 1 to position N). Similarly, feature extraction 710 is performed on these sampled noises from multiple locations to generate noise features. Feature extraction 710 may be performed as described with reference to block 530, including framing, windowing, FFT, and the like. According to an embodiment of the present disclosure, the acoustic features of the speech data and the acoustic features of the noise data may have the same frame size, for example, both are 10 ms, so that they can be fused frame by frame.
如上所述,增强的语音声学特征和噪声特征具有相同大小的帧,例如10ms,因此语音特征和噪声特征的逐帧对齐可以产生时间对齐的融合特征。具体地,可以逐帧将所有采样得到的噪声特征(例如来源于位置1至N的噪声特征)叠加到增强后的语音特征上来形成融合特征。如上所述,每一帧由一组值在0至1之间的参数,即向量来表征,叠加是指通过向量加法将语音声学特征和噪声特征的对应参数相加。例如,在语音声学特征和噪声声学特征中的每一帧均由40维向量表示的情况下,融合特征中的一个帧同样由对应的40维向量来表示。As mentioned above, the enhanced speech acoustic features and noise features have the same frame size, say 10ms, so the frame-by-frame alignment of speech features and noise features can produce temporally aligned fused features. Specifically, all sampled noise features (for example, noise features from positions 1 to N) can be superimposed on the enhanced speech features frame by frame to form fusion features. As mentioned above, each frame is characterized by a set of parameters with values between 0 and 1, namely vectors, and superposition refers to adding the corresponding parameters of speech acoustic features and noise features through vector addition. For example, in the case that each frame in the speech acoustic feature and the noise acoustic feature is represented by a 40-dimensional vector, a frame in the fusion feature is also represented by a corresponding 40-dimensional vector.
应理解,叠加后的参数的值可能超出了0至1的范围。在这种情况下,可以进行全局归一化处理,以便使得融合特征的参数的值仍然在0至1的范围内。It should be understood that the value of the superimposed parameter may exceed the range of 0 to 1. In this case, a global normalization process can be performed so that the value of the parameter of the fusion feature is still in the range of 0 to 1.
在一些情况下,语音数据的时长可能不同于噪声数据的时长,并且各个位置的噪声数据的时长也可能不同。因此,特征融合还包括语音数据和噪声数据的对齐。In some cases, the duration of speech data may be different from that of noise data, and the duration of noise data may also be different for each location. Therefore, feature fusion also includes the alignment of speech data and noise data.
图8示出了根据本公开的实施例的特征融合过程800的示意图。图8中用于特征融合的增强的语音声学特征810和来源于多个位置的噪声特征820-1、820-2、……820-N(统称为820),按照帧序列被示出。在图8的增强的语音声学特征810包括L个帧。由于语音声学特征810和噪声声学特征820的持续时间可以不同,噪声特征820可以包括与L相同或不同的帧数。例如,噪声特征820-N可以包括例如R个帧。FIG. 8 shows a schematic diagram of a feature fusion process 800 according to an embodiment of the present disclosure. Enhanced speech acoustic features 810 for feature fusion and noise features 820-1, 820-2, ... 820-N (collectively 820) from multiple locations in FIG. 8 are shown in a sequence of frames. The enhanced speech acoustic features 810 in FIG. 8 include L frames. Since the duration of speech acoustic feature 810 and noise acoustic feature 820 may be different, noise feature 820 may include the same or a different number of frames than L. For example, noise signature 820-N may include, for example, R frames.
在一些实施例中,可以根据语音声学特征810的持续时间来调整噪声声学特征820,例如,通过选择噪声声学特征的一部分帧或者扩展噪声声学特征的帧,得到帧数(或持续时间)与语音声学特征相同的经调整的噪声声学特征。在二者对齐之后,叠加语音声学特征和经调整的噪声声学特征。In some embodiments, the noise acoustic feature 820 can be adjusted according to the duration of the speech acoustic feature 810, for example, by selecting a part of the frames of the noise acoustic feature or extending the frames of the noise acoustic feature, to obtain an adjusted noise acoustic feature whose number of frames (or duration) is the same as that of the speech acoustic feature. After the two are aligned, the speech acoustic feature and the adjusted noise acoustic feature are superimposed.
具体地,如果增强的语音声学特征810的帧数和噪声声学特征820的帧数相同(L=R),则逐帧地叠加语音声学特征810和噪声声学特征820。Specifically, if the number of frames of the enhanced speech acoustic feature 810 and the noise acoustic feature 820 are the same (L=R), the speech acoustic feature 810 and the noise acoustic feature 820 are superimposed frame by frame.
如果增强的语音声学特征810的帧数小于噪声声学特征820的帧数(L<R),则可以选择噪声声学特征820的前L帧来与增强的语音声学特征叠加,后R-L帧舍去不做处理。应理解,也可以选择噪声声学特征820中的后L帧、位于中间的L帧、或以任何其他方式选择的L帧来与语音声学特征810叠加。If the number of frames of the enhanced speech acoustic feature 810 is smaller than the number of frames of the noise acoustic feature 820 (L<R), the first L frames of the noise acoustic feature 820 can be selected to be superimposed with the enhanced speech acoustic feature, and the remaining R-L frames are discarded without further processing. It should be understood that the last L frames of the noise acoustic feature 820, the middle L frames, or L frames selected in any other way may also be selected to be superimposed with the speech acoustic feature 810.
如果增强的语音声学特征810的帧数大于噪声声学特征820的帧数(L>R),则可以将噪声声学特征820的第1帧叠加到增强后的语音声学特征的第L-R帧,第2帧叠加到L-R+1帧,以此类推,直到语音声学特征810的所有帧都被叠加噪声特征820的帧。例如,如图8所示,噪声特征820-N的帧数R小于语音声学特征的帧数,因此,其第1帧再一次地被叠加到语音声学特征的相应帧。应理解,图8仅是示意性的,语音声学特征和噪声特征的帧数不一定是图8所示的情况。If the number of frames of the enhanced speech acoustic feature 810 is greater than the number of frames of the noise acoustic feature 820 (L>R), the first frame of the noise acoustic feature 820 can be superimposed onto the (L-R)-th frame of the enhanced speech acoustic feature, the second frame onto the (L-R+1)-th frame, and so on, until all frames of the speech acoustic feature 810 have had frames of the noise feature 820 superimposed on them. For example, as shown in FIG. 8, the number of frames R of the noise feature 820-N is smaller than the number of frames of the speech acoustic feature, so its first frame is superimposed once again onto the corresponding frame of the speech acoustic feature. It should be understood that FIG. 8 is only schematic, and the numbers of frames of the speech acoustic features and the noise features are not necessarily those shown in FIG. 8.
按照上述方式,增强的语音声学特征810的第1帧与噪声特征820-1、820-2、…820-N的第1帧叠加,得到融合特征的第1帧,第2帧与噪声特征1、2、…N的第1帧叠加,得到融合特征830的第2帧,以此类推,生成了帧数为L的融合特征830。融合特征830被用于训练声学模型。In the manner described above, the first frame of the enhanced speech acoustic feature 810 is superimposed with the first frames of the noise features 820-1, 820-2, ..., 820-N to obtain the first frame of the fused feature, the second frame is superimposed with the frames of the noise features 1, 2, ..., N to obtain the second frame of the fused feature 830, and so on, generating a fused feature 830 with L frames. The fused feature 830 is used to train the acoustic model.
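Read literally, the alignment described above amounts to reusing the noise frames until every speech frame has been covered and then renormalizing the result. The numpy sketch below implements one plausible reading of FIGS. 7 and 8, in which shorter noise features are reused cyclically; the exact offset used in the embodiments may differ.

```python
import numpy as np

def fuse(speech: np.ndarray, noise_list: list) -> np.ndarray:
    """Superimpose every sampled noise feature onto the enhanced speech feature.

    speech: (L, D) enhanced speech frames; each entry of noise_list: (R_i, D) frames.
    """
    L = speech.shape[0]
    fused = speech.copy()
    for noise in noise_list:
        R = noise.shape[0]
        if R >= L:
            fused += noise[:L]                 # keep the first L frames, drop the rest
        else:
            idx = np.arange(L) % R             # reuse the shorter noise feature cyclically
            fused += noise[idx]
    # Global normalization so that the fused parameters stay within [0, 1].
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

fused = fuse(np.random.rand(500, 40), [np.random.rand(500, 40), np.random.rand(300, 40)])
```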
借助语音声学特征和噪声声学特征的这种融合方式,可以生成大量融合特征来作为声学模型的训练数据,并且所生成的融合特征能够真实地模拟特定真实噪声场所的环境声音,使得经过其训练的声学模型具有更高的准确率。With this way of fusing speech acoustic features and noise acoustic features, a large number of fused features can be generated as training data for the acoustic model, and the generated fused features can realistically simulate the ambient sound of a specific real noisy place, so that the acoustic model trained on them achieves higher accuracy.
以上描述了通过叠加目标语音数据的声学特征和噪声数据的声学特征来得到融合特征的过程。在另一些实施例中,可以对在框520得到的目标语音数据和噪声数据进行叠加来获取叠加后的音频数据;然后基于叠加后的音频数据获取融合声学特征。在这种情况下,针对目标语音数据和噪声数据的叠加同样可以基于帧数对齐的方式进行,并且提取融合声学特征也可以类似地进行。The foregoing describes a process of obtaining fused features by superimposing the acoustic features of the target speech data and the acoustic features of the noise data. In some other embodiments, the target speech data and the noise data obtained at block 520 may be superimposed to obtain superimposed audio data, and the fused acoustic features are then obtained based on the superimposed audio data. In this case, the superposition of the target speech data and the noise data can likewise be performed based on frame-number alignment, and the extraction of the fused acoustic features can be performed similarly.
返回图5,在框550,使用融合特征和语音数据的文本来训练声学模型。根据本公开的实施例,声学模型可以基于深度神经网络(DNN)架构。语音数据的文本是在步骤520标注的文本,例如,包括航班号或车次。在训练时,融合特征是声学模型的输入,而文本或者对应于文本的音素是对应于融合特征的标注数据。为了更好拾取机场/火车站等高噪声、强混响环境中的非人声广播音,声学模型使用多任务架构,包含声源标签的声源识别任务和语音标签的语音识别任务。Returning to FIG. 5, at block 550, the acoustic model is trained using the fused features and the text of the speech data. According to an embodiment of the present disclosure, the acoustic model may be based on a deep neural network (DNN) architecture. The text of the speech data is the text labeled in step 520, for example, including the flight number or train number. During training, the fused features are the input of the acoustic model, and the text, or the phonemes corresponding to the text, is the labeled data corresponding to the fused features. In order to better pick up non-human-voice broadcast sound in high-noise, strongly reverberant environments such as airports and train stations, the acoustic model uses a multi-task architecture, including a sound source recognition task with sound source labels and a speech recognition task with speech labels.
图9示出了根据本公开的实施例的用于训练声学模型的架构图。架构900包括深度神经网络910,深度神经网络910可以包括多个隐层912、914、916以及输入层和输出层(未示出)。深度神经网络910还可以包括更多或更少的隐层。FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure. The architecture 900 includes a deep neural network 910, and the deep neural network 910 may include a plurality of hidden layers 912, 914, 916 as well as an input layer and an output layer (not shown). The deep neural network 910 may also include more or fewer hidden layers.
根据本公开的实施例,可以对深度神经网络910进行多任务训练,具体地,修改深度神经网络910的训练目标,在语音识别标签的基础上增加另一个声纹识别标签作为训练目标。如图所示,可以从深度神经网络910的最后一个隐层916得到输出作为声源特征。然后,将融合特征与声源特征拼接在一起作为深度神经网络910的输入。例如,可以将Y维声源特征与X维融合特征拼接,形成X+Y维的训练特征,作为深度神经网络的输入。在训练过程中,每轮迭代都用前一轮生成的声源特征更新输入特征,直至最终训练结束。在一些实施例中,首轮迭代输入的声源特征可以被全部置0。According to an embodiment of the present disclosure, multi-task training can be performed on the deep neural network 910; specifically, the training target of the deep neural network 910 is modified by adding another voiceprint recognition label as a training target on top of the speech recognition label. As shown in the figure, the output of the last hidden layer 916 of the deep neural network 910 can be taken as the sound source feature. Then, the fused feature and the sound source feature are spliced together as the input of the deep neural network 910. For example, the Y-dimensional sound source feature can be spliced with the X-dimensional fused feature to form an (X+Y)-dimensional training feature as the input of the deep neural network. During training, each iteration updates the input features with the sound source features generated in the previous iteration, until training ends. In some embodiments, the sound source features input in the first iteration may all be set to 0.
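The splicing of the Y-dimensional sound source feature onto the X-dimensional fused feature can be pictured with the following PyTorch sketch, which only shows the input construction and the two task heads; all dimensions and layer sizes are arbitrary assumptions, and the real network in FIG. 9 may be organized differently.

```python
import torch
import torch.nn as nn

X, Y = 40, 32            # fused-feature dimension and assumed sound-source-feature dimension

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, num_phones: int = 200, num_sources: int = 10):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(X + Y, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, Y), nn.ReLU(),             # last hidden layer: sound source feature
        )
        self.speech_head = nn.Linear(Y, num_phones)   # speech recognition labels
        self.source_head = nn.Linear(Y, num_sources)  # voiceprint / sound source labels

    def forward(self, fused: torch.Tensor, source_feat: torch.Tensor):
        h = self.hidden(torch.cat([fused, source_feat], dim=-1))
        return self.speech_head(h), self.source_head(h), h  # h feeds the next iteration

model = MultiTaskAcousticModel()
fused = torch.rand(8, X)
# First training round: the sound source features are initialized to all zeros.
speech_logits, source_logits, next_source_feat = model(fused, torch.zeros(8, Y))
```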
由此,利用本公开的结合声纹特征的多任务学习,可以从深度神经网络中提取到广播语音的声源特征作为声学模型学习的补偿,从而更精准地拾取到非人声广播音。Thus, using the multi-task learning combined with voiceprint features of the present disclosure, the sound source features of broadcast voices can be extracted from the deep neural network as compensation for acoustic model learning, thereby more accurately picking up non-human voice broadcast voices.
返回图5,在框560,根据声学模型、发音字典和语言模型构建语音识别模型。在一些实施例中,构建语音识别模型的过程可以包括接收来自用户终端的目标关键词,生成针对目标关键词的目标语言模型和目标发音字典模型,通过合并目标语言模型、目标发音字典模型、以及声学模型来构建所述语音识别模型,更具体地,可以参照上文关于图4的描述。Returning to FIG. 5, at block 560, a speech recognition model is constructed based on the acoustic model, the pronunciation dictionary and the language model. In some embodiments, the process of constructing the speech recognition model may include receiving the target keyword from the user terminal, generating a target language model and a target pronunciation dictionary model for the target keyword, and constructing the speech recognition model by merging the target language model, the target pronunciation dictionary model and the acoustic model; more specifically, reference may be made to the above description of FIG. 4.
根据本公开的实施例,经过离线训练的声学模型可以被存储到服务器的模型库中。当服务器从用户终端接收到目标关键词时,可以利用该声学模型、以及模型库中的其他模型(例如发音字典、语言模型)来构建用于识别该目标关键词的语音识别模型。这种专用于特定关键词的语音识别模型是轻量级的,适合部署到用户设备或智能代听设备。According to an embodiment of the present disclosure, the offline trained acoustic model may be stored in the model library of the server. When the server receives the target keyword from the user terminal, the acoustic model and other models in the model library (such as pronunciation dictionary and language model) can be used to construct a speech recognition model for recognizing the target keyword. This speech recognition model dedicated to specific keywords is lightweight and suitable for deployment to user equipment or smart listening devices.
示例装置和设备 Example Apparatus and Equipment
图10示出了根据本公开的实施例的用于语音代听的装置1000的示意框图。装置1000可以应用于用户终端,例如第一用户终端110或第二用户装置120。装置1000包括模型获取单元1010,用于获取目标关键词对应的目标语音识别模型。目标语音识别模型为根据目标关键词构建的,目标关键词为根据用户的出行信息获取。装置1000还包括更新单元1020。更新单元用于根据目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,本地语音识别模型为用户终端中存储的语音识别模型。装置1000还包括声音识别单元1020。声音识别单元1020用于当满足目标条件时,根据更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,环境声音为在用户终端所处的环境中采集到的声音信息。装置1000还包括提示单元1030。提示单元1030用于当识别结果指示环境声音中存在目标关键词对应的语音时,对用户进行提示。Fig. 10 shows a schematic block diagram of an apparatus 1000 for voice listening substitution according to an embodiment of the present disclosure. The apparatus 1000 may be applied to a user terminal, such as the first user terminal 110 or the second user device 120. The apparatus 1000 includes a model acquiring unit 1010, configured to acquire a target speech recognition model corresponding to a target keyword. The target speech recognition model is constructed based on the target keyword, and the target keyword is obtained based on the user's travel information. The apparatus 1000 also includes an updating unit 1020. The updating unit is configured to update the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal. The apparatus 1000 also includes a sound recognition unit 1020. The sound recognition unit 1020 is configured to recognize the collected ambient sound according to the updated speech recognition model when the target condition is met, and obtain a recognition result, where the ambient sound is sound information collected in the environment where the user terminal is located. The apparatus 1000 also includes a prompting unit 1030. The prompting unit 1030 is configured to prompt the user when the recognition result indicates that there is speech corresponding to the target keyword in the ambient sound.
在一些实施例中,装置1000还包括目标关键词获取单元。目标关键词获取单元用于获取用户的出行信息中的目标关键词。装置1000还包括发送单元。发送单元用于向服务器发送出行信息中的目标关键词,以用于服务器根据目标关键词构建目标语音识别模型。模型获取单元1010还用于从服务器获取目标语音识别模型。In some embodiments, the device 1000 further includes a target keyword acquiring unit. The target keyword acquisition unit is used to acquire target keywords in the user's travel information. The device 1000 also includes a sending unit. The sending unit is used to send the target keywords in the travel information to the server, so that the server can construct a target speech recognition model according to the target keywords. The model obtaining unit 1010 is also used to obtain the target speech recognition model from the server.
在一些实施例中,用户终端是第一用户终端并且连接到第二用户终端,所述方法包括:向所述第二用户终端发送标识信息,所述标识信息用于标识所述第一用户终端;其中所述获取目标关键词对应的目标语音识别模型,具体为:基于所述标识信息从所述第二用户终端接收所述目标语音识别模型,所述目标语音识别模型为所述第二用户终端根据所述目标关键词从所述服务器获取;其中所述第一用户终端是音频播放设备。In some embodiments, the user terminal is a first user terminal and is connected to a second user terminal, and the method includes: sending identification information to the second user terminal, the identification information being used to identify the first user terminal ; Wherein the acquiring the target speech recognition model corresponding to the target keyword is specifically: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being the second user The terminal obtains from the server according to the target keyword; wherein the first user terminal is an audio playback device.
在一些实施例中,目标语音识别模型是基于声学模型、目标发音字典模型和目标语言模型而生成的解码图。解码图是由目标关键词确定的语法约束规则的解码路径集合。所述目标发音字典模型是基于所述目标关键词的发音序列而获取的,并且所述目标语言模型是基于所述目标关键词的字之间的关系而获取的。In some embodiments, the target speech recognition model is a decoding map generated based on an acoustic model, a target pronunciation dictionary model, and a target language model. A decoding graph is a collection of decoding paths of grammatically constrained rules determined by target keywords. The target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword.
在一些实施例中,声学模型通过如下方式生成:基于目标语音数据和噪声数据生成融合声学特征,目标语音数据为包括目标语音内容的音频数据,噪声数据为不包括所述目标语音内容的音频数据;通过融合特征和语音数据的文本信息进行训练来生成声学模型。In some embodiments, the acoustic model is generated in the following manner: fused acoustic features are generated based on target speech data and noise data, where the target speech data is audio data that includes the target speech content and the noise data is audio data that does not include the target speech content; and the acoustic model is generated by training with the fused features and the text information of the speech data.
在一些实施例中,出行信息具有关联的位置信息,其中声音识别单元1020还用于当用户的位置与出行信息的位置信息匹配时,根据更新后的语音识别模型对采集到的环境声音进行识别。In some embodiments, the travel information has associated location information, and the sound recognition unit 1020 is further configured to recognize the collected ambient sound according to the updated speech recognition model when the user's location matches the location information of the travel information.
在一些实施例中,出行信息还具有关联的时间信息,声音识别单元1020还用于当当前时间在时间信息之前的预定时间段内时,根据更新后的语音识别模型对采集到的环境声音进行识别。In some embodiments, the travel information also has associated time information, and the sound recognition unit 1020 is further configured to recognize the collected ambient sound according to the updated speech recognition model when the current time is within a predetermined period of time before the time information.
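As a purely illustrative sketch of such a target condition (a location match combined with a time window before departure), the following could be used; the two-hour window, the string comparison of locations and the helper name are assumptions for this example.

```python
from datetime import datetime, timedelta

def target_condition_met(now: datetime, departure: datetime,
                         user_location: str, travel_location: str,
                         window: timedelta = timedelta(hours=2)) -> bool:
    """Start listening only close to the departure time and at the matching place."""
    in_time_window = departure - window <= now <= departure
    at_location = user_location == travel_location
    return in_time_window and at_location

print(target_condition_met(datetime(2021, 6, 2, 6, 30), datetime(2021, 6, 2, 7, 45),
                           "Beijing South Railway Station", "Beijing South Railway Station"))
```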
在一些实施例中,提示单元1030还用于在用户终端上播放与目标关键词对应的语音。In some embodiments, the prompting unit 1030 is further configured to play the voice corresponding to the target keyword on the user terminal.
在一些实施例中,目标关键词是列车车次或航班号。In some embodiments, the target keyword is train number or flight number.
在一些实施例中,用户终端是智能手机、智能家电、可穿戴设备、音频播放设备、平板电脑和笔记本电脑之一。In some embodiments, the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a notebook computer.
图11示出了根据本公开的实施例的用于生成语音识别模型的装置1100的示意框图。装置1100可以用于例如服务器130。装置1100包括融合单元1110、训练单元1120和语音识别模型构建单元1130。融合单元1110用于基于目标语音数据和噪声数据生成融合声学特征。目标语音数据为包括目标语音内容的音频数据,噪声数据为不包括目标语音内容的音频数据。训练单元1120用于通过所述融合特征和所述语音数据的文本信息进行训练来生成声学模型。语音识别模型构建单元1130用于根据所述声学模型、发音字典和语言模型构建所述语音识别模型。Fig. 11 shows a schematic block diagram of an apparatus 1100 for generating a speech recognition model according to an embodiment of the present disclosure. Apparatus 1100 may be used in server 130, for example. The device 1100 includes a fusion unit 1110 , a training unit 1120 and a speech recognition model construction unit 1130 . The fusion unit 1110 is used to generate fusion acoustic features based on the target speech data and noise data. The target speech data is audio data including the target speech content, and the noise data is audio data not including the target speech content. The training unit 1120 is used for generating an acoustic model by performing training through the fusion feature and the text information of the speech data. The speech recognition model construction unit 1130 is configured to construct the speech recognition model according to the acoustic model, pronunciation dictionary and language model.
在一些实施例中,融合单元1110还用于对目标语音数据和噪声数据进行叠加来获取叠加后的音频数据;以及基于叠加后的音频数据获取融合声学特征。In some embodiments, the fusion unit 1110 is further configured to superimpose the target speech data and noise data to obtain superimposed audio data; and obtain fused acoustic features based on the superimposed audio data.
在一些实施例中,融合单元1110还用于基于目标语音数据获取第一声学特征,基于噪声数据获取第二声学特征;基于第一声学特征和第二声学特征获取融合声学特征。In some embodiments, the fusion unit 1110 is also used to obtain the first acoustic feature based on the target speech data, and obtain the second acoustic feature based on the noise data; obtain the fusion acoustic feature based on the first acoustic feature and the second acoustic feature.
在一些实施例中,融合单元1110还用于从目标语音数据生成带噪声学特征;通过增强所述带噪声学特征来生成第一声学特征。In some embodiments, the fusion unit 1110 is further configured to generate noisy acoustic features from the target speech data; and generate first acoustic features by enhancing the noisy acoustic features.
在一些实施例中,融合单元1110还用于对带噪声学特征进行LASSO变换,以及对经LASSO变换的声学特征进行bottleneck网络处理,以获取第一声学特征。In some embodiments, the fusion unit 1110 is further configured to perform LASSO transformation on the acoustic features with noise, and perform bottleneck network processing on the acoustic features transformed by LASSO, so as to obtain the first acoustic features.
在一些实施例中,融合单元1110还用于叠加第一声学特征和第二声学特征,以得到叠加的声学特征;以及通过对叠加的声学特征进行归一化处理,生成融合声学特征。In some embodiments, the fusion unit 1110 is further configured to superimpose the first acoustic feature and the second acoustic feature to obtain superimposed acoustic features; and generate a fusion acoustic feature by normalizing the superimposed acoustic features.
在一些实施例中,融合单元1110还用于获取所述第一声学特征的帧数,第一声学特征的帧数根据目标语音数据的持续时间确定;根据第一声学特征的帧数基于第二声学特征构建第三声学特征;以及叠加第一声学特征和第三声学特征获取融合声学特征。In some embodiments, the fusion unit 1110 is further configured to obtain the number of frames of the first acoustic feature, where the number of frames of the first acoustic feature is determined according to the duration of the target speech data; construct a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimpose the first acoustic feature and the third acoustic feature to obtain the fused acoustic feature.
在一些实施例中,所述声学模型是神经网络模型,并且训练单元1120用于从声学模型的隐藏层提取声源特征;以及将声源特征和融合声学特征作为声学模型的输入特征来训练声学模型。In some embodiments, the acoustic model is a neural network model, and the training unit 1120 is configured to extract sound source features from a hidden layer of the acoustic model, and to train the acoustic model using the sound source features and the fused acoustic features as input features of the acoustic model.
在一些实施例中,语音识别模型构建单元1130还用于接收来自用户终端的目标关键词;根据所述目标关键词的发音序列从所述发音字典获取目标发音字典模型;根据所述目标关键词的字之间的关系从所述语言模型获取目标语言模型;以及通过合并所述声学模型、所述目标发音字典模型和所述目标语言模型来构建所述语音识别模型。In some embodiments, the speech recognition model construction unit 1130 is further configured to receive the target keyword from the user terminal; obtain a target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtain a target language model from the language model according to the relationships between the characters of the target keyword; and construct the speech recognition model by merging the acoustic model, the target pronunciation dictionary model and the target language model.
图12示出了根据本公开的另一实施例的用于语音代听的装置1200。装置1200可以应用于服务器130。装置1200包括目标关键词获取单元1210、语音识别模型构建单元1220和发送单元1230。目标关键词获取单元1210用于获取用户出行信息中的与用户出行方式相关的目标关键词。语音识别模型构建单元1220用于构建与所述目标关键词对应的目标语音识别模型。发送单元1230用于向用户终端发送所述语音识别模型,所述语音识别模型用于当满足目标条件时对用户终端处的环境声音进行识别,以用于确定所述环境声音中是否存在所述目标关键词。Fig. 12 shows an apparatus 1200 for voice listening substitution according to another embodiment of the present disclosure. The apparatus 1200 can be applied to the server 130. The apparatus 1200 includes a target keyword acquisition unit 1210, a speech recognition model construction unit 1220 and a sending unit 1230. The target keyword acquisition unit 1210 is configured to acquire, from the user's travel information, a target keyword related to the user's travel mode. The speech recognition model construction unit 1220 is configured to construct a target speech recognition model corresponding to the target keyword. The sending unit 1230 is configured to send the speech recognition model to the user terminal, where the speech recognition model is used to recognize the ambient sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the ambient sound.
图13示出了可以用来实施本公开的实施例的示例设备1300的示意性框图。如图所示,设备1300包括中央处理单元(CPU)1301,其可以根据存储在只读存储器(ROM)1302中的计算机程序指令或者从存储单元1308加载到随机访问存储器(RAM)1303中的计算机程序指令,来执行各种适当的动作和处理。在RAM 1303中,还可存储设备1300操作所需的各种程序和数据。CPU 1301、ROM 1302以及RAM 1303通过总线1304彼此相连。输入/输出(I/O)接口1305也连接至总线1304。FIG. 13 shows a schematic block diagram of an example device 1300 that may be used to implement embodiments of the present disclosure. As shown, the device 1300 includes a central processing unit (CPU) 1301, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 1302 or computer program instructions loaded from a storage unit 1308 into a random access memory (RAM) 1303. Various programs and data required for the operation of the device 1300 can also be stored in the RAM 1303. The CPU 1301, the ROM 1302 and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
设备1300中的多个部件连接至I/O接口1305,包括:输入单元1306,例如键盘、鼠标等;输出单元1307,例如各种类型的显示器、扬声器等;存储单元1308,例如磁盘、光盘等;以及通信单元1309,例如网卡、调制解调器、无线通信收发机等。通信单元1309允许设备1300通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。A plurality of components in the device 1300 are connected to the I/O interface 1305, including: an input unit 1306, such as a keyboard, a mouse, etc.; an output unit 1307, such as various types of displays, speakers, etc.; a storage unit 1308, such as a magnetic disk, an optical disk, etc.; and a communication unit 1309, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
上文所描述的各个过程和处理可由处理单元1301执行。例如,在一些实施例中,上述各个过程和处理可被实现为计算机软件程序,其被有形地包含于机器可读介质,例如存储单元1308。在一些实施例中,计算机程序的部分或者全部可以经由ROM 1302和/或通信单元1309而被载入和/或安装到设备1300上。当计算机程序被加载到RAM 1303并由CPU 1301执行时,可以执行上文描述的过程和处理的一个或多个动作。The various procedures and processes described above can be executed by the processing unit 1301. For example, in some embodiments, the procedures and processes described above may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the CPU 1301, one or more actions of the procedures and processes described above may be performed.
本公开可以是方法、装置、系统和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于执行本公开的各个方面的计算机可读程序指令。The present disclosure may be a method, apparatus, system and/or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。A computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or flash memory), static random access memory (SRAM), compact disc read only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanically encoded device, such as a printer with instructions stored thereon A hole card or a raised structure in a groove, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or Source or object code written in any combination, including object-oriented programming languages—such as Smalltalk, C++, etc., and conventional procedural programming languages—such as “C” or similar programming languages. Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement. In cases involving a remote computer, the remote computer can be connected to the user computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as via the Internet using an Internet service provider). connect). In some embodiments, an electronic circuit, such as a programmable logic circuit, field programmable gate array (FPGA), or programmable logic array (PLA), can be customized by utilizing state information of computer-readable program instructions, which can Various aspects of the present disclosure are implemented by executing computer readable program instructions.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium and cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two successive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present disclosure have been described above. The foregoing description is illustrative rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (33)

1. A method for voice listening substitution, applied to a user terminal, the method comprising:
    acquiring a target speech recognition model corresponding to a target keyword, wherein the target speech recognition model is constructed according to the target keyword, and the target keyword is obtained according to travel information of a user;
    updating a local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, wherein the local speech recognition model is a speech recognition model stored in the user terminal;
    when a target condition is satisfied, recognizing collected ambient sound according to the updated speech recognition model to obtain a recognition result, wherein the ambient sound is sound information collected in the environment in which the user terminal is located; and
    when the recognition result indicates that the target keyword is present in the ambient sound, prompting the user.
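Illustrative sketch (not part of the claims): a minimal Python rendering of the claim 1 flow on a user terminal. The recognizer interface (`recognize()`), the microphone frame queue, and the prompt string are assumed placeholders, not interfaces defined by this disclosure.

```python
import queue

def listen_for_keyword(target_model, local_store, mic_frames, target_keyword,
                       condition_met=lambda: True):
    """Update the locally stored model with the keyword-specific one, then scan ambient sound."""
    local_store["model"] = target_model              # the "updated speech recognition model"
    while condition_met():                           # the target condition (e.g. a time/place gate)
        try:
            frame = mic_frames.get(timeout=1.0)      # ambient sound around the user terminal
        except queue.Empty:
            continue
        result = local_store["model"].recognize(frame)   # assumed recognizer interface
        if target_keyword in result:
            return f"prompt: detected '{target_keyword}'"  # e.g. play a voice on the terminal
    return None
```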
2. The method according to claim 1, wherein acquiring the target speech recognition model corresponding to the target keyword comprises:
    acquiring the travel information of the user;
    extracting, according to the travel information, a target keyword related to the user's travel mode;
    sending the target keyword to a server, so that the server constructs the target speech recognition model according to the target keyword; and
    acquiring the target speech recognition model from the server.
3. The method according to claim 1, wherein the user terminal is a first user terminal connected to a second user terminal, and the method comprises:
    sending identification information to the second user terminal, wherein the identification information is used to identify the first user terminal;
    wherein acquiring the target speech recognition model corresponding to the target keyword specifically comprises:
    receiving the target speech recognition model from the second user terminal based on the identification information, wherein the target speech recognition model is obtained by the second user terminal from the server according to the target keyword;
    wherein the first user terminal is an audio playback device.
4. The method according to claim 1, wherein the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model, the decoding graph is a set of decoding paths under grammar constraint rules determined by the target keyword, the target pronunciation dictionary model is obtained based on a pronunciation sequence of the target keyword, and the target language model is obtained based on relationships between the characters of the target keyword.
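Illustrative sketch (not part of the claims): one way to realize the grammar-constrained decoding paths of claim 4 is to expand the target keyword through a pronunciation dictionary into the only phone sequence the decoder may follow. The dictionary entries below are invented for illustration and are not taken from the disclosure.

```python
# Assumed toy lexicon mapping characters of a train number to phone sequences.
PRONUNCIATION_DICT = {
    "G": ["g", "e"],
    "1": ["y", "i"],
    "2": ["e", "r"],
    "3": ["s", "a", "n"],
}

def keyword_decoding_paths(keyword):
    """Expand the target keyword into the only phone path the decoder may follow."""
    path = []
    for ch in keyword:
        if ch not in PRONUNCIATION_DICT:
            raise KeyError(f"no pronunciation for '{ch}'")
        path.extend(PRONUNCIATION_DICT[ch])
    # The grammar constraint admits just this keyword, with optional silence at both ends.
    return [["sil"] + path + ["sil"]]

print(keyword_decoding_paths("G123"))
```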
5. The method according to claim 4, wherein the acoustic model is generated by training with a fused feature and text information of target speech data, the fused feature is generated based on the target speech data and noise data, the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content.
6. The method according to claim 1, further comprising:
    acquiring, according to the travel information, location information associated with the user's travel mode;
    wherein, when the target condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model comprises:
    when the location of the user matches the location information, recognizing the collected ambient sound according to the updated speech recognition model.
7. The method according to claim 1, further comprising:
    acquiring, according to the travel information, time information associated with the user's travel mode;
    wherein, when the target condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model comprises:
    when a time condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model, wherein the time condition is determined according to the time information.
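Illustrative sketch (not part of the claims): the target conditions of claims 6 and 7 can be read together as a joint place/time gate derived from the travel information. The 2 km radius and 3 hour lead time below are assumptions, not values given in the disclosure.

```python
import math
from datetime import datetime, timedelta

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def should_listen(now, user_pos, station_pos, departure_time,
                  radius_km=2.0, lead=timedelta(hours=3)):
    """Enable ambient-sound recognition only near the departure time and place."""
    time_ok = departure_time - lead <= now <= departure_time
    return time_ok and haversine_km(user_pos, station_pos) <= radius_km

# Example: listening is active inside the station shortly before departure.
print(should_listen(datetime(2021, 7, 16, 9, 0), (31.249, 121.456), (31.249, 121.455),
                    datetime(2021, 7, 16, 10, 30)))
```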
8. The method according to claim 1, wherein prompting the user comprises playing, on the user terminal, a voice corresponding to the target keyword.
9. The method according to any one of claims 1 to 8, wherein the target keyword is a train number or a flight number.
10. The method according to claim 1, wherein the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a laptop computer.
11. An apparatus for voice listening substitution, for use in a user terminal, comprising:
    a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, wherein the target speech recognition model is constructed according to the target keyword, and the target keyword is obtained according to travel information of a user;
    an updating unit, configured to update a local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, wherein the local speech recognition model is a speech recognition model stored in the user terminal;
    a sound recognition unit, configured to, when a target condition is satisfied, recognize collected ambient sound according to the updated speech recognition model to obtain a recognition result, wherein the ambient sound is sound information collected in the environment in which the user terminal is located; and
    a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword is present in the ambient sound.
12. The apparatus according to claim 11, further comprising:
    a travel information acquisition unit, configured to acquire the travel information of the user;
    a target keyword acquisition unit, configured to extract, according to the travel information, a target keyword related to the user's travel mode; and
    a sending unit, configured to send the target keyword to a server, so that the server constructs the target speech recognition model according to the target keyword;
    wherein the model acquisition unit is further configured to acquire the target speech recognition model from the server.
13. The apparatus according to claim 11, wherein the user terminal is a first user terminal connected to a second user terminal, the apparatus further comprising:
    a sending unit, configured to send identification information to the second user terminal, wherein the identification information is used to identify the first user terminal;
    wherein the model acquisition unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, and the target speech recognition model is obtained by the second user terminal from the server according to the target keyword;
    wherein the first user terminal is an audio playback device.
14. The apparatus according to claim 11, wherein the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model, the decoding graph is a set of decoding paths under grammar constraint rules determined by the target keyword, the target pronunciation dictionary model is obtained based on a pronunciation sequence of the target keyword, and the target language model is obtained based on relationships between the characters of the target keyword.
15. The apparatus according to claim 14, wherein the acoustic model is generated by training with a fused feature and text information of target speech data, the fused feature is generated based on the target speech data and noise data, the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content.
16. The apparatus according to claim 11, further comprising:
    a travel location information acquisition unit, configured to acquire, according to the travel information, location information associated with the user's travel mode;
    wherein the sound recognition unit is further configured to: when the location of the user matches the location information, recognize the collected ambient sound according to the updated speech recognition model.
17. The apparatus according to claim 11, further comprising:
    a travel time information acquisition unit, configured to acquire, according to the travel information, time information associated with the user's travel mode;
    wherein the sound recognition unit is further configured to: when a time condition is satisfied, recognize the collected ambient sound according to the updated speech recognition model, wherein the time condition is determined according to the time information.
18. The apparatus according to claim 11, wherein the prompting unit is further configured to play, on the user terminal, a voice corresponding to the target keyword.
19. The apparatus according to any one of claims 11 to 18, wherein the target keyword is a train number or a flight number.
20. The apparatus according to claim 11, wherein the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a laptop computer.
21. A method for generating a speech recognition model, comprising:
    generating a fused acoustic feature based on target speech data and noise data, wherein the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content;
    generating an acoustic model by training with the fused acoustic feature and text information of the target speech data; and
    constructing the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
22. The method according to claim 21, wherein generating the fused acoustic feature comprises:
    superimposing the target speech data and the noise data to obtain superimposed audio data; and
    acquiring the fused acoustic feature based on the superimposed audio data.
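Illustrative sketch (not part of the claims): claim 22 superimposes the waveforms first and then extracts features from the mixture. The SNR-based scaling and the crude band-energy feature extractor below are assumptions standing in for the front end, which the claims do not specify.

```python
import numpy as np

def superimpose(target_wave, noise_wave, snr_db=10.0):
    """Mix noise (e.g. station announcements) into the target speech at a chosen SNR."""
    noise = np.resize(noise_wave, target_wave.shape)           # loop/trim noise to same length
    p_t = np.mean(target_wave ** 2) + 1e-12
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_n * 10 ** (snr_db / 10)))
    return target_wave + scale * noise

def band_energy_features(wave, frame_len=400, hop=160, n_bins=40):
    """A crude stand-in for acoustic feature extraction: log power in evenly split bands."""
    frames = [wave[i:i + frame_len] for i in range(0, len(wave) - frame_len, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame_len), axis=1)) ** 2
    bands = np.array_split(spec, n_bins, axis=1)
    return np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-8)

rng = np.random.default_rng(0)
mixed = superimpose(rng.standard_normal(16000), rng.standard_normal(8000))
fused_feature = band_energy_features(mixed)    # feature of the superimposed audio (claim 22)
print(fused_feature.shape)
```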
23. The method according to claim 21, wherein generating the fused acoustic feature comprises:
    acquiring a first acoustic feature based on the target speech data;
    acquiring a second acoustic feature based on the noise data; and
    acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature.
24. The method according to claim 23, wherein acquiring the first acoustic feature based on the target speech data comprises:
    generating a noisy acoustic feature from the target speech data; and
    generating the first acoustic feature by enhancing the noisy acoustic feature.
25. The method according to claim 24, wherein enhancing the noisy acoustic feature comprises:
    performing a LASSO transform on the noisy acoustic feature; and
    performing bottleneck network processing on the LASSO-transformed acoustic feature to obtain the first acoustic feature.
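Illustrative sketch (not part of the claims): claim 25 names a LASSO transform followed by bottleneck network processing but does not fix their structure. The sketch below approximates them with an L1-regularised linear layer and a small bottleneck network; the layer sizes and the L1 penalty are assumptions about the intended shape, not the disclosed design.

```python
import torch
import torch.nn as nn

class EnhanceNoisyFeatures(nn.Module):
    """Assumed enhancement path: sparse linear transform, then a bottleneck network."""
    def __init__(self, feat_dim=40, bottleneck_dim=16):
        super().__init__()
        self.lasso = nn.Linear(feat_dim, feat_dim)             # weights kept sparse via L1
        self.bottleneck = nn.Sequential(
            nn.Linear(feat_dim, bottleneck_dim), nn.ReLU(),
            nn.Linear(bottleneck_dim, feat_dim),
        )

    def forward(self, noisy_feats):
        return self.bottleneck(torch.relu(self.lasso(noisy_feats)))

    def l1_penalty(self):
        return self.lasso.weight.abs().sum()                   # add to the training loss

model = EnhanceNoisyFeatures()
first_acoustic_feature = model(torch.randn(100, 40))           # 100 frames of 40-dim features
print(first_acoustic_feature.shape, float(model.l1_penalty()))
```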
26. The method according to claim 23, wherein acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature comprises:
    superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and
    generating the fused acoustic feature by normalizing the superimposed acoustic feature.
27. The method according to claim 23, wherein acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature comprises:
    acquiring a frame count of the first acoustic feature, wherein the frame count of the first acoustic feature is determined according to the duration of the target speech data;
    constructing a third acoustic feature based on the second acoustic feature according to the frame count of the first acoustic feature; and
    superimposing the first acoustic feature and the third acoustic feature to acquire the fused acoustic feature.
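Illustrative sketch (not part of the claims): claims 23, 26, and 27 fuse the two feature streams by matching the noise-side features to the frame count of the speech-side features, superimposing them, and normalizing the result. The wrap-around tiling and per-dimension normalization below are assumptions about the unspecified details.

```python
import numpy as np

def fuse_features(first_feats, second_feats):
    """Tile/trim the noise-side features to the speech frame count, superimpose, normalise."""
    n_frames = first_feats.shape[0]                       # frame count follows the target speech
    third_feats = np.resize(second_feats, (n_frames, second_feats.shape[1]))
    stacked = first_feats + third_feats                   # superimposed acoustic feature
    mean = stacked.mean(axis=0, keepdims=True)
    std = stacked.std(axis=0, keepdims=True) + 1e-8
    return (stacked - mean) / std                         # fused acoustic feature

rng = np.random.default_rng(1)
fused = fuse_features(rng.standard_normal((120, 40)), rng.standard_normal((75, 40)))
print(fused.shape)   # (120, 40): same frame count as the target speech features
```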
28. The method according to claim 21, wherein the acoustic model is a neural network model, and the training comprises:
    extracting a sound source feature from a hidden layer of the acoustic model; and
    training the acoustic model with the sound source feature and the fused acoustic feature as input features of the acoustic model.
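Illustrative sketch (not part of the claims): claim 28 reuses a hidden-layer activation as a sound source feature and feeds it back as an additional input. Reading the sound source feature as a mean-pooled utterance embedding, and the layer sizes, are assumptions; the disclosure does not fix these choices.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Assumed structure: a hidden layer doubles as a sound-source embedding that is
    concatenated with the fused features before the output classifier."""
    def __init__(self, feat_dim=40, hidden_dim=128, n_states=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim + hidden_dim, n_states)

    def forward(self, fused_feats):
        hidden = self.encoder(fused_feats)                    # per-frame hidden activations
        source_feat = hidden.mean(dim=0, keepdim=True)        # utterance-level source feature
        source_feat = source_feat.expand(fused_feats.shape[0], -1)
        return self.classifier(torch.cat([hidden, source_feat], dim=-1))

logits = AcousticModel()(torch.randn(120, 40))
print(logits.shape)   # (120, 200): per-frame acoustic-state scores, trained against text labels
```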
29. The method according to claim 21, wherein constructing the speech recognition model according to the acoustic model, the pronunciation dictionary, and the language model comprises:
    receiving a target keyword from a user terminal;
    acquiring a target pronunciation dictionary model from the pronunciation dictionary according to a pronunciation sequence of the target keyword;
    acquiring a target language model from the language model according to relationships between the characters of the target keyword; and
    constructing the speech recognition model by combining the acoustic model, the target pronunciation dictionary model, and the target language model.
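Illustrative sketch (not part of the claims): claim 29 builds the keyword recognizer on the server by extracting a keyword-specific pronunciation dictionary and language model and combining them with the acoustic model. The class and function names, the toy lexicon, and the degenerate bigram language model below are illustrative assumptions only.

```python
class KeywordRecognizer:
    """Container for the combined acoustic model, target lexicon, and target language model."""
    def __init__(self, acoustic_model, pron_dict, language_model):
        self.acoustic_model = acoustic_model
        self.pron_dict = pron_dict
        self.language_model = language_model

def build_recognizer(acoustic_model, full_lexicon, keyword):
    target_pron = {ch: full_lexicon[ch] for ch in keyword}     # target pronunciation dictionary
    bigrams = list(zip(keyword, keyword[1:]))                  # character-to-character relations
    target_lm = {pair: 1.0 for pair in bigrams}                # degenerate target language model
    return KeywordRecognizer(acoustic_model, target_pron, target_lm)

recognizer = build_recognizer(acoustic_model=None,
                              full_lexicon={"G": ["g", "e"], "1": ["y", "i"],
                                            "2": ["e", "r"], "3": ["s", "a", "n"]},
                              keyword="G123")
print(recognizer.pron_dict, recognizer.language_model)
```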
30. An apparatus for generating a speech recognition model, comprising:
    a fusion unit, configured to generate a fused acoustic feature based on target speech data and noise data, wherein the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content;
    a training unit, configured to generate an acoustic model by training with the fused acoustic feature and text information of the target speech data; and
    a speech recognition model construction unit, configured to construct the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
31. An electronic device, comprising:
    at least one computing unit; and
    at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, wherein the instructions, when executed by the at least one computing unit, cause the device to perform the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
32. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
33. A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, perform the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
PCT/CN2021/106942 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium WO2023283965A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180093163.7A CN117178320A (en) 2021-07-16 2021-07-16 Method, apparatus, electronic device and medium for voice hearing and generating voice recognition model
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Publications (1)

Publication Number Publication Date
WO2023283965A1 true WO2023283965A1 (en) 2023-01-19

Family

ID=84918923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN117178320A (en)
WO (1) WO2023283965A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531155A (en) * 2015-09-10 2017-03-22 三星电子株式会社 Apparatus and method for generating acoustic model, and apparatus and method for speech recognition
CN109087631A (en) * 2018-08-08 2018-12-25 北京航空航天大学 A kind of Vehicular intelligent speech control system and its construction method suitable for complex environment
CN109599093A (en) * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection
US10365887B1 (en) * 2016-03-25 2019-07-30 Amazon Technologies, Inc. Generating commands based on location and wakeword
CN110232916A (en) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110600014A (en) * 2019-09-19 2019-12-20 深圳酷派技术有限公司 Model training method and device, storage medium and electronic equipment
CN110708630A (en) * 2019-11-12 2020-01-17 广州酷狗计算机科技有限公司 Method, device and equipment for controlling earphone and storage medium
CN111601215A (en) * 2020-04-20 2020-08-28 南京西觉硕信息科技有限公司 Scene-based key information reminding method, system and device

Also Published As

Publication number Publication date
CN117178320A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
JP7114660B2 (en) Hot word trigger suppression for recording media
US11195531B1 (en) Accessory for a voice-controlled device
US9602938B2 (en) Sound library and method
US10311872B2 (en) Utterance classifier
US10869154B2 (en) Location-based personal audio
US11039240B2 (en) Adaptive headphone system
US20170213552A1 (en) Detection of audio public announcements by a mobile device
CN108351872A (en) Equipment selection for providing response
JP6987124B2 (en) Interpreters and methods (DEVICE AND METHOD OF TRANSLATING A LANGUAGE)
CN109643548A (en) System and method for content to be routed to associated output equipment
WO2022126040A1 (en) User speech profile management
US10002611B1 (en) Asynchronous audio messaging
WO2023283965A1 (en) Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium
WO2023040658A1 (en) Speech interaction method and electronic device
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
US20220157305A1 (en) Information processing apparatus, information processing method, and program
WO2020208972A1 (en) Response generation device and response generation method
JP6772468B2 (en) Management device, information processing device, information provision system, language information management method, information provision method, and operation method of information processing device
US20240087597A1 (en) Source speech modification based on an input speech characteristic
CN118095301A (en) Synchronous translation method, electronic device and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949746

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21949746

Country of ref document: EP

Kind code of ref document: A1