WO2023283965A1 - Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium - Google Patents

Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Info

Publication number
WO2023283965A1
WO2023283965A1 · PCT/CN2021/106942 · CN2021106942W
Authority
WO
WIPO (PCT)
Prior art keywords
target
speech recognition
recognition model
model
user terminal
Prior art date
Application number
PCT/CN2021/106942
Other languages
French (fr)
Chinese (zh)
Inventor
殷实
黄韬
翟毅斌
伍朝晖
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180093163.7A (CN117178320A)
Priority to PCT/CN2021/106942 (WO2023283965A1)
Publication of WO2023283965A1

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F 3/16 - Sound input; Sound output
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 - Speech recognition
            • G10L 15/08 - Speech classification or search
            • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
            • G10L 15/26 - Speech to text systems
          • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility

Definitions

  • the present invention relates to the technical field of artificial intelligence, in particular to methods, devices, electronic equipment and media for voice listening and generating voice recognition models.
  • Embodiments of the present disclosure provide a solution for voice recognition, which realizes voice listening for personalized keywords.
  • A method for voice listening substitution is provided. The method is applied to a user terminal and includes: acquiring a target speech recognition model corresponding to a target keyword, the target speech recognition model being constructed based on the target keyword, and the target keyword being obtained according to the travel information of the user; updating the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being the speech recognition model stored in the user terminal; when the target condition is satisfied, recognizing the collected environmental sound according to the updated speech recognition model to obtain a recognition result, the environmental sound being the sound information collected in the environment where the user terminal is located; and prompting the user when the recognition result indicates that the target keyword exists in the environmental sound.
  • In this way, the user terminal can monitor the ambient sound for the target keywords of the travel information and remind the user when the ambient sound includes the voice of a target keyword, thereby realizing an intelligent listening function in which the device listens in place of the human ear.
  • obtaining the target speech recognition model corresponding to the target keyword includes: obtaining the travel information of the user; extracting the target keyword related to the travel mode of the user according to the travel information; sending the target keyword to the server, for the server to construct the target speech recognition model according to the target keyword; and obtain the target speech recognition model from the server.
  • targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
  • the user terminal is a first user terminal and is connected to a second user terminal
  • the method further includes sending identification information to the second user terminal, the identification information being used to identify the first user terminal.
  • the acquiring of the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being obtained by the second user terminal from the server according to the target keyword; wherein the first user terminal is an audio playback device. In this way, intelligent proxy listening can be realized when the user uses an audio playback device (such as a headset).
  • In some embodiments, the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model; the decoding graph is a set of decoding paths under grammatically constrained rules determined by the target keyword; the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword; and the target language model is acquired based on the relationship between the words of the target keyword.
  • the acoustic model is generated by training with fusion features and the text information of target speech data; the fusion features are generated based on the target speech data and noise data, the target speech data is audio data that includes the target speech content, and the noise data is audio data that does not include the target speech content.
  • the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
  • training is performed using the fusion features and the text information of the target speech data.
  • the text information of the target speech data can be direct text content or other labeled data corresponding to the text content, such as phoneme sequences.
  • In some embodiments, the method further includes: acquiring location information associated with the user's travel mode according to the travel information; and recognizing the collected environmental sound according to the updated speech recognition model when the target condition is met specifically includes: recognizing the collected environmental sound according to the updated speech recognition model when the location of the user matches the location information. In this way, when the geographic-location target condition is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains the keywords, without user interaction, bringing a better user experience.
  • In some embodiments, the method further includes: acquiring time information associated with the user's travel mode according to the travel information; and recognizing the collected environmental sound according to the updated speech recognition model when the target condition is met specifically includes: recognizing the collected environmental sound according to the updated speech recognition model when a time condition is satisfied, the time condition being determined according to the time information.
  • the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • prompting the user includes playing a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
  • An apparatus for voice listening substitution is provided, including: a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, the target speech recognition model being constructed based on the target keyword, and the target keyword being obtained according to the travel information of the user; an update unit, configured to update the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being the speech recognition model stored in the user terminal; a sound recognition unit, configured to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met and obtain a recognition result, the environmental sound being the sound information collected in the environment where the user terminal is located; and a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword exists in the environmental sound.
  • the apparatus further includes: a target keyword acquisition unit, configured to acquire the travel information of the user and to extract the target keyword related to the user's travel mode according to the travel information; and a sending unit, configured to send the target keyword in the travel information to a server, so that the server can construct the target speech recognition model according to the target keyword.
  • the model acquiring unit is further configured to acquire the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
  • the user terminal is a first user terminal and is connected to a second user terminal.
  • the apparatus further includes a sending unit, configured to send identification information to the second user terminal, where the identification information is used to identify the first user terminal.
  • the model obtaining unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, and the target speech recognition model is obtained by the second user terminal from the server according to the target keyword .
  • the first user terminal is an audio playback device. In this way, it is possible to realize intelligent proxy listening when the user uses an audio playback device (such as earphones).
  • In some embodiments, the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model; the decoding graph is a set of decoding paths under grammatically constrained rules determined by the target keyword; the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword; and the target language model is acquired based on the relationship between the words of the target keyword.
  • the acoustic model is generated by training with fusion features and the text information of target speech data; the fusion features are generated based on the target speech data and noise data, the target speech data is audio data that includes the target speech content, and the noise data is audio data that does not include the target speech content.
  • the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
  • the device further includes a travel location information acquiring unit, configured to acquire location information associated with the travel mode of the user according to the travel information.
  • the sound recognition unit is further configured to: when the location of the user matches the location information, recognize the collected environmental sound according to the updated speech recognition model. In this way, when the target condition of the geographic location is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • the device further includes a travel time information obtaining unit, configured to obtain time information associated with the travel mode of the user according to the travel information.
  • the sound recognition unit is further configured to: when a time condition is satisfied, recognize the collected environmental sound according to the updated speech recognition model, the time condition being determined according to the time information.
  • the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
  • the prompting unit is further configured to: play a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
  • A method for generating a speech recognition model is provided, including: generating fusion acoustic features based on target speech data and noise data, the target speech data being audio data that includes the target speech content and the noise data being audio data that does not include the target speech content; generating the acoustic model by training with the fusion features and the text information of the target speech data; and constructing the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
  • the acoustic model trained using fused features can be used to accurately recognize speech in high-noise, strong-reverberation environments, thereby realizing personalized intelligent listening substitutes.
  • generating the fused acoustic feature includes: superimposing the target voice data and the noise data to obtain superimposed audio data; and obtaining the fused acoustic feature based on the superimposed audio data .
  • generating the fused acoustic feature includes: acquiring a first acoustic feature based on the target speech data; acquiring a second acoustic feature based on the noise data; and fusing the first acoustic feature and the second acoustic feature to obtain the fused acoustic feature.
  • the acquiring of the first acoustic feature based on the target speech data includes: generating a noisy acoustic feature from the target speech data; and generating the first acoustic feature by enhancing the noisy acoustic feature.
  • enhancing the noisy acoustic feature includes: performing LASSO transformation on the noisy acoustic feature; and performing bottleneck network processing on the LASSO transformed acoustic feature to obtain the first acoustic feature.
  • the obtaining of the fused acoustic feature based on the first acoustic feature and the second acoustic feature includes: superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and generating the fused acoustic feature by performing normalization processing on the superimposed acoustic feature.
  • acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature includes: acquiring the number of frames of the first acoustic feature, the number of frames being determined according to the duration of the target speech data; constructing a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimposing the first acoustic feature and the third acoustic feature to obtain the fused acoustic feature.
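  • As a non-authoritative illustration of the feature-level fusion described above, the following sketch assumes log-mel-style feature matrices shaped (frames, dims): the noise features are tiled or truncated to the frame count of the speech features (playing the role of the "third acoustic feature"), the two are superimposed, and the result is normalized. The function name and the mean/variance normalization are assumptions made for this sketch, not the exact procedure of the disclosure.

```python
import numpy as np

def fuse_features(speech_feats: np.ndarray, noise_feats: np.ndarray) -> np.ndarray:
    """Fuse speech and noise acoustic features, both shaped [frames, dims]."""
    n_frames = speech_feats.shape[0]              # frame count fixed by the speech duration
    reps = -(-n_frames // noise_feats.shape[0])   # ceiling division
    noise_matched = np.tile(noise_feats, (reps, 1))[:n_frames]  # the "third" feature

    fused = speech_feats + noise_matched          # superimpose frame by frame
    # per-dimension mean/variance normalization (one possible normalization choice)
    return (fused - fused.mean(axis=0)) / (fused.std(axis=0) + 1e-8)

# toy usage with random stand-in features
speech = np.random.randn(300, 40)   # e.g. 3 s of 40-dim features at a 10 ms hop
noise = np.random.randn(120, 40)
print(fuse_features(speech, noise).shape)  # (300, 40)
```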
  • the acoustic model is a neural network model
  • the training includes: extracting sound source features from a hidden layer of the acoustic model; and using the sound source features and the fused acoustic features as the input features of the acoustic model to train the acoustic model.
  • the constructing of the speech recognition model according to the acoustic model, the pronunciation dictionary, and the language model specifically includes: receiving a target keyword from a user terminal; obtaining a target pronunciation dictionary model from the pronunciation dictionary according to the target keyword; obtaining a target language model from the language model according to the target keyword; and constructing the speech recognition model by merging the acoustic model, the target pronunciation dictionary model, and the target language model.
  • a lightweight speech recognition model for specific keywords can be generated, which is suitable for user terminals with limited computing resources.
  • An apparatus for generating a speech recognition model is provided, including: a fusion unit, configured to generate fusion acoustic features based on target speech data and noise data, the target speech data being audio data that includes the target speech content and the noise data being audio data that does not include the target speech content; a training unit, configured to generate the acoustic model by training with the fusion features and the text information of the target speech data; and a speech recognition model construction unit, configured to construct the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
  • A method for voice listening substitution is provided, including: obtaining a target keyword related to the user's travel mode from the user's travel information; constructing a target speech recognition model corresponding to the target keyword; and sending the target speech recognition model to the user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the ambient sound.
  • target speech recognition models for specific keywords can be generated and deployed to achieve intelligent voice listening for specific keywords.
  • An apparatus for voice listening substitution is provided, including: a target keyword acquisition unit, configured to acquire a target keyword related to the user's travel mode from the user's travel information; a speech recognition model construction unit, configured to construct a target speech recognition model corresponding to the target keyword; and a sending unit, configured to send the target speech recognition model to the user terminal, the target speech recognition model being used to recognize the environmental sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the environmental sound.
  • a target speech recognition model for specific keywords can be generated and deployed, so as to realize intelligent voice listening for specific keywords.
  • An electronic device is provided, comprising: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions executable by the at least one computing unit, wherein the instructions, when executed by the at least one computing unit, cause the electronic device to perform the method according to the first aspect, the third aspect, or the fifth aspect of the present disclosure.
  • A computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in the first, third, or fifth aspect of the present disclosure.
  • A computer program product comprising computer-executable instructions is provided, wherein the computer-executable instructions, when executed by a processor, implement the method described in the first, third, or fifth aspect of the present disclosure.
  • FIG. 1 shows a schematic diagram of an example environment in which various embodiments according to the present disclosure can be implemented.
  • Fig. 2 shows a schematic block diagram of a speech recognition system according to an embodiment of the present disclosure.
  • Fig. 3 shows a schematic flowchart of a method for voice listening substitution according to an embodiment of the present disclosure.
  • FIG. 4 shows a schematic diagram of an example process of building and deploying a speech recognition model according to an embodiment of the present disclosure.
  • Fig. 5 shows a schematic flowchart of a method for generating an acoustic model according to an embodiment of the present disclosure.
  • Fig. 6 shows a schematic flow chart of a method for enhancing speech acoustic features according to an embodiment of the present disclosure.
  • Fig. 7 shows a schematic conceptual diagram of a method for generating fusion features according to an embodiment of the present disclosure.
  • Fig. 8 shows a schematic diagram of a feature fusion process according to an embodiment of the present disclosure.
  • FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
  • Fig. 10 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
  • Fig. 11 shows a schematic block diagram of an apparatus for generating a speech recognition model according to an embodiment of the present disclosure.
  • Fig. 12 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
  • Figure 13 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
  • the present disclosure provides voice listening technology.
  • the user terminal obtains a voice recognition model for identifying personalized keywords, and uses the voice recognition model to monitor keywords of travel information in environmental sounds.
  • The user is prompted when the speech recognition model recognizes the target keyword. That is to say, the speech recognition model takes over the user's monitoring of environmental sounds, provides the user with prompts about travel information, and achieves a better intelligent experience.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments according to the present disclosure can be implemented.
  • The application scenario according to the embodiments of the present disclosure is that a user terminal (for example, a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a notebook computer, etc.) in an environment with high noise and strong reverberation can recognize personalized content in non-human-voice broadcasts, such as flight numbers and train numbers, helping users monitor voice content in the environment. For example, when a user wears noise-canceling headphones and listens to music, the user terminal can identify keywords that the user is interested in within the external broadcast and send a reminder to the user, thereby realizing intelligent listening.
  • the example environment 100 includes a first user terminal 110 and a second user terminal 120 located on the user side and a server 130 located on the cloud side.
  • the first user terminal 110 and the second user terminal 120 as a whole can connect and communicate with the server 130 via various wired or wireless communication technologies, including but not limited to Ethernet, cellular network (4G, 5G, etc.), wireless local area network (eg, WiFi), Internet, Bluetooth, Near Field Communication (NFC), Infrared (IR), etc.
  • the first user terminal 110 may not be directly connected to the server 130 and the second user terminal 120 may be connected to the server 130 .
  • the first user terminal 110 may connect to the server 130 via the second user terminal 120 and communicate with the server 130 .
  • the first user terminal 110 can be connected to the second user terminal 120 through Bluetooth, infrared, NFC and other short-distance communication, and at the same time, the second user terminal communicates and transmits data with the server 130 through a wireless local area network, the Internet, or a cellular network.
  • the first user terminal 110 may be directly connected to the server 130 .
  • the first user terminal 110 can communicate and transmit data with the server 130 through a wireless local area network, the Internet, or a cellular network.
  • the first user terminal 110 and the second user terminal 120 can communicate and transmit data with each other.
  • the second user terminal 120 may transmit target keywords to the server 130 on the cloud side, such as the train number or flight number of the travel information.
  • the first user terminal 110 may receive the target speech recognition model for the target keyword from the server 130 .
  • the server 130 may generate a target speech recognition model, such as a decoding graph, according to the received target keywords.
  • the decoding graph is a lightweight speech recognition model that is easy to deploy on user terminals with limited computing resources.
  • the target speech recognition model is sent to the user side for deployment in the user terminal or to update the local speech recognition model of the user terminal, so as to realize intelligent listening on the user side, that is, to monitor whether there is a speech corresponding to the target keyword in the ambient sound.
  • Although the target keyword is sent from the second user terminal 120 to the server 130 and the target speech recognition model is received by the first user terminal 110, it should be understood that the target keyword can be sent to the server 130 from any user terminal, and the target speech recognition model can be sent to and deployed on any user terminal.
  • the first user terminal 110 is a noise-canceling headset
  • the second user terminal 120 is a smartphone
  • the first user terminal 110 is connected to the second user terminal 120 via Bluetooth.
  • an application may be installed on the second user terminal 120, such as an application related to the user's travel, a short message service application, or any other application that stores the user's future travel information.
  • the personalized information that the user wants to be intelligently listened to can be obtained by accessing the application of the second user terminal 120 .
  • the second user terminal 120 can automatically obtain the personalized information desired by the user from the specified application installed on it, for example, the above-mentioned application related to the user's travel, short message service application, etc., and send it to the server 130 , to be used to generate the target speech recognition model for the personalized information.
  • The intelligent listening agent according to the embodiments of the present disclosure can also be implemented using a single user terminal that sends the personalized information to the server 130 and receives the target speech recognition model from the server 130 for use in listening to voice content in the environment.
  • FIG. 2 shows a schematic block diagram of a speech recognition system 200 according to an embodiment of the present disclosure.
  • the speech recognition system 200 is used to generate, deploy and use a target speech recognition model for personalized target keywords to detect whether the target keywords exist in ambient sounds.
  • the speech recognition system 200 includes a first user terminal 110 and a second user terminal 120 on the user side, and a server 130 on the cloud side.
  • The first user terminal 110 may be an audio playback device (for example, noise-canceling headphones, a smart speaker, etc.) or a wearable device (for example, a smart watch, a wristband, etc.), and it can connect to the second user terminal 120 through short-range communication such as Bluetooth or infrared.
  • the second user terminal 120 may be a smart phone, a smart home appliance, a tablet computer, a notebook computer, etc., and it can be connected to the server 130 via a wired or wireless manner such as a wireless local area network, the Internet, or a cellular network.
  • the server 130 is configured to receive the personalized target keyword fed back from the second user terminal 120, and generate a target speech recognition model for the target keyword. Exemplary functional modules of the first user terminal 110, the second user terminal 120, and the server 130 are described below.
  • the second user terminal 120 includes a transmission communication module 122 , a keyword acquisition module 124 and a storage module 126 .
  • the transmission communication module 122 is used to transmit and receive data to and from the first user terminal 110 and the server 130 .
  • it communicates with the first user terminal 110 through bluetooth, near-field communication, infrared, etc., and communicates with the server 130 through a cellular network, a wireless local area network, and the like.
  • the keyword acquisition module 124 is used to acquire keywords as personalized information.
  • user travel information can be read from text messages or travel applications, and target keywords can be extracted therefrom.
  • the keyword acquisition module 124 is configured to extract keywords in the travel information, such as flight number/train number, etc., through compliance schemes (eg, designated applications authorized by users, such as travel applications or short message services, etc.).
  • The keyword acquisition module 124 may regularly access a designated application to acquire future travel information.
  • the travel information usually includes the traveler's name, flight number or train number, time information, location information, and so on.
  • the flight number or train number usually includes a character string composed of numbers and letters. Therefore, the flight number or train number in the travel information can be determined as a target keyword to be used for speech recognition.
  • the target keyword can be determined by, for example, a regular expression.
  • time and location information, etc. can also be obtained from travel information.
  • The storage module 126 can be used to store the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120 (for example, the identification information, address, etc. of the first user terminal 110), the target speech recognition model received from the server 130, and the request identifier.
  • the request identifier can be used as a unique identifier for a request to the server for the target speech recognition model.
  • the second user terminal 120 can determine whether the target speech recognition model is requested by itself according to the request identifier, thereby determining whether to receive or not.
  • the first user terminal 110 includes a transmission communication module 112 , a speech recognition model 114 and a prompt module 116 .
  • the transmission communication module 112 is used for sending and receiving data to and from the second user terminal 120 .
  • For example, it communicates with the second user terminal 120 through Bluetooth, near-field communication, infrared, and other means.
  • the transmission communication module 112 is also used for communicating with the server 130, for example via a cellular network or Wifi.
  • the speech recognition model 114 is generated based on one or more target keywords, and may be updated according to the target speech recognition model for new target keywords received from the server 130 .
  • The speech recognition model 114 may be configured to recognize a plurality of keywords by listening, at runtime, for whether the target keywords are included in the ambient sound. Updating the speech recognition model enables the updated speech recognition model 114 to monitor whether the ambient sound includes a new target keyword, for example by adding the new target keyword, or by replacing one of the existing target keywords (for example, the target keyword that has existed for the longest time) with the new target keyword.
  • the updated speech recognition model 114 can trigger the prompt module 116 to generate prompt information.
  • The prompt module 116 may cause the first user terminal 110 or the second user terminal 120 to issue audible or visual prompts.
  • the server 130 includes a transmission communication module 132 , a speech recognition model building module 134 , an offline acoustic model training module 136 , and a model library 138 .
  • The transmission communication module 132 is configured to receive the target keyword transmitted by the keyword acquisition module 124 and then forward it to the speech recognition model construction module 134.
  • the speech recognition model construction module 134 is configured to construct a customized target speech recognition model according to the received target keywords and the model library 138, and transmit the constructed target speech recognition model to the first user terminal 110 or the second user terminal 120.
  • The offline acoustic model training module 136 is configured to pre-train the acoustic model offline using a robust acoustic model training method in accordance with the training criteria of the speech recognition acoustic model.
  • The trained acoustic models may be stored in the model library 138. It should be noted that the operation of training the acoustic model can be performed offline, so it can be decoupled from the construction process of the speech recognition model construction module 134.
  • the acoustic model can be designed to be generated for an environment with high noise and strong reverberation, for example, based on fusion features to achieve more accurate speech recognition.
  • The model library 138 is configured to store trained models, including acoustic models trained offline on demand (obtained through the above-mentioned offline acoustic model training module 136), pronunciation dictionaries, language models, and the like. These models can all be trained offline and used by the speech recognition model construction module 134 to construct a target speech recognition model for the target keywords.
  • The speech recognition model construction module 134 can be configured to combine the pre-trained acoustic model, the pronunciation dictionary, and the language model in the model library 138 with the target keywords transmitted by the transmission communication module 132, and generate the target speech recognition model according to a keyword recognition model construction algorithm. It should be noted that the process of building the target speech recognition model has no strong dependence on the training operation of the offline acoustic model and can be executed asynchronously. Therefore, the speech recognition model construction module 134 can obtain a pre-trained acoustic model from the model library 138 to construct the target speech recognition model.
  • Although the first user terminal 110 and the second user terminal 120 are shown as separate devices in FIG. 2, they may also be implemented as the same device (as shown by the dotted line in the figure) to implement the intelligent listening substitution solution according to the embodiments of the present disclosure.
  • target keywords are acquired from a single user terminal, and a speech recognition model for the target keywords is deployed on the same user terminal.
  • Fig. 3 shows a schematic flowchart of a method 300 for voice listening substitution according to an embodiment of the present disclosure.
  • the method 300 may be implemented on the user terminal 110 as shown in FIG. 1 and FIG. 2 .
  • the user terminal 110 may be, for example, a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a notebook computer, etc., which have a sensor capable of receiving sound, such as a microphone.
  • the user terminal 110 obtains the target speech recognition model corresponding to the target keyword, the target speech recognition model is constructed by the server 130 according to the target keyword, and the target keyword is obtained according to the travel information of the user.
  • the user terminal 110 may receive the target speech recognition model from the connected user terminal 120 (such as a smart phone) via a wireless connection such as Bluetooth.
  • the user terminal 110 may directly receive the target speech recognition model from the server 130 .
  • the user's travel information indicates that the user will go to airports, stations and other transportation places to travel by plane or train, or to pick up and drop off other people at airports or stations.
  • Travel information usually includes information such as flight number or train number, location of traffic place, departure time or arrival time.
  • the target keyword in the travel information can be a character string representing flight number or train number, usually composed of letters and numbers.
  • For example, the travel information may include the following information: "June 2, 2021 at 7:45 am, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109", the location is "Beijing South Railway Station", and the time is "7:45 am on June 2, 2021". A minimal extraction sketch is shown below.
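  • As a rough illustration only, the snippet below parses a travel notification like the example above with regular expressions; the patterns and variable names are assumptions made for this sketch and are not the extraction logic actually used by the keyword acquisition module.

```python
import re

message = "June 2, 2021 at 7:45 am, G109, Beijing South to Shanghai Hongqiao"

# A flight or train number is typically a short string of letters and digits,
# e.g. "G109" or "CA1831"; this pattern is an illustrative assumption.
keyword_match = re.search(r"\b([A-Z]{1,2}\d{2,4})\b", message)
time_match = re.search(r"[A-Za-z]+ \d{1,2}, \d{4} at \d{1,2}:\d{2} ?[ap]m", message, re.I)

target_keyword = keyword_match.group(1) if keyword_match else None
travel_time = time_match.group(0) if time_match else None

print(target_keyword)  # G109
print(travel_time)     # June 2, 2021 at 7:45 am
```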
  • the target speech recognition model is constructed by the server 130 based on the received target keywords.
  • For example, the target keyword may be obtained by another user terminal 120 (e.g., a smartphone) connected to the user terminal 110 (e.g., a noise-canceling headset) and sent to the server 130; the travel information of the user is obtained by accessing the travel application, SMS, or other authorized applications on the user terminal 120.
  • the user terminal may send target keywords in the travel information, such as flight number or train number, to the server 130 .
  • the server 130 may construct a target speech recognition model for recognizing the target keyword based on the received target keyword and transmit the constructed target speech recognition model to the user terminal 110 , which will be described below with reference to FIG. 4 .
  • The local speech recognition model is updated according to the target speech recognition model to obtain an updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal.
  • The local speech recognition model 114 of the user terminal 110 can already recognize one or more keywords, but the target keyword can be recognized only after the update.
  • The local speech recognition model and the target speech recognition model may be, for example, decoding graphs.
  • the decoding graph is a collection of decoding paths of grammatical constraint rules determined by keywords to be recognized. The details of the decoding graph will be described in the section "Generation and Deployment of Speech Recognition Model" below, which will not be described in detail for now.
  • The decoding path of the target speech recognition model for the target keyword is added to the local speech recognition model, so that the updated local speech recognition model can recognize the target keyword.
  • Alternatively, an existing decoding path in the local speech recognition model, for example the decoding path of the keyword that has existed for the longest time, can be replaced with the decoding path of the target speech recognition model for the target keyword.
  • the target speech recognition model may be directly deployed as a local speech recognition model.
  • the local speech recognition model is dedicated to recognize the corresponding target keywords and can be updated later.
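  • The decoding-path update itself happens inside the decoding graph, but the terminal also has to track which keywords its local model currently covers. The bookkeeping sketch below assumes a simple fixed-capacity registry in which the keyword that has existed the longest is replaced first; the class, its capacity, and the "model handle" values are illustrative assumptions.

```python
from collections import OrderedDict

class KeywordRegistry:
    """Tracks which target keywords the local speech recognition model covers."""

    def __init__(self, capacity: int = 5):
        self.capacity = capacity
        self._keywords = OrderedDict()  # keyword -> decoding path / model handle

    def update(self, keyword: str, model_handle: object) -> None:
        if keyword in self._keywords:
            self._keywords.move_to_end(keyword)   # refresh an existing keyword
        elif len(self._keywords) >= self.capacity:
            self._keywords.popitem(last=False)    # evict the longest-existing keyword
        self._keywords[keyword] = model_handle

    def keywords(self):
        return list(self._keywords)

registry = KeywordRegistry(capacity=2)
registry.update("G109", "decode_path_1")
registry.update("CA1831", "decode_path_2")
registry.update("D310", "decode_path_3")   # evicts "G109", the oldest entry
print(registry.keywords())                 # ['CA1831', 'D310']
```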
  • The user terminal 110 determines whether the target condition is satisfied. If the target condition is met, then at block 330 the collected environmental sound is recognized according to the updated speech recognition model and a recognition result is obtained. That is to say, the updated speech recognition model is triggered to monitor broadcast sounds in the external environment only under appropriate conditions. Since the speech recognition model 114 may have existed in the user terminal 110 for some time, it is not necessary to start listening to the ambient sound immediately. Triggering the local speech recognition model only when certain target conditions are met satisfies the user's real listening needs and also saves the computing resources and power of the user terminal.
  • the target condition may be that the user's location matches the location information of the travel information.
  • travel information usually includes location information in addition to target keywords.
  • For example, the travel information may include the following information: "June 2, 2021, 7:45 am, G109, Beijing South to Shanghai Hongqiao"; then "Beijing South Railway Station" will be used as the location information.
  • When the user's location matches the location information, the updated speech recognition model is activated to recognize the collected environmental sound. In this way, when the geographical location condition is satisfied, the updated speech recognition model can be automatically used to recognize the keywords in the ambient sound without user interaction, bringing a better user experience.
  • the target condition may also be that when the current time is within a predetermined time period before the time information, the collected environmental sound is recognized according to the updated speech recognition model.
  • the time information is "7:45 AM on June 2, 2021".
  • When the current time falls within the predetermined time period before that time, the collected environmental sound is recognized using the updated speech recognition model.
  • During this period, airports or stations will typically broadcast the target keywords that the user expects the device to monitor on his or her behalf. In this way, when the time condition is met, the updated speech recognition model can be automatically used to recognize the keywords in the ambient sound without user interaction, bringing a better user experience.
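  • A minimal sketch of how the two trigger conditions could be checked is shown below, combining them with a logical OR (the disclosure allows using them alone or in combination); the simple string comparison for the location, the 2-hour window, and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def should_start_listening(current_location: str,
                           current_time: datetime,
                           travel_location: str,
                           departure_time: datetime,
                           window: timedelta = timedelta(hours=2)) -> bool:
    """Return True when either target condition derived from the travel info is met."""
    location_ok = current_location == travel_location                   # location condition
    time_ok = timedelta(0) <= departure_time - current_time <= window   # time condition
    return location_ok or time_ok

departure = datetime(2021, 6, 2, 7, 45)
print(should_start_listening("Beijing South Railway Station",
                             datetime(2021, 6, 2, 6, 30),
                             "Beijing South Railway Station",
                             departure))   # True: both conditions hold
```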
  • the location information and time information of the user may be provided by the user terminal 110 itself, or may be obtained from other devices, such as another user terminal 120 connected to the user terminal 110 .
  • the execution of the speech recognition model may be triggered by the user terminal 110 itself or other terminals, such as the user terminal 120 (for example, a trigger signal is sent through a Bluetooth connection).
  • the above target conditions for triggering the speech recognition model may be used alone or in combination.
  • the execution of the speech recognition model may also be manually triggered by the user, for example through a manual button.
  • the button can be set on the user terminal 110 as an intelligent listening device, or on another user terminal 120, or in the application of the user terminals 110 and 120 .
  • the speech recognition model 114 of the user terminal 110 is capable of recognizing multiple keywords. In this case, the user may select some or all of them for identification, or automatically select and identify the latest updated target keywords.
  • the collected environmental sound is recognized according to the updated speech recognition model, and a recognition result is obtained.
  • the microphone of the user terminal 110 is turned on to collect the external ambient sound.
  • the collected ambient sound is recognized locally in the user terminal 110 through the speech recognition model.
  • the identification can be done in real time or near real time.
  • the collected environmental sound can be directly input to the speech recognition model, and the speech recognition model can judge whether it is the text of the target keyword, for example, through the decoding path of the decoding map.
  • the collected ambient sound can also be buffered at the user terminal 110 and then read into the speech recognition model, and the buffered sound can last for about 10 seconds, 20 seconds, 30 seconds or more. Over time, if no target keyword is identified, the cached ambient sound may be gradually removed or overwritten.
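  • A tiny sketch of such a rolling buffer is given below, assuming 10 ms frames and a 30-second cap; the frame size and buffer duration are illustrative choices, not values mandated by the disclosure.

```python
from collections import deque

FRAME_MS = 10                         # assumed frame hop
BUFFER_SECONDS = 30                   # keep roughly the last 30 s of audio
MAX_FRAMES = BUFFER_SECONDS * 1000 // FRAME_MS

# oldest frames are discarded automatically once the buffer is full
audio_buffer: deque = deque(maxlen=MAX_FRAMES)

def on_new_frame(frame: bytes) -> None:
    """Append the latest microphone frame; older frames roll off the buffer."""
    audio_buffer.append(frame)

for _ in range(4000):                 # simulate 40 s of incoming frames
    on_new_frame(b"\x00" * 320)       # e.g. 160 samples * 2 bytes at 16 kHz
print(len(audio_buffer))              # 3000 frames, i.e. the last 30 s
```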
  • the initial value of the recognition result can be set to "No".
  • ambient sounds may be input to the speech recognition model frame by frame in time order.
  • The speech recognition model determines whether these speech frames correspond to the target keyword. If they match completely, it is determined that the target keyword is recognized; otherwise it is determined that the target keyword is not recognized, and the monitoring is restarted. For example, in the case that the target keyword is "G109", if the speech in the ambient sound includes "G107", then "G", "1", "0", "7" will be recognized in sequence. Before recognizing "7", the speech recognition model sequentially determines that the ambient sound matches the first part of the target keyword (because "G", "1", "0" are consistent with the target keyword).
  • When the next recognized character "7" does not match the target keyword, the speech recognition model immediately restarts listening and clears the recognized content "G", "1", and "0".
  • the associated cached data may be deleted and monitoring may be restarted.
  • When the target keyword is fully matched, the recognition result may be set as "Yes".
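  • The prefix-matching behaviour described above can be sketched as a small state machine. Feeding it one recognized character at a time is an illustrative simplification of consuming decoded units from the decoding graph; the class and its interface are assumptions made for this sketch.

```python
class KeywordMatcher:
    """Incrementally match recognized units against a target keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword
        self.matched = 0  # number of leading characters matched so far

    def feed(self, unit: str) -> bool:
        if unit == self.keyword[self.matched]:
            self.matched += 1
            if self.matched == len(self.keyword):
                self.matched = 0
                return True          # full match: target keyword recognized
        else:
            # mismatch: clear the partial match and restart listening
            self.matched = 1 if unit == self.keyword[0] else 0
        return False

matcher = KeywordMatcher("G109")
for unit in "G107G109":              # ambient speech contains "G107", then "G109"
    if matcher.feed(unit):
        print("target keyword recognized")   # printed once, after "...G109"
```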
  • the user device 110 prompts the user.
  • the form of the prompt may depend on the capabilities of the user terminal and user configuration.
  • prompts may include, but are not limited to, one or more of text, images, audio and video.
  • the prompt may be playing a specified reminder sound, a specific recording, or playing a voice corresponding to the target keyword.
  • the prompt may be a card pop-up window, a banner display, and the like.
  • the notification may be any one or a combination of some of the above.
  • the user terminal 110 may also provide the prompt to other connected user terminals 120 .
  • the reminder is provided via the Bluetooth communication protocol between the user terminal 110 and the user terminal 120 . In this way, notifications can be presented on user terminals deployed with speech recognition models or on other devices to achieve better notification effects.
  • The above describes the user terminal 110 as an intelligent listening device, but it should be understood that the intelligent listening function can also be implemented in other user terminals (for example, the user terminal 120).
  • For example, the user terminal 120 sends the target keyword to the server 130, receives the speech recognition model from the server 130, and uses the speech recognition model to listen to the voice content in the environment without forwarding the speech recognition model to the user terminal 110.
  • the speech recognition model according to the embodiments of the present disclosure is a lightweight model deployed on user terminals with limited computing resources. Moreover, this speech recognition model is customized by the user and is aimed at a specific target keyword. The process of generating and deploying a speech recognition model according to an embodiment of the present disclosure is further described below with reference to FIG. 4 .
  • The speech recognition model for identifying target keywords is constructed by the server 130 and deployed on either of the user terminals 110 and 120, such as a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, or a notebook computer.
  • the user terminals 110 and 120 can use the speech recognition model to recognize whether a speech containing keywords is played in the surrounding environment, especially in a high-noise environment.
  • FIG. 4 shows a schematic diagram of an example process 400 of building and deploying a speech recognition model according to an embodiment of the disclosure.
  • FIG. 4 shows an example of deploying a speech recognition model on a first user terminal 110 as shown in FIGS. 1 and 2 , wherein the first user terminal 110 is connected to a second user terminal 120 via short-range communication such as Bluetooth.
  • the speech recognition model may be deployed on the second user terminal 120, or on the single terminal if there is only one user terminal, without departing from the scope of the embodiments of the present disclosure.
  • the first user terminal 110 may send its own identification information to the second user terminal 120 when establishing a connection with the second user terminal.
  • the second user terminal 120 may store the identification information locally so as to transmit data to the first user terminal 110 later, such as a target speech recognition model or other information.
  • the second user terminal 120 may obtain 410 a target keyword that the user wants to identify.
  • the target keyword text may be a keyword in the user's travel information, such as the flight number or train number that the user will take.
  • the travel information may include the following information: "June 2, 2021 at 7:45 am, G109, from Beijing south to Shanghai Hongqiao", correspondingly, the target keyword is "G109".
  • The keywords in the travel information can be extracted through a compliance scheme (for example, a specified application authorized by the user, such as a travel application or a short message service), and it is also possible to obtain the target keywords by accessing short messages from a specified sender (for example, airlines or train operators).
  • the target keyword may be obtained automatically without manual input by the user.
  • For example, the target keyword can be extracted from the short messages or messages of the specified sender (for example, the transport operator) by accessing the short messages of the smartphone or the messages of the specified application.
  • the short message or message including flight number or train number may also include departure time information.
  • the keyword text can also be obtained according to such time information. For example, the nearest flight number or train number can be obtained as the target keyword. Alternatively, flight numbers or train numbers within a preset time period (for example, one day) from the current moment can also be used as the keyword text.
  • the second user terminal 120 may request 420 the speech recognition model for the target keyword from the server 130 .
  • the second user terminal 120 may send a request including the target keyword to the server 130 through a cellular network or a wireless local area network (such as WiFi).
  • the request may also include an identifier of the second user terminal 120 (e.g., IMSI, IMEI, or other unique identifier) and current connection information of the second user terminal, including but not limited to Bluetooth connection information (e.g., Bluetooth address, device identification, etc.), wireless local area network connection information (for example, wireless access point address, device identification, etc.), etc. Such information may be used to establish a point-to-point connection between the server 130 and the second user terminal 120 or the first user terminal 110 .
  • the request further includes a request identifier that can uniquely identify the request.
  • The request identifier may be generated by the second user terminal in any suitable manner, for example from one or more of the device identifier of the second user terminal (for example, IMSI, IMEI, or another unique identifier), the connection information of the first user terminal 110 connected to the second user terminal 120, a timestamp, and so on.
  • the request identifier can be used for the server 130 to broadcast the constructed speech recognition model.
  • the second user terminal 120 may locally create and maintain a mapping table.
  • The mapping table stores, in association with one another, the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120, and the generated request identifier.
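  • A sketch of how such a request identifier and mapping table could be maintained is shown below; hashing the device identifier, the headset connection information, and a timestamp together is only one plausible construction and is not mandated by the disclosure.

```python
import hashlib
import time

mapping_table = {}  # request_id -> (device_id, headset connection info)

def make_request_id(device_id: str, headset_addr: str) -> str:
    """Derive a request identifier from the device id, connection info, and a timestamp."""
    raw = f"{device_id}|{headset_addr}|{time.time_ns()}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def register_request(device_id: str, headset_addr: str) -> str:
    """Create a request identifier and record it in the local mapping table."""
    request_id = make_request_id(device_id, headset_addr)
    mapping_table[request_id] = (device_id, headset_addr)
    return request_id

req_id = register_request("imei-0123456789", "BT:AA:BB:CC:DD:EE")
print(req_id, mapping_table[req_id])
```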
  • the server 130 receives the request of the second user terminal 120, and builds 430 a speech recognition model for the target keyword based on the target keyword in the request.
  • the constructed speech recognition model is a lightweight decoding graph
  • the decoding graph is a set of decoding paths determined by grammatical constraint rules determined by target keywords.
  • The server 130 generates a decoding graph based on, for example, an HCLG (HMM + Context + Lexicon + Grammar) decoding graph construction process.
  • The server 130 builds a specific lightweight language model for the keyword, namely the target language model (G.fst), based on grammar rules (for example, JSpeech Grammar Format, "JSGF", grammar rules), n-gram statistical rules, and the like.
  • According to the target keyword, the server 130 only constrains the transition probabilities and connection weights between the words of the target keyword, and ignores the transition probabilities and connection weights between other language units.
  • the target language model is customized as a parameter set that only conforms to the grammatical constraints of the target keyword, so as to ensure the ability to recognize the target keyword. For example, the word combination of the target keyword is determined to have a higher occurrence probability, and the combination occurrence probability of other non-target keywords is set to 0.
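  • One simple way to realize such a keyword-only grammar is to give probability 1 to the transitions along the keyword's own unit sequence and probability 0 (implicitly) to everything else. The JSGF-style grammar string and the assumed spoken form of "G109" below are illustrative only; they are not the server's actual construction algorithm.

```python
# JSGF-style grammar that accepts only the target keyword (illustrative).
jsgf_grammar = """
#JSGF V1.0;
grammar keyword;
public <keyword> = G one zero nine;
"""

def keyword_bigrams(units):
    """Bigram table in which only the keyword's own transitions carry probability mass."""
    table = {}
    for prev, nxt in zip(["<s>"] + units, units + ["</s>"]):
        table[(prev, nxt)] = 1.0      # keyword path kept
    return table                      # every other word pair is implicitly 0

units = ["G", "one", "zero", "nine"]  # assumed spoken form of "G109"
print(keyword_bigrams(units))
```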
  • the server 130 selects, according to the target keyword, a specific pronunciation sequence from the pronunciation dictionary stored in the model library 138, and constructs the target pronunciation dictionary model (L.fst) in conjunction with the phoneme description file in the pronunciation dictionary. Because the pronunciation sequence is retrieved based on the target keyword, the number of retrieved words is greatly reduced compared with the original pronunciation dictionary.
  • the server 130 also obtains an acoustic model, such as an HMM model (H.fst), through offline training.
  • the server 130 combines the target language model, the target pronunciation dictionary model and the acoustic model to obtain a speech recognition model.
  • the speech recognition model uses the original acoustic model, a lightweight target language model constructed according to the target keywords, and a lightweight pronunciation dictionary model retrieved from the target keywords, so the constructed speech recognition model has a lightweight structure.
  • Compared with a generalized speech recognition model, it only includes the transition probabilities and connection weights for the target keyword, so the parameter scale is greatly reduced.
  • the speech recognition model may be a decoding graph as described above.
  • the server 130 composes the target language model (G.fst) and the target pronunciation dictionary model (L.fst) constructed above to generate a merged pronunciation dictionary and language model (LG.fst); LG.fst is then composed with the context model (C.fst) to generate CLG.fst; finally, the result is combined with the HMM model (H.fst) constructed above to generate the decoding graph model (HCLG.fst) as the speech recognition model for the target keyword.
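  • A purely illustrative sketch of the composition order is shown below; the compose() helper is a stand-in that only records the order G → LG → CLG → HCLG, whereas a real system would use weighted FST composition (for example with OpenFst), typically with determinization and minimization between steps.

```python
def compose(a: str, b: str) -> str:
    """Stand-in for FST composition; returns a string showing the order only."""
    return f"({b} o {a})"   # read "b composed with a"

G = "G.fst"      # target language model (keyword-constrained grammar)
L = "L.fst"      # target pronunciation dictionary (keyword pronunciations only)
C = "C.fst"      # context-dependency model
H = "H.fst"      # HMM / acoustic state topology

LG = compose(G, L)
CLG = compose(LG, C)
HCLG = compose(CLG, H)
print(HCLG)      # (H.fst o (C.fst o (L.fst o G.fst)))
```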
  • Embodiments of the present disclosure provide an acoustic model, which is suitable for recognizing non-human voice far-field broadcast speech in an environment with high noise and strong reverberation, and can significantly improve the accuracy of speech recognition.
  • the acoustic model will be described below with reference to FIGS. 5 to 9 .
  • the acoustic model can be trained offline or online.
  • the present disclosure is not intended to limit pronunciation dictionaries, types of target language models, or training procedures.
  • the server 130 transmits 440 the constructed target speech recognition model to the second user terminal 120 .
  • the server 130 may transmit the target speech recognition model in a peer-to-peer manner.
  • the server 130 establishes a point-to-point connection with the second user terminal using a cellular or WiFi communication protocol according to the identifier of the second user terminal 120 included in the request 420, and transmits 440 the target speech recognition model to the second user terminal.
  • the second user terminal 120 determines 450 the first user terminal 110 to deploy the speech recognition model according to the local connection information. Then, the second user terminal 120 transmits 460 the speech recognition model to the first user terminal 110 through the connection with the first user terminal 110 .
  • the server can also transmit the target speech recognition model through broadcasting.
  • the server 130 broadcasts the constructed target speech recognition model and the associated request identification.
  • the second user terminal 120 may compare the broadcast request identifier with the local mapping table to determine whether to receive the speech recognition model. If the request identifier cannot be found in the mapping table, the target speech recognition model is not received. If the request identifier is found, the corresponding target speech recognition model is received.
  • the second user terminal 120 may also determine the connected first user terminal 110 according to the request identifier.
  • the second user terminal 120 can use the request identifier to look up, in the mapping table, the connection information of the first user terminal 110 corresponding to the request identifier, such as the identification information of the first user terminal 110, so as to determine 450 the first user terminal 110 that is to receive the target speech recognition model.
  • the second user terminal 120 sends 460 the target speech recognition model to the determined first user terminal 110 .
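  • A minimal sketch of this broadcast-matching step on the second user terminal might look as follows; the dictionary layout and the send_to_peer callable are illustrative assumptions rather than a prescribed interface.

```python
def handle_broadcast(broadcast: dict, mapping_table: dict, send_to_peer) -> bool:
    """Sketch of the second user terminal's handling of a broadcast model.

    broadcast:     {"request_id": ..., "model": ...} as pushed by the server.
    mapping_table: request_id -> {"peer_bt_addr": ...}, stored when the request was sent.
    send_to_peer:  callable that forwards the model over the stored connection.
    Returns True if the broadcast was accepted and forwarded.
    """
    entry = mapping_table.get(broadcast["request_id"])
    if entry is None:
        return False                      # not our request: ignore the broadcast
    send_to_peer(entry["peer_bt_addr"], broadcast["model"])
    return True
```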
  • the first user terminal 110 may deploy the target speech recognition model or update the local speech recognition model based on it, and, when the target condition is satisfied, start executing 470 the updated speech recognition model to monitor whether the target keyword exists in the environmental sound, as in the process 300 described above with reference to FIG. 3.
  • FIG. 4 describes the process of transmitting the target speech recognition model from the server 130 to the first user terminal 110 via the second user terminal 120 .
  • the first user terminal 110 may have the capability of directly communicating with the server 130 . Therefore, it is also possible to directly transmit the target speech recognition model from the server 130 to the first user terminal 110 .
  • the server 130 can locate the first user terminal 110 by using the information about the first user terminal 110 reported by the second user terminal 120 (for example, Bluetooth connection information, wireless local area network connection information, etc.), and directly transmit the target speech recognition model to the first user terminal 110.
  • the second user terminal 120 may also not transmit the received speech recognition model to the first user terminal 110, but execute the speech recognition model by itself to realize the voice listening function.
  • the speech recognition model for target keywords is used to recognize a broadcast speech in ambient sound of an airport or a train station.
  • identifying such ambient sounds is challenging.
  • the airport broadcast source is usually far away from the user's sound pickup device, so the received signal suffers strong reverberation interference.
  • broadcast sounds are basically synthesized according to fixed templates, which are quite different from standard human voices in Mandarin.
  • there are various noises such as the conversations of other passengers in the lobby, and the environment is extremely complicated. Therefore, it is desirable to provide a solution for accurately identifying broadcast voice content in a complex background noise environment by using a user terminal in a noisy environment.
  • FIG. 5 shows a schematic diagram of a method 500 for generating an acoustic model according to an embodiment of the present disclosure.
  • Method 500 includes, at block 510, collecting sound data in a noisy location.
  • sound data is collected from the noisy environment to generate training data for training and building an acoustic model.
  • Sound collection locations may include, but are not limited to, counter halls, security check passages, waiting halls, convenience stores, dining areas, public restrooms, etc., so as to cover the areas that users can reach (for example, within a location coverage radius R).
  • the way of sound collection can be to turn off the front-end gain of the recording device, and record continuously (for example, for 24 hours), so as to ensure that background noise without broadcast sound can be recorded at various locations.
  • static recording can be adopted, and the sound collection device is fixed and continuously and uninterruptedly recorded.
  • dynamic recording can also be used, where a person or machine moves the acquisition device in a noisy place and records continuously and uninterruptedly.
  • the recording format may be, for example, wav format, 16kHz, 16bit, multi-channel, etc., but is not limited thereto.
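  • As a hedged example, one possible way to capture such recordings (16 kHz, 16-bit, multi-channel, long continuous takes with no front-end gain processing) is sketched below; the libraries, chunk length, and channel count are assumptions rather than the tooling used in the disclosure.

```python
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16_000     # 16 kHz
CHANNELS = 2             # multi-channel pickup
CHUNK_SECONDS = 60 * 10  # write ten-minute chunks so a 24 h session stays manageable

def record_chunk(path: str) -> None:
    frames = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                    samplerate=SAMPLE_RATE, channels=CHANNELS, dtype="int16")
    sd.wait()                                   # block until the chunk is captured
    sf.write(path, frames, SAMPLE_RATE, subtype="PCM_16")

# record_chunk("waiting_hall_chunk_000.wav")   # repeat per location / per chunk
```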
  • Acoustic features can be acquired as described above, or by other means, such as accessing existing noisy speech features or various types of existing noise features, without the need for dedicated on-site acquisition
  • the voice data is preprocessed to obtain target voice data and noise data.
  • the collected original sound data includes broadcast voices in a part of time periods, while other time periods do not include broadcast voices.
  • the preprocessing may include manually or by machine dividing the original sound data into audio data including the target speech content and audio data not including the target speech content, and marking them respectively.
  • the target voice data is marked with the location information from which the data comes and the text of the target voice data, for example including the flight number or train number. For noise data, only the location information of the noise data is marked.
  • acoustic features of the speech data and noise data are extracted.
  • Acoustic features can be extracted by performing framing, windowing, FFT and other processing on the marked speech data and noise data.
  • the acoustic features can be represented by, for example, Mel-frequency cepstral coefficients (MFCC), but are not limited thereto. Each 10 ms of audio is taken as a frame, each frame has a corresponding set of parameters, and each parameter has a value between 0 and 1. That is, both the target speech data and the noise data can be represented as a series of frames lasting for a period of time, where each frame is characterized by a set of parameters with values between 0 and 1.
  • the acoustic features extracted from the target speech data after processing such as framing, windowing, and FFT are noisy features.
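  • A minimal sketch of this feature-extraction step (framing, windowing, FFT, and MFCC computation) is given below; the 25 ms window, 10 ms hop, 40 coefficients, and per-utterance min-max scaling into [0, 1] are assumptions consistent with the description above, not a prescribed pipeline.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16_000, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)     # 25 ms window, 10 ms hop
    mfcc = mfcc.T                                              # shape: (frames, n_mfcc)
    lo, hi = mfcc.min(), mfcc.max()
    return (mfcc - lo) / (hi - lo + 1e-8)                      # values in [0, 1]
```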
  • the noisy features can be enhanced to obtain speech acoustic features that are as clean as possible, thereby reducing the adverse effect of noise on recognition.
  • FIG. 6 shows a schematic flowchart of a method 600 for enhancing speech acoustic features according to an embodiment of the present disclosure.
  • LASSO transform is performed on the input noisy speech acoustic features to perform reverberation suppression on the acoustic features.
  • Reverberation refers to the phenomenon that, when sound waves propagating indoors are repeatedly reflected and absorbed by obstacles such as walls, ceilings, and floors, the sound persists for some time after the source stops sounding. Reverberation is not conducive to accurate recognition of the content in speech.
  • LASSO transformation is also known as LASSO regression.
  • By constraining the correlation between the important variables in the acoustic features (that is, variables whose coefficients are not 0) and the other variables, LASSO regression can remove the acoustic feature components related to reverberation, thereby suppressing its adverse effects.
  • bottleneck network processing is performed on the acoustic features of the reverberation-suppressed speech data.
  • a bottleneck network is a neural network model that includes a bottleneck layer.
  • the bottleneck layer has fewer nodes than the preceding layers and can be used to obtain a lower-dimensional representation of the input.
  • the dimensionality of the acoustic features processed by the bottleneck network is reduced, which helps achieve a lower loss during training.
  • the coefficients of the bottleneck network can be precomputed or updated during training.
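  • The two enhancement stages can be sketched as follows; this is an interpretation under stated assumptions, in which LASSO regression models each frame as a sparse linear function of the preceding frames (a rough stand-in for the reverberant tail) and a small bottleneck network compresses the result. Layer sizes, the context length, and the regularization strength are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Lasso

def lasso_dereverb(feats: np.ndarray, context: int = 5, alpha: float = 0.01) -> np.ndarray:
    """feats: (frames, dims) noisy acoustic features, values in [0, 1]."""
    out = feats.copy()
    for d in range(feats.shape[1]):
        # Each frame regressed on its `context` preceding frames in the same dimension.
        X = np.stack([np.roll(feats[:, d], k) for k in range(1, context + 1)], axis=1)
        X[:context] = 0.0
        model = Lasso(alpha=alpha).fit(X, feats[:, d])
        out[:, d] = feats[:, d] - model.predict(X)      # remove the predicted reverberant tail
    return out

class Bottleneck(nn.Module):
    """Small network with a narrow hidden (bottleneck) layer for dimensionality reduction."""
    def __init__(self, dims: int = 40, bottleneck: int = 16):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dims, 64), nn.ReLU(),
                                    nn.Linear(64, bottleneck), nn.ReLU())
        self.decode = nn.Linear(bottleneck, dims)       # trained to reconstruct clean frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```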
  • through the speech enhancement 600 shown in FIG. 6, speech acoustic features with background noise are transformed into speech features that are as clean as possible. These clean speech features can then be fused with noise features from multiple locations to generate fused features.
  • fusion features are generated from the speech acoustic features and the noise acoustic features. Fusion features can reduce the impact, on recognition accuracy, of differences in the type and level of background noise across different places or across different positions within the same place. According to an embodiment of the present disclosure, fusion features are generated by aligning speech features and noise features frame by frame.
  • FIG. 7 shows a schematic conceptual diagram of a method 700 for generating fusion features according to an embodiment of the present disclosure.
  • the target speech data divided from the original data undergoes feature extraction 710 and speech enhancement 720 to generate enhanced speech features.
  • the noise data is uniformly sampled to obtain sampling noise at multiple positions (for example, position 1 to position N).
  • feature extraction 710 is performed on these sampled noises from multiple locations to generate noise features.
  • Feature extraction 710 may be performed as described with reference to block 530, including framing, windowing, FFT, and the like.
  • the acoustic features of the speech data and the acoustic features of the noise data may have the same frame size, for example, both are 10 ms, so that they can be fused frame by frame.
  • the enhanced speech acoustic features and noise features have the same frame size, say 10ms, so the frame-by-frame alignment of speech features and noise features can produce temporally aligned fused features.
  • all sampled noise features (for example, noise features from positions 1 to N) can be superimposed on the enhanced speech features frame by frame to form fusion features.
  • each frame is characterized by a set of parameters with values between 0 and 1, namely vectors, and superposition refers to adding the corresponding parameters of speech acoustic features and noise features through vector addition.
  • a frame in the fusion feature is also represented by a corresponding 40-dimensional vector.
  • the value of the superimposed parameter may exceed the range of 0 to 1.
  • a global normalization process can be performed so that the value of the parameter of the fusion feature is still in the range of 0 to 1.
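  • A minimal sketch of the frame-by-frame superposition and global renormalization might look as follows (assuming the noise features have already been aligned to the speech features, as described next); the helper name and epsilon are illustrative.

```python
import numpy as np

def fuse(speech: np.ndarray, noises: list[np.ndarray]) -> np.ndarray:
    """speech: (L, D) enhanced speech features; noises: list of (L, D) aligned noise features."""
    fused = speech + np.sum(np.stack(noises, axis=0), axis=0)   # per-frame vector addition
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)                      # global normalization back into [0, 1]
```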
  • the duration of speech data may be different from that of noise data, and the duration of noise data may also be different for each location. Therefore, feature fusion also includes the alignment of speech data and noise data.
  • FIG. 8 shows a schematic diagram of a feature fusion process 800 according to an embodiment of the present disclosure.
  • Enhanced speech acoustic features 810 for feature fusion and noise features 820-1, 820-2, ... 820-N (collectively 820) from multiple locations in FIG. 8 are shown in a sequence of frames.
  • the enhanced speech acoustic features 810 in FIG. 8 include L frames. Since the duration of speech acoustic feature 810 and noise acoustic feature 820 may be different, noise feature 820 may include the same or a different number of frames than L.
  • noise signature 820-N may include, for example, R frames.
  • the noise acoustic feature 820 can be adjusted according to the duration of the speech acoustic feature 810, for example by selecting a subset of its frames or by extending its frames, so as to obtain an adjusted noise acoustic feature whose number of frames (or duration) is the same as that of the speech acoustic feature. After the two are aligned, the speech acoustic feature and the adjusted noise acoustic feature are superimposed.
  • the speech acoustic feature 810 and the noise acoustic feature 820 are superimposed frame by frame.
  • if the number of frames of the enhanced speech acoustic feature 810 is less than the number of frames of the noise acoustic feature 820 (L < R), the first L frames of the noise acoustic feature 820 can be selected to be superimposed with the enhanced speech acoustic feature, and the remaining R-L frames are discarded. It should be understood that the last L frames of the noise acoustic feature 820, the middle L frames, or L frames selected in any other way may instead be superimposed on the speech acoustic feature 810.
  • if the number of frames of the enhanced speech acoustic feature 810 is greater than the number of frames of the noise acoustic feature 820 (L > R), the frames of the noise acoustic feature 820 are reused for the remaining L-R speech frames: the first frame of the noise acoustic feature 820 is superimposed again on the (R+1)-th frame of the enhanced speech acoustic feature, the second frame on the (R+2)-th frame, and so on, until all frames of the speech acoustic feature 810 have been superimposed with frames of the noise feature 820.
  • the frame number R of the noise feature 820-N is smaller than the frame number of the speech acoustic feature, so its first frame is again superimposed on the corresponding frame of the speech acoustic feature.
  • FIG. 8 is only schematic, and the frame numbers of speech acoustic features and noise features are not necessarily the same as those shown in FIG. 8 .
  • the first frame of the enhanced speech acoustic feature 810 is superimposed with the first frames of the noise features 820-1, 820-2, ... 820-N to obtain the first frame of the fusion feature 830; the second frame is superimposed with the second frames of the noise features to obtain the second frame of the fusion feature 830; and so on, until the fusion feature 830 with L frames is generated.
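  • The frame-count alignment described above can be sketched as follows; truncation (L < R) and cycling (L > R) are handled in one step here, which is an implementation convenience rather than the disclosure's exact procedure.

```python
import numpy as np

def align_noise(noise: np.ndarray, n_speech_frames: int) -> np.ndarray:
    """noise: (R, D) noise feature; returns an (L, D) feature aligned to the speech frames."""
    reps = int(np.ceil(n_speech_frames / noise.shape[0]))
    return np.tile(noise, (reps, 1))[:n_speech_frames]   # cycle if too short, truncate if too long
```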
  • Fused features 830 are used to train the acoustic model.
  • the target speech data and noise data obtained in block 520 may be superimposed to obtain superimposed audio data; then the fused acoustic features may be obtained based on the superimposed audio data.
  • the superposition of target speech data and noise data can also be performed based on frame number alignment, and the extraction of fused acoustic features can be similarly performed.
  • the acoustic model is trained using the fused features and the text of the speech data.
  • the acoustic model may be based on a deep neural network (DNN) architecture.
  • the text of the voice data is the text marked in step 520, for example, including flight number or train number.
  • the fusion feature is the input of the acoustic model, and the text or the phoneme corresponding to the text is the labeled data corresponding to the fusion feature.
  • the acoustic model uses a multi-task architecture, including a sound source recognition task using sound source labels and a speech recognition task using speech labels.
  • FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
  • the architecture 900 includes a deep neural network 910, and the deep neural network 910 may include a plurality of hidden layers 912, 914, 916 and input and output layers (not shown). The deep neural network 910 may also include more or fewer hidden layers.
  • multi-task training can be performed on the deep neural network 910; specifically, the training targets of the deep neural network 910 are modified so that, in addition to the speech recognition labels, voiceprint (sound source) recognition labels are added as a further training target.
  • the output from the last hidden layer 916 of the deep neural network 910 can be obtained as the sound source feature.
  • the fusion feature and the sound source feature are spliced together as the input of the deep neural network 910.
  • the Y-dimensional sound source feature can be spliced with the X-dimensional fusion feature to form an X+Y-dimensional training feature, which can be used as the input of the deep neural network.
  • each round of iteration updates the input features with the sound source features generated in the previous round until the final training ends.
  • the sound source features input in the first iteration may be all set to 0.
  • the sound source features of broadcast voices can be extracted from the deep neural network as compensation for acoustic model learning, thereby more accurately picking up non-human voice broadcast voices.
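  • A hedged PyTorch sketch of such a multi-task acoustic model is given below: fused features (X dimensions) are concatenated with sound source features (Y dimensions, zeros in the first round and taken from the last hidden layer thereafter), and a phoneme head and a sound source head are trained jointly. All layer sizes, label counts, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, x_dim=40, y_dim=32, n_phones=200, n_sources=10):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(x_dim + y_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, y_dim), nn.ReLU(),        # last hidden layer: sound-source feature
        )
        self.phone_head = nn.Linear(y_dim, n_phones)    # speech recognition task
        self.source_head = nn.Linear(y_dim, n_sources)  # sound-source recognition task

    def forward(self, fused, source_feat):
        h = self.hidden(torch.cat([fused, source_feat], dim=-1))
        return self.phone_head(h), self.source_head(h), h.detach()  # h feeds the next round

# One illustrative training step (source features start as zeros, refreshed each round):
model = MultiTaskAcousticModel()
fused = torch.rand(8, 40)                    # batch of fused acoustic features
source = torch.zeros(8, 32)                  # first-iteration sound-source features
phone_y = torch.randint(0, 200, (8,))
source_y = torch.randint(0, 10, (8,))
phone_logits, source_logits, source = model(fused, source)
loss = nn.functional.cross_entropy(phone_logits, phone_y) + \
       0.3 * nn.functional.cross_entropy(source_logits, source_y)
loss.backward()
```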
  • a speech recognition model is constructed based on the acoustic model, pronunciation dictionary and language model.
  • the process of constructing a speech recognition model may include receiving a target keyword from a user terminal, generating a target language model and a target pronunciation dictionary model for the target keyword, and constructing the speech recognition model by merging the target language model, the target pronunciation dictionary model, and the acoustic model; for details, reference may be made to the above description of FIG. 4.
  • the offline trained acoustic model may be stored in the model library of the server.
  • the acoustic model and other models in the model library can be used to construct a speech recognition model for recognizing the target keyword.
  • This speech recognition model dedicated to specific keywords is lightweight and suitable for deployment to user equipment or smart listening devices.
  • Fig. 10 shows a schematic block diagram of an apparatus 1000 for voice listening substitution according to an embodiment of the present disclosure.
  • the device 1000 may be applied to a user terminal, such as the first user terminal 110 or the second user terminal 120.
  • the apparatus 1000 includes a model acquiring unit 1010, configured to acquire a target speech recognition model corresponding to a target keyword.
  • the target speech recognition model is constructed based on the target keywords, which are obtained based on the user's travel information.
  • the device 1000 also includes an updating unit 1020 .
  • the update unit is used to update the local speech recognition model according to the target speech recognition model to obtain the updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal.
  • the device 1000 also includes a sound recognition unit 1020.
  • the sound recognition unit 1020 is used to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met, and obtain a recognition result.
  • the environmental sound is the sound information collected in the environment where the user terminal is located.
  • the device 1000 also includes a prompt unit 1030 .
  • the prompting unit 1030 is configured to prompt the user when the recognition result indicates that there is a voice corresponding to the target keyword in the ambient sound.
  • the device 1000 further includes a target keyword acquiring unit.
  • the target keyword acquisition unit is used to acquire target keywords in the user's travel information.
  • the device 1000 also includes a sending unit.
  • the sending unit is used to send the target keywords in the travel information to the server, so that the server can construct a target speech recognition model according to the target keywords.
  • the model obtaining unit 1010 is also used to obtain the target speech recognition model from the server.
  • the user terminal is a first user terminal and is connected to a second user terminal
  • the method further includes: sending identification information to the second user terminal, the identification information being used to identify the first user terminal;
  • the acquiring of the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, where the target speech recognition model is obtained by the second user terminal from the server according to the target keyword; and the first user terminal is an audio playback device.
  • the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model.
  • a decoding graph is a collection of decoding paths of grammatically constrained rules determined by target keywords.
  • the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword.
  • the acoustic model is generated by: generating fusion acoustic features based on target speech data and noise data, where the target speech data is audio data including the target speech content and the noise data is audio data not including the target speech content; and training with the fusion features and the text information of the speech data to generate the acoustic model.
  • the travel information has associated location information
  • the sound recognition unit 1020 is further configured to recognize the collected environmental sound based on the updated speech recognition model when the user's location matches the location information of the travel information.
  • the travel information also has associated time information
  • the sound recognition unit 1020 is also used to recognize the collected environmental sound according to the updated speech recognition model when the current time is within a predetermined period of time before the time information .
  • the prompting unit 1030 is further configured to play the voice corresponding to the target keyword on the user terminal.
  • the target keyword is train number or flight number.
  • the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a notebook computer.
  • Fig. 11 shows a schematic block diagram of an apparatus 1100 for generating a speech recognition model according to an embodiment of the present disclosure.
  • Apparatus 1100 may be used in server 130, for example.
  • the device 1100 includes a fusion unit 1110 , a training unit 1120 and a speech recognition model construction unit 1130 .
  • the fusion unit 1110 is used to generate fusion acoustic features based on the target speech data and noise data.
  • the target speech data is audio data including the target speech content, and the noise data is audio data not including the target speech content.
  • the training unit 1120 is used for generating an acoustic model by performing training through the fusion feature and the text information of the speech data.
  • the speech recognition model construction unit 1130 is configured to construct the speech recognition model according to the acoustic model, pronunciation dictionary and language model.
  • the fusion unit 1110 is further configured to superimpose the target speech data and noise data to obtain superimposed audio data; and obtain fused acoustic features based on the superimposed audio data.
  • the fusion unit 1110 is also used to obtain the first acoustic feature based on the target speech data, and obtain the second acoustic feature based on the noise data; obtain the fusion acoustic feature based on the first acoustic feature and the second acoustic feature.
  • the fusion unit 1110 is further configured to generate noisy acoustic features from the target speech data; and generate first acoustic features by enhancing the noisy acoustic features.
  • the fusion unit 1110 is further configured to perform LASSO transformation on the acoustic features with noise, and perform bottleneck network processing on the acoustic features transformed by LASSO, so as to obtain the first acoustic features.
  • the fusion unit 1110 is further configured to superimpose the first acoustic feature and the second acoustic feature to obtain superimposed acoustic features; and generate a fusion acoustic feature by normalizing the superimposed acoustic features.
  • the fusion unit 1110 is also used to: obtain the number of frames of the first acoustic feature, the number of frames being determined according to the duration of the target voice data; construct a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimpose the first acoustic feature and the third acoustic feature to obtain the fusion acoustic feature.
  • the acoustic model is a neural network model
  • the training unit 1120 is used to extract sound source features from the hidden layer of the acoustic model; and use the sound source features and fusion acoustic features as the input features of the acoustic model to train the acoustic Model.
  • the speech recognition model construction unit 1130 is also used to: receive the target keyword from the user terminal; obtain the target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtain a target language model from the language model according to the target keyword; and construct the speech recognition model by combining the acoustic model, the target pronunciation dictionary model, and the target language model.
  • Fig. 12 shows an apparatus 1200 for voice listening substitution according to another embodiment of the present disclosure.
  • the apparatus 1200 can be applied to the server 130 .
  • the device 1200 includes a target keyword acquisition unit 1210 , a speech recognition model construction unit 1220 and a sending unit 1230 .
  • the target keyword acquisition unit 1210 is configured to acquire target keywords related to the user's travel mode in the user's travel information.
  • the speech recognition model construction unit 1220 is configured to construct a target speech recognition model corresponding to the target keyword.
  • the sending unit 1230 is configured to send the speech recognition model to the user terminal, and the speech recognition model is used to recognize the environmental sound at the user terminal when the target condition is met, so as to determine whether the environmental sound contains the target keyword.
  • FIG. 13 shows a schematic block diagram of an example device 1300 that may be used to implement embodiments of the present disclosure.
  • the device 1300 includes a central processing unit (CPU) 1301, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 1302 or loaded from a storage unit 1308 into a random-access memory (RAM) 1303.
  • in the RAM 1303, various programs and data necessary for the operation of the device 1300 can also be stored.
  • the CPU 1301, ROM 1302, and RAM 1303 are connected to each other through a bus 1304.
  • An input/output (I/O) interface 1305 is also connected to the bus 1304 .
  • a plurality of components in the device 1300 are connected to the I/O interface 1305, including: an input unit 1306, such as a keyboard, a mouse, etc.; an output unit 1307, such as various types of displays, speakers, etc.; a storage unit 1308, such as a magnetic disk, an optical disk, etc.; and a communication unit 1309, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the various procedures and processes described above can be executed by the central processing unit 1301.
  • the various procedures and processes described above may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1308 .
  • part or all of the computer program may be loaded and/or installed on the device 1300 via the ROM 1302 and/or the communication unit 1309.
  • when the computer program is loaded into the RAM 1303 and executed by the CPU 1301, one or more actions of the procedures and processes described above may be performed.
  • the present disclosure may be a method, apparatus, system and/or computer program product.
  • a computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for listening to a speech by using a device instead of ears. The method is applied to a user terminal (110, 120) and comprises: obtaining a speech recognition model corresponding to a target keyword, the speech recognition model being constructed according to the target keyword, and the target keyword being obtained according to travel information of a user (310); updating a local speech recognition model according to a target speech recognition model to obtain an updated speech recognition model, the local speech recognition model being a speech recognition model stored in a user terminal (320); when a target condition is satisfied (330), recognizing collected ambient sound according to the updated speech recognition model to obtain a recognition result (340), the ambient sound being sound information collected in an environment where the user terminal is located; and when the recognition result indicates that the target keyword exists in the ambient sound (350), prompting the user (360). According to the method for listening to a speech by using a device instead of ears, when the user cannot clearly hear the ambient sound, the user is helped to recognize the target keyword in the ambient sound, thereby implementing personal and intelligent listening by using a device instead of ears.

Description

用于语音代听和生成语音识别模型的方法、装置、电子设备和介质Method, device, electronic device and medium for voice listening and generating voice recognition model 技术领域technical field
本发明涉及人工智能技术领域,具体涉及用于语音代听和生成语音识别模型的方法、装置、电子设备和介质。The present invention relates to the technical field of artificial intelligence, in particular to methods, devices, electronic equipment and media for voice listening and generating voice recognition models.
背景技术Background technique
随着近年来深度学习技术和大规模集成电路、数字电路、信号处理、微电子技术的飞速发展,各类搭载语音识别技术的消费电子产品越来越普及。通过语言识别技术,电子产品可以接收语音指令,通过识别语音指令来执行用户想要的操作。With the rapid development of deep learning technology and large-scale integrated circuits, digital circuits, signal processing, and microelectronics technology in recent years, various consumer electronics products equipped with speech recognition technology are becoming more and more popular. Through language recognition technology, electronic products can receive voice commands, and perform operations desired by users by recognizing voice commands.
遗憾的是,现有的电子产品大多是识别厂商提供的语音命令,针对用户的个性化关键词的语音识别较为困难,而且通常需要人工输入待识别的个性化关键词,经机器掌握后才具备识别该关键词的能力。该方案依赖人工主动输入,在使用便利性上有所不足,而且需要较多计算资源。此外,现有语音识别技术在噪声环境下性能较差,例如,当希望在机场、火车站等高噪声、强混响环境下识别广播中的列车车次或航班号时,难以获取令人满意的效果。Unfortunately, most of the existing electronic products recognize the voice commands provided by the manufacturer. It is difficult to recognize the personalized keywords for users, and usually requires manual input of the personalized keywords to be recognized. The ability to identify the keyword. This solution relies on manual active input, which is not convenient for use, and requires more computing resources. In addition, the performance of existing speech recognition technology is poor in noisy environments. For example, when it is desired to recognize train numbers or flight numbers in broadcasts in airports, railway stations and other high noise and strong reverberation environments, it is difficult to obtain satisfactory results. Effect.
发明内容Contents of the invention
本公开的实施例提供了用于语音识别的方案,其实现了针对个性化关键词的语音代听。Embodiments of the present disclosure provide a solution for voice recognition, which realizes voice listening for personalized keywords.
根据本公开的第一方面,提供了一种用于语音代听的方法,所述方法应用于用户终端,包括:获取目标关键词对应的目标语音识别模型,所述目标语音识别模型为根据所述目标关键词构建,所述目标关键词为根据用户的出行信息获取;根据所述目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,所述本地语音识别模型为所述用户终端中存储的语音识别模型;当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,所述环境声音为在所述用户终端所处的环境中采集到的声音信息;以及当所述识别结果指示所述环境声音中存在所述目标关键词时,对所述用户进行提示。以此方式,能够在环境声音中检测出行信息的目标关键词,并检测到环境声音包括目标关键词语音时提醒用户,从而实现设备代替人耳的智能代听功能。According to the first aspect of the present disclosure, a method for voice listening substitution is provided, the method is applied to a user terminal, and includes: acquiring a target voice recognition model corresponding to a target keyword, and the target voice recognition model is based on the The target keyword is constructed, the target keyword is obtained according to the travel information of the user; the local speech recognition model is updated according to the target speech recognition model, and the updated speech recognition model is obtained, and the local speech recognition model is the The speech recognition model stored in the user terminal; when the target condition is satisfied, the collected environmental sound is recognized according to the updated speech recognition model, and the recognition result is obtained. sound information collected in the surrounding environment; and when the recognition result indicates that the target keyword exists in the environmental sound, prompting the user. In this way, it is possible to detect the target keywords of the travel information in the ambient sound, and remind the user when the ambient sound includes the voice of the target keyword, so as to realize the intelligent listening function of the device instead of the human ear.
在一些实施例中,获取目标关键词对应的目标语音识别模型包括:获取所述用户的出行信息;根据所述出行信息提取用户出行方式相关的目标关键词;向服务器发送所述目标关键词,以用于所述服务器根据所述目标关键词构建所述目标语音识别模型;以及从所述服务器获取所述目标语音识别模型。以此方式,能够在不需要用户交互的情况下,生成和部署针对个性化关键词的目标语音识别模型。In some embodiments, obtaining the target speech recognition model corresponding to the target keyword includes: obtaining the travel information of the user; extracting the target keyword related to the travel mode of the user according to the travel information; sending the target keyword to the server, for the server to construct the target speech recognition model according to the target keyword; and obtain the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
在一些实施例中,所述用户终端是第一用户终端并且连接到第二用户终端,该方法还包括向所述第二用户终端发送标识信息,所述标识信息用于标识所述第一用户终端。所述获取目标关键词对应的目标语音识别模型,具体为:基于所述标识信息从所述第二用户终端接收所述目标语音识别模型,所述目标语音识别模型为所述第二用户终端根据所述目标关键词从所述服务器获取;其中所述第一用户终端是音频播放设备。以此方式,能够在用户使用音频 播放设备(例如耳机)的情况下实现智能代听。In some embodiments, the user terminal is a first user terminal and is connected to a second user terminal, and the method further includes sending identification information to the second user terminal, the identification information being used to identify the first user terminal terminal. The acquiring the target speech recognition model corresponding to the target keyword specifically includes: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being the second user terminal according to The target keyword is obtained from the server; wherein the first user terminal is an audio playback device. In this way, intelligent proxy listening can be realized when the user uses an audio playback device (such as a headset).
在一些实施例中,所述目标语音识别模型是基于声学模型、目标发音字典和目标语言模型而生成的解码图,所述解码图是由所述目标关键词确定的语法约束规则的解码路径集合,所述目标发音字典模型是基于所述目标关键词的发音序列而获取的,并且所述目标语言模型是基于所述目标关键词的字之间的关系而获取的。以此方式,能够生产轻量化的目标语音识别模型,以便于部署到具有较少计算资源的用户终端上。In some embodiments, the target speech recognition model is a decoding map generated based on an acoustic model, a target pronunciation dictionary, and a target language model, and the decoding map is a set of decoding paths of grammatically constrained rules determined by the target keyword , the target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword. In this way, a lightweight target speech recognition model can be produced for easy deployment on user terminals with less computing resources.
在一些实施例中,所述声学模型通过融合特征和目标语音数据的文本信息进行训练而生成,所述融合特征基于目标语音数据和噪声数据而生成,所述目标语音数据为包括目标语音内容的音频数据,所述噪声数据为不包括所述目标语音内容的音频数据。以此方式,所生成的目标语音识别模型更能够精确地识别高噪声、强混响环境中的语音,从而实现了个性化智能代听。In some embodiments, the acoustic model is generated by training fusion features and text information of target speech data, the fusion features are generated based on target speech data and noise data, and the target speech data is text information including target speech content audio data, the noise data is audio data that does not include the target speech content. In this way, the generated target speech recognition model can more accurately recognize speech in a high-noise and strong-reverberation environment, thereby realizing personalized intelligent listening substitution.
在一些实施例中,通过融合特征和目标语音数据的文本信息进行训练,目标语音数据的文本信息可以是直接的文本内容也可以是对应于文本内容的其他标注数据,例如音素序列。In some embodiments, training is performed by fusing features and text information of the target speech data. The text information of the target speech data can be direct text content or other labeled data corresponding to the text content, such as phoneme sequences.
在一些实施例中,该方法还包括:根据所述出行信息获取与用户出行方式关联的位置信息;其中当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别包括:当所述用户的位置与所述位置信息匹配时,根据所述更新后的语音识别模型对采集到的环境声音进行识别。以此方式,在满足地理位置的目标条件时,自动使用更新后的语音识别模型来判断环境声音中是否包含关键词,而不需要用户交互,带来更好的使用体验。In some embodiments, the method further includes: acquiring location information associated with the user's travel mode according to the travel information; wherein when the target condition is met, the collected environmental sound is recognized according to the updated speech recognition model The method includes: when the location of the user matches the location information, recognizing the collected environmental sound according to the updated speech recognition model. In this way, when the target condition of the geographic location is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
在一些实施例中,该方法还包括:根据所述出行信息获取与用户出行方式关联的时间信息,其中当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别包括:当满足时间条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,所述时间条件为根据所述时间信息确定。在一些实施例中,时间条件可以是当前时间在所述时间信息之前的预定时间段内。以此方式,在满足时间的目标条件时,自动使用更新后的语音识别模型来判断环境声音中是否包含关键词,而不需要用户交互,带来更好的使用体验。In some embodiments, the method further includes: acquiring time information associated with the user's travel mode according to the travel information, wherein when the target condition is met, the collected environmental sound is recognized according to the updated speech recognition model The method includes: when a time condition is satisfied, the collected environmental sound is recognized according to the updated speech recognition model, and the time condition is determined according to the time information. In some embodiments, the time condition may be that the current time is within a predetermined time period before the time information. In this way, when the target condition of the time is met, the updated speech recognition model is automatically used to determine whether the ambient sound contains keywords without user interaction, bringing a better user experience.
在一些实施例中,对所述用户进行提示包括在所述用户终端上播放与所述目标关键词对应的语音。以此方式,用户能够针对感兴趣的个性化关键词收听到对应的提示。In some embodiments, prompting the user includes playing a voice corresponding to the target keyword on the user terminal. In this way, the user can hear corresponding prompts for personalized keywords of interest.
在一些实施例中,所述目标关键词是列车车次或航班号。In some embodiments, the target keyword is train number or flight number.
在一些实施例中,所述用户终端是智能手机、智能家电、可穿戴设备、音频播放设备、平板电脑和笔记本电脑之一。In some embodiments, the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a notebook computer.
根据本公开的第二方面,提供了一种用于代听的装置,包括:模型获取单元,用于获取目标关键词对应的目标语音识别模型,所述目标语音识别模型为根据所述目标关键词构建,所述目标关键词为根据用户的出行信息获取;更新单元,用于根据所述目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,所述本地语音识别模型为所述用户终端中存储的语音识别模型;声音识别单元,用于当满足目标条件时,根据所述更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,所述环境声音为在所述用户终端所处的环境中采集到的声音信息;以及提示单元,用于当所述识别结果指示所述环境声音中存在所述目标关键词时,对所述用户进行提示。以此方式,能够在环境声音中检测出行信息的目标关键词,并检测到环境声音包括目标关键词语音时提醒用户,从而实现设备代替人耳的智能代听功能。According to a second aspect of the present disclosure, there is provided a device for listening on behalf of others, including: a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, and the target speech recognition model is based on the target key Word construction, the target keyword is obtained according to the travel information of the user; the update unit is used to update the local speech recognition model according to the target speech recognition model, and obtain the updated speech recognition model, and the local speech recognition model It is the speech recognition model stored in the user terminal; the sound recognition unit is used to recognize the collected environmental sound according to the updated speech recognition model when the target condition is met, and obtain the recognition result, and the environmental sound is the sound information collected in the environment where the user terminal is located; and a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword exists in the environmental sound. In this way, it is possible to detect the target keywords of the travel information in the ambient sound, and remind the user when the ambient sound includes the voice of the target keyword, so as to realize the intelligent listening function of the device instead of the human ear.
在一些实施例中,所述装置还包括:目标关键词获取单元,用于获取所述用户的所述出行信息;目标关键词获取单元,用于根据所述出行信息提取用户出行方式相关的目标关键词;和发送单元,用于向服务器发送所述出行信息中的所述目标关键词,以用于所述服务器根据所述目标关键词构建。模型获取单元还用于从所述服务器获取所述目标语音识别模型。以此方式,能够在不需要用户交互的情况下,生成和部署针对个性化关键词的目标语音识别模型。In some embodiments, the device further includes: a target keyword acquisition unit, configured to acquire the travel information of the user; a target keyword acquisition unit, configured to extract a target related to the travel mode of the user according to the travel information keyword; and a sending unit, configured to send the target keyword in the travel information to a server, so that the server can construct it according to the target keyword. The model acquiring unit is further configured to acquire the target speech recognition model from the server. In this way, targeted speech recognition models for personalized keywords can be generated and deployed without user interaction.
In some embodiments, the user terminal is a first user terminal connected to a second user terminal. The apparatus further includes a sending unit configured to send identification information to the second user terminal, the identification information identifying the first user terminal. The model obtaining unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model having been obtained by the second user terminal from the server according to the target keyword. The first user terminal is an audio playback device. In this way, intelligent listening substitution can be realized while the user is using an audio playback device such as earphones.
In some embodiments, the target speech recognition model is a decoding graph generated from an acoustic model, a target pronunciation dictionary and a target language model. The decoding graph is a set of decoding paths for the grammar constraint rules determined by the target keyword; the target pronunciation dictionary model is obtained from the pronunciation sequence of the target keyword, and the target language model is obtained from the relationships between the characters of the target keyword. In this way, a lightweight speech recognition model can be generated, which is easy to deploy on user terminals with limited computing resources.
In some embodiments, the acoustic model is generated by training on fused features together with the text information of target speech data. The fused features are generated from the target speech data and noise data, where the target speech data is audio data that contains the target speech content and the noise data is audio data that does not contain the target speech content. In this way, the generated target speech recognition model can recognize speech more accurately in high-noise, strongly reverberant environments, thereby realizing personalized intelligent listening substitution.
In some embodiments, the apparatus further includes a travel location information obtaining unit configured to obtain, from the travel information, location information associated with the user's travel mode. The sound recognition unit is further configured to recognize the collected ambient sound according to the updated speech recognition model when the user's location matches the location information. In this way, when the geographic target condition is met, the updated speech recognition model is used automatically to determine whether the ambient sound contains the keyword, without user interaction, which provides a better user experience.
In some embodiments, the apparatus further includes a travel time information obtaining unit configured to obtain, from the travel information, time information associated with the user's travel mode. The sound recognition unit is further configured to recognize the collected ambient sound according to the updated speech recognition model when a time condition is satisfied, the time condition being determined from the time information. In some embodiments, the time condition may be that the current time is within a predetermined period before the time information. In this way, when the time-related target condition is met, the updated speech recognition model is used automatically to determine whether the ambient sound contains the keyword, without user interaction, which provides a better user experience.
In some embodiments, the prompting unit is further configured to play, on the user terminal, a voice corresponding to the target keyword. In this way, the user can hear a corresponding prompt for the personalized keyword of interest.
In some embodiments, the target keyword is a train number or a flight number.
In some embodiments, the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a laptop computer.
According to a third aspect of the present disclosure, a method for generating a speech recognition model is provided, including: generating fused acoustic features based on target speech data and noise data, the target speech data being audio data that contains the target speech content and the noise data being audio data that does not contain the target speech content; training on the fused features and the text information of the speech data to generate the acoustic model; and constructing the speech recognition model from the acoustic model, a pronunciation dictionary and a language model. In this way, an acoustic model trained with fused features can accurately recognize speech in high-noise, strongly reverberant environments, thereby realizing personalized intelligent listening substitution.
In some embodiments, generating the fused acoustic features includes: superimposing the target speech data and the noise data to obtain superimposed audio data; and obtaining the fused acoustic features from the superimposed audio data.
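By way of illustration only, the following sketch shows one way such superposition could be carried out: the noise waveform is mixed into the target speech waveform at a chosen signal-to-noise ratio, and simple log-spectral features are computed from the mixture. The sample rate, frame sizes and the use of log power spectra are assumptions made for the example and are not prescribed by this disclosure.

```python
# Illustrative sketch (not from the patent text): mixing target speech with noise
# at a chosen SNR and extracting simple log-spectral features from the mixture.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose noise onto speech so the result has roughly the given SNR."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def log_spectral_features(wave: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Frame the waveform and compute log power spectra (one feature vector per frame)."""
    n_frames = 1 + (len(wave) - frame_len) // hop   # assumes wave is longer than one frame
    frames = np.stack([wave[i * hop: i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    return np.log(spec + 1e-10)

# Usage: the features of the superimposed audio serve as the "fused acoustic features".
speech = np.random.randn(16000)   # placeholder for 1 s of target speech at 16 kHz
noise = np.random.randn(8000)     # placeholder for station/airport noise
fused = log_spectral_features(mix_at_snr(speech, noise, snr_db=5.0))
```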
In some embodiments, generating the fused acoustic features includes: obtaining a first acoustic feature from the target speech data; obtaining a second acoustic feature from the noise data; and obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature.
In some embodiments, obtaining the first acoustic feature from the target speech data includes: generating a noisy acoustic feature from the target speech data; and generating the first acoustic feature by enhancing the noisy acoustic feature.
In some embodiments, enhancing the noisy acoustic feature includes: applying a LASSO transform to the noisy acoustic feature; and processing the LASSO-transformed acoustic feature with a bottleneck network to obtain the first acoustic feature.
In some embodiments, obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature includes: superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and generating the fused acoustic features by normalizing the superimposed acoustic feature.
In some embodiments, obtaining the fused acoustic features from the first acoustic feature and the second acoustic feature includes: obtaining the number of frames of the first acoustic feature, the number of frames being determined by the duration of the target speech data; constructing a third acoustic feature from the second acoustic feature according to the number of frames of the first acoustic feature; and superimposing the first acoustic feature and the third acoustic feature to obtain the fused acoustic features.
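A minimal sketch combining the two embodiments above is given below, under assumed feature shapes (frames × dimensions): the noise feature matrix is tiled or trimmed to the speech feature's frame count to form the third acoustic feature, the two are superimposed frame by frame, and the result is mean/variance normalized. The concrete shapes are illustrative only.

```python
# Illustrative sketch (assumed shapes, not from the patent text): fusing a speech
# feature matrix with a noise feature matrix at the feature level.
import numpy as np

def fuse_features(speech_feat: np.ndarray, noise_feat: np.ndarray) -> np.ndarray:
    n_frames = speech_feat.shape[0]                          # frames follow speech duration
    reps = int(np.ceil(n_frames / noise_feat.shape[0]))
    third_feat = np.tile(noise_feat, (reps, 1))[:n_frames]   # align noise to n_frames
    stacked = speech_feat + third_feat                        # superimpose frame by frame
    mean = stacked.mean(axis=0, keepdims=True)
    std = stacked.std(axis=0, keepdims=True) + 1e-8
    return (stacked - mean) / std                             # per-dimension normalization

fused = fuse_features(np.random.randn(98, 40), np.random.randn(60, 40))
```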
In some embodiments, the acoustic model is a neural network model, and the training includes: extracting sound source features from a hidden layer of the acoustic model; and training the acoustic model using the sound source features and the fused acoustic features as input features of the acoustic model.
In some embodiments, constructing the speech recognition model from the acoustic model, the pronunciation dictionary and the language model specifically includes: receiving a target keyword from a user terminal; obtaining a target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtaining a target language model from the language model according to the relationships between the characters of the target keyword; and constructing the speech recognition model by merging the acoustic model, the target pronunciation dictionary model and the target language model. In this way, a lightweight speech recognition model targeted at a specific keyword can be generated, which is suitable for user terminals with limited computing resources.
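The following conceptual sketch, using hypothetical data structures rather than the actual dictionary format of this disclosure, illustrates how the target pronunciation dictionary could be restricted to the entries needed for a keyword such as "G109"; the result would then be merged with the pre-trained acoustic model and the target language model into the target speech recognition model.

```python
# Conceptual sketch (assumed data structures, not the patent's actual pipeline):
# deriving the target pronunciation dictionary for the keyword "G109" from a full
# pronunciation dictionary.
full_lexicon = {            # hypothetical entries; real dictionaries are per-language
    "G": ["JH", "IY"],
    "1": ["W", "AH", "N"],
    "0": ["Z", "IH", "R", "OW"],
    "9": ["N", "AY", "N"],
}

def build_target_lexicon(keyword: str) -> dict:
    """Keep only the pronunciation sequences needed to spell out the keyword."""
    return {ch: full_lexicon[ch] for ch in keyword if ch in full_lexicon}

target_lexicon = build_target_lexicon("G109")
print(target_lexicon)
# A real system would merge this subset with the acoustic model and the target
# language model (e.g. in an HCLG-style construction) into the target model.
```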
According to a fourth aspect of the present disclosure, an apparatus for generating a speech recognition model is provided, including: a fusion unit configured to generate fused acoustic features based on target speech data and noise data, the target speech data being audio data that contains the target speech content and the noise data being audio data that does not contain the target speech content; a training unit configured to generate the acoustic model by training on the fused features and the text information of the speech data; and a speech recognition model construction unit configured to construct the speech recognition model from the acoustic model, a pronunciation dictionary and a language model.
According to a fifth aspect of the present disclosure, a method for voice listening substitution is provided, including: obtaining, from a user's travel information, a target keyword related to the user's travel mode; constructing a target speech recognition model corresponding to the target keyword; and sending the target speech recognition model to a user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when a target condition is met, so as to determine whether the target keyword is present in the ambient sound. In this way, a target speech recognition model for a specific keyword can be generated and deployed, realizing intelligent voice listening substitution for that keyword.
According to a sixth aspect of the present disclosure, an apparatus for voice listening substitution is provided, including: a target keyword obtaining unit configured to obtain, from a user's travel information, a target keyword related to the user's travel mode; a speech recognition model construction unit configured to construct a target speech recognition model corresponding to the target keyword; and a sending unit configured to send the target speech recognition model to a user terminal, the target speech recognition model being used to recognize the ambient sound at the user terminal when a target condition is met, so as to determine whether the target keyword is present in the ambient sound. In this way, a target speech recognition model for a specific keyword can be generated and deployed, realizing intelligent voice listening substitution for that keyword.
According to a seventh aspect of the present disclosure, an electronic device is provided, including: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions to be executed by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the electronic device to perform the method according to the first, third or fifth aspect of the present disclosure.
According to an eighth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the program, when executed by a processor, implementing the method according to the first, third or fifth aspect of the present disclosure.
According to a ninth aspect of the present disclosure, a computer program product is provided, including computer-executable instructions which, when executed by a processor, implement the method according to the first, third or fifth aspect of the present disclosure.
Description of Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements, in which:
Fig. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented.
Fig. 2 shows a schematic block diagram of a speech recognition system according to an embodiment of the present disclosure.
Fig. 3 shows a schematic flowchart of a method for voice listening substitution according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an example process of building and deploying a speech recognition model according to an embodiment of the present disclosure.
Fig. 5 shows a schematic flowchart of a method for generating an acoustic model according to an embodiment of the present disclosure.
Fig. 6 shows a schematic flowchart of a method for enhancing speech acoustic features according to an embodiment of the present disclosure.
Fig. 7 shows a schematic conceptual diagram of a method for generating fused features according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a feature fusion process according to an embodiment of the present disclosure.
Fig. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure.
Fig. 10 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
Fig. 11 shows a schematic block diagram of an apparatus for generating a speech recognition model according to an embodiment of the present disclosure.
Fig. 12 shows a schematic block diagram of an apparatus for voice listening substitution according to an embodiment of the present disclosure.
Fig. 13 shows a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "include" and its variants are to be read as open-ended, that is, "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "one embodiment" or "the embodiment" are to be read as "at least one embodiment". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
With the popularity of user terminals such as smartphones, earphones, smart watches and wristbands, it is often difficult for a user wearing earphones or another user terminal to hear sounds in the surrounding environment clearly, which causes inconvenience in some scenarios. For example, when a user is waiting for a flight or a train at an airport or a railway station while wearing earphones to listen to music or watch a video, the user may not clearly hear the announcements broadcast at these places and may miss the flight or train.
As noted above, although some electronic products can already recognize speech, most of them recognize voice commands provided by the manufacturer, and speech recognition for a user's personalized keywords is relatively difficult; such products therefore cannot listen for flight numbers or train numbers in announcements. In addition, some personalized speech recognition techniques require the keyword to be recognized to be entered manually, and the keyword can only be recognized after the machine has learned it, which is inconvenient and requires considerable computing resources. In view of this, the present disclosure provides a voice listening substitution technique: the user terminal obtains a speech recognition model for recognizing personalized keywords, uses the model to listen for the travel-information keywords in the ambient sound, and prompts the user when the target keyword is recognized. In other words, the speech recognition model listens to the ambient sound on behalf of the user and prompts the user about travel information, providing a better intelligent experience.
Example Environment and System
Fig. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The application scenario of embodiments of the present disclosure is that a user terminal (for example, a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer) in a high-noise, strongly reverberant environment can recognize personalized content, such as a flight number or a train number, in broadcast announcements, helping the user monitor the speech content in the environment. For example, while the user is wearing noise-canceling earphones and listening to music, keywords of interest to the user in an external announcement can be recognized and the user can be reminded, thereby realizing intelligent listening substitution.
As shown in Fig. 1, the example environment 100 includes a first user terminal 110 and a second user terminal 120 on the user side and a server 130 on the cloud side. The first user terminal 110 and the second user terminal 120, as a whole, can connect to and communicate with the server 130 via various wired or wireless communication technologies, including but not limited to Ethernet, cellular networks (4G, 5G, etc.), wireless local area networks (for example, WiFi), the Internet, Bluetooth, near field communication (NFC) and infrared (IR).
According to embodiments of the present disclosure, the server 130 may be a distributed or centralized computing device or cluster of computing devices implemented in a cloud computing environment. The first user terminal 110 and the second user terminal 120 may each include any one or more of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, a laptop computer, etc., and the two may be of the same or different types.
In some embodiments, the first user terminal 110 is not directly connected to the server 130, while the second user terminal 120 is connected to the server 130. In this case, the first user terminal 110 can connect to and communicate with the server 130 via the second user terminal 120. For example, the first user terminal 110 may connect to the second user terminal 120 through short-range communication such as Bluetooth, infrared or NFC, while the second user terminal communicates and exchanges data with the server 130 through a wireless local area network, the Internet or a cellular network.
In some embodiments, the first user terminal 110 may be directly connected to the server 130. For example, the first user terminal 110 may communicate and exchange data with the server 130 through a wireless local area network, the Internet or a cellular network. In addition, when the first user terminal 110 and the second user terminal 120 are connected to the same wireless local area network, they can communicate and exchange data with each other.
As shown in the figure, the second user terminal 120 can transmit a target keyword, such as the train number or flight number from the travel information, to the server 130 on the cloud side, and the first user terminal 110 can receive a target speech recognition model for that target keyword from the server 130. The server 130 can generate the target speech recognition model, for example a decoding graph, from the received target keyword. The decoding graph is a lightweight speech recognition model that is easy to deploy on user terminals with limited computing resources. The target speech recognition model is sent to the user side to be deployed on a user terminal or to update the local speech recognition model of a user terminal, thereby realizing intelligent listening substitution on the user side, that is, monitoring whether speech corresponding to the target keyword is present in the ambient sound. Although Fig. 1 shows the target keyword being sent from the second user terminal 120 to the server 130 and the target speech recognition model being received by the first user terminal 110, it should be understood that the target keyword may be sent to the server 130 from either user terminal and the target speech recognition model may be sent to and deployed on either user terminal.
By way of example and not limitation, the first user terminal 110 is a noise-canceling headset, the second user terminal 120 is a smartphone, and the first user terminal 110 is connected to the second user terminal 120 via Bluetooth. In this case, applications may be installed on the second user terminal 120, such as a travel-related application, a short message service application, or any other application that stores the user's future itinerary information. The personalized information for which the user wants intelligent listening substitution can be obtained by accessing the applications on the second user terminal 120. According to embodiments of the present disclosure, the second user terminal 120 can automatically obtain the personalized information desired by the user from designated applications installed on it, for example the above-mentioned travel-related application or short message service application, and send it to the server 130 for generating a target speech recognition model for that personalized information.
Although the first user terminal 110 and the second user terminal 120 are shown in Fig. 1 as separate devices, they may also be implemented as the same device (as indicated by the dashed line in the figure). In other words, intelligent listening substitution according to embodiments of the present disclosure can be realized with a single user terminal that sends the personalized information to the server 130 and receives the target speech recognition model from the server 130 for monitoring the speech content in the environment.
Fig. 2 shows a schematic block diagram of a speech recognition system 200 according to an embodiment of the present disclosure. The speech recognition system 200 is used to generate, deploy and use a target speech recognition model for a personalized target keyword, so as to detect whether the target keyword is present in the ambient sound. As shown in Fig. 2, the speech recognition system 200 includes the first user terminal 110 and the second user terminal 120 on the user side, and the server 130 on the cloud side. By way of example and not limitation, the first user terminal 110 may be an audio playback device (for example, a noise-canceling headset or a smart speaker) or a wearable device (for example, a smart watch or a wristband), which connects to the second user terminal 120 via Bluetooth, near field communication, infrared or the like. The second user terminal 120 may be a smartphone, a smart home appliance, a tablet computer, a laptop computer, etc., which can connect to the server 130 in a wired or wireless manner via a wireless local area network, the Internet, a cellular network or the like. The server 130 receives the personalized target keyword fed back from the second user terminal 120 and generates a target speech recognition model for that target keyword. Exemplary functional modules of the first user terminal 110, the second user terminal 120 and the server 130 are described below.
The second user terminal 120 includes a transmission communication module 122, a keyword acquisition module 124 and a storage module 126. The transmission communication module 122 is used to send data to and receive data from the first user terminal 110 and the server 130, for example communicating with the first user terminal 110 through Bluetooth, near field communication or infrared, and communicating with the server 130 through a cellular network or a wireless local area network.
The keyword acquisition module 124 is used to obtain keywords as personalized information. For example, it can read the user's travel information from short messages or a travel application and extract the target keyword from it. The keyword acquisition module 124 is configured to extract keywords, such as a flight number or train number, from the travel information through a compliant scheme (for example, a designated application authorized by the user, such as a travel application or the short message service). For example, the keyword acquisition module 124 may periodically access the designated application to obtain information about upcoming trips. Travel information usually includes the traveler's name, the flight number or train number, time information, location information and so on. Since a flight number or train number is usually a character string of digits and letters, the flight number or train number in the travel information can be determined as the target keyword to be used for speech recognition, for example by means of a regular expression. In addition, time and location information can also be obtained from the travel information.
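As an illustration of such extraction, the short sketch below pulls a train or flight number out of a travel-information text with a regular expression; the message format and the exact pattern are assumptions for the example, not part of this disclosure.

```python
# Illustrative sketch (hypothetical message format): extracting a train/flight
# number from a travel-information text with a regular expression, as the keyword
# acquisition module might do.
import re

message = "2021年6月2日上午7点45分, G109, 北京南 至 上海虹桥"

# Train numbers like "G109" or flight numbers like "CA1234": letters then digits.
keyword_match = re.search(r"\b[A-Z]{1,2}\d{2,4}\b", message)
target_keyword = keyword_match.group(0) if keyword_match else None

print(target_keyword)   # -> "G109"
# Departure time and station would be parsed with patterns matched to the sender's
# known message template; the pattern above is only an example.
```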
The storage module 126 can be used to store the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120 (for example, the identification information and address of the first user terminal 110), the target speech recognition model received from the server 130, and a request identifier. The request identifier can serve as a unique identifier of a request that asks the server for a target speech recognition model. When the server 130 sends the target speech recognition model by broadcast, the second user terminal 120 can use the request identifier to determine whether the target speech recognition model was requested by itself and thus whether to receive it.
The first user terminal 110 includes a transmission communication module 112, a speech recognition model 114 and a prompt module 116. The transmission communication module 112 is used to send data to and receive data from the second user terminal 120, for example communicating with the second user terminal 120 through Bluetooth, near field communication or infrared. When the first user terminal is capable of communicating directly with the server 130, the transmission communication module 112 is also used to communicate with the server 130, for example through a cellular network or WiFi.
The speech recognition model 114 is generated based on one or more target keywords and can be updated according to a target speech recognition model for a new target keyword received from the server 130. For example, the speech recognition model 114 may be configured to recognize multiple keywords, monitoring at run time whether the ambient sound contains these target keywords. Updating the speech recognition model enables the updated speech recognition model 114 to monitor whether the ambient sound contains the new target keyword, for example by adding the new target keyword, or by replacing one of the existing target keywords, for example the one that has been present the longest, with the new target keyword. When the target keyword is detected, the updated speech recognition model 114 can trigger the prompt module 116 to generate prompt information. The prompt module 116 can cause the first user terminal 110 or the second user terminal to issue an audible or visual prompt.
The server 130 includes a transmission communication module 132, a speech recognition model construction module 134, an offline acoustic model training module 136 and a model library 138. In the server 130, the transmission communication module 132 is configured to receive the target keyword transmitted by the keyword acquisition module 124 and forward it to the speech recognition model construction module 134. The speech recognition model construction module 134 is configured to construct a customized target speech recognition model from the received target keyword and the model library 138, and to transmit the constructed target speech recognition model to the first user terminal 110 or the second user terminal 120.
The offline acoustic model training module 136 is configured to pre-train the acoustic model offline, following the training criteria for speech recognition acoustic models and a robust acoustic model training method. The trained acoustic model can be stored in the model library 138. Note that training the acoustic model can be performed offline and is therefore decoupled from the construction process of the speech recognition model construction module 134. According to embodiments of the present disclosure, the acoustic model can be designed for high-noise, strongly reverberant environments, for example based on fused features, to achieve more accurate speech recognition.
The model library 138 is configured to store trained models, including acoustic models trained offline on demand (obtained through the offline acoustic model training module 136 described above), pronunciation dictionaries, language models and the like. These models can all be trained offline and are used by the speech recognition model construction module 134 to construct the target speech recognition model for the target keyword.
The speech recognition model construction module 134 can be configured to combine the pre-trained acoustic model, pronunciation dictionary and language model in the model library 138 with the target keyword transmitted by the transmission communication module 132, and to generate the target speech recognition model according to a keyword recognition model construction algorithm. Note that the process of constructing the target speech recognition model has no strong dependence on the offline acoustic model training operation and can be performed asynchronously. Therefore, the speech recognition model construction module 134 can obtain a pre-trained acoustic model from the model library 138 to construct the target speech recognition model.
Although the first user terminal 110 and the second user terminal 120 are shown in Fig. 2 as separate devices, they may also be implemented as the same device (as indicated by the dashed line in the figure) to realize the intelligent listening substitution solution according to embodiments of the present disclosure. In this case, the target keyword is obtained from a single user terminal, and the speech recognition model for the target keyword is deployed on that same user terminal.
Intelligent Voice Listening Substitution
Fig. 3 shows a schematic flowchart of a method 300 for voice listening substitution according to an embodiment of the present disclosure. The method 300 can be implemented on the user terminal 110 shown in Figs. 1 and 2. The user terminal 110 may be, for example, a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer that has a sensor capable of receiving sound, such as a microphone.
At block 310, the user terminal 110 obtains a target speech recognition model corresponding to a target keyword; the target speech recognition model is constructed by the server 130 according to the target keyword, and the target keyword is obtained according to the user's travel information. According to embodiments of the present disclosure, as described above, the user terminal 110 can receive the target speech recognition model from the connected user terminal 120 (for example, a smartphone) via a wireless connection such as Bluetooth. Alternatively, if the user terminal 110 is capable of communicating directly with the server 130, the user terminal 110 can receive the target speech recognition model directly from the server 130.
As mentioned above, the user's travel information indicates that the user will go to a transportation venue such as an airport or a railway station to travel by plane or train, or to pick up or see off someone there. Travel information usually includes the flight number or train number, the location of the transportation venue, and the departure or arrival time. The target keyword in the travel information may be a character string representing the flight number or train number, usually composed of letters and digits. For example, the travel information may contain: "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109", the location is "Beijing South Railway Station", and the time is "7:45 a.m., June 2, 2021".
The target speech recognition model is constructed by the server 130 based on the received target keyword. In some embodiments, the target keyword can be obtained from another user terminal 120 connected to the user terminal 110 and sent to the server 130. For example, the user terminal 110 (for example, a noise-canceling headset) is connected to another user terminal 120 (for example, a smartphone) through Bluetooth or another short-range communication method. The user's travel information is obtained by accessing a travel application, short messages or another authorized application on the user terminal 120. The user terminal can send the target keyword in the travel information, such as the flight number or train number, to the server 130. The server 130 can then construct a target speech recognition model for recognizing the target keyword based on the received keyword and transmit the constructed model to the user terminal 110, as described below with reference to Fig. 4.
After the user terminal 110 receives the target speech recognition model from the server 130, at block 320 the local speech recognition model is updated according to the target speech recognition model to obtain an updated speech recognition model; the local speech recognition model is the speech recognition model stored in the user terminal. Before the update, the local speech recognition model 114 of the user terminal 110 can recognize one or more keywords, and only after the update can it recognize the target keyword. In some embodiments, the local speech recognition model and the target speech recognition model may be, for example, decoding graphs. A decoding graph is a set of decoding paths for the grammar constraint rules determined by the keywords to be recognized; the details of decoding graphs are described below in the section "Generation and Deployment of the Speech Recognition Model" and are not elaborated here. The decoding path of the target speech recognition model for the target keyword is added to the local speech recognition model, so that the local speech recognition model is updated and can then recognize the target keyword. Alternatively, in view of model size constraints, an existing decoding path in the local speech recognition model, for example the decoding path of the keyword that has been present in the local model the longest, can be replaced with the decoding path of the target speech recognition model for the target keyword.
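A minimal sketch of such an update policy, using hypothetical data structures rather than an actual decoding graph, is shown below: the keyword's decoding path is added to the local set, and when a size limit is reached the keyword that has been present the longest is evicted. The limit of five keywords is an assumption for the example.

```python
# Minimal sketch (hypothetical data structures): maintaining the keyword decoding
# paths of the local model, adding a new keyword's path and evicting the oldest
# keyword when a size limit is reached.
from collections import OrderedDict

MAX_KEYWORDS = 5   # assumed limit, not prescribed by the disclosure

class LocalRecognitionModel:
    def __init__(self):
        # keyword -> its decoding path (kept in insertion order, oldest first)
        self.paths = OrderedDict()

    def update(self, keyword: str, decoding_path: list) -> None:
        if keyword in self.paths:
            self.paths.move_to_end(keyword)
        elif len(self.paths) >= MAX_KEYWORDS:
            self.paths.popitem(last=False)   # drop the keyword present the longest
        self.paths[keyword] = decoding_path

model = LocalRecognitionModel()
model.update("G109", ["G", "1", "0", "9"])
```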
It should be understood that if the user terminal 110 has no local speech recognition model, the target speech recognition model can be deployed directly as the local speech recognition model. In this case, the local speech recognition model is dedicated to recognizing the corresponding target keyword and can be updated later.
At block 330, it is determined whether the user terminal 110 satisfies a target condition. If the target condition is satisfied, then at block 340 the collected ambient sound is recognized according to the updated speech recognition model to obtain a recognition result. In other words, the updated speech recognition model is triggered to monitor the announcements in the external environment only under appropriate conditions. Since the speech recognition model 114 may have existed on the user terminal 110 for some time, there is no need to start monitoring the ambient sound immediately. Allowing the local speech recognition model to run only when certain target conditions are satisfied matches the user's real listening needs and also saves the user terminal's computing resources and power.
In some embodiments, the target condition may be that the user's location matches the location information in the travel information. As described above, besides the target keyword, travel information usually also includes location information. For example, if the travel information contains "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao", then "Beijing South Railway Station" is taken as the location information. When the user's location matches "Beijing South Railway Station", for example when the user is determined to be in or near Beijing South Railway Station according to the GPS information or other positioning information of the user terminal, the updated speech recognition model is enabled to recognize the collected ambient sound. In this way, when the geographic condition is satisfied, the updated speech recognition model can be used automatically to recognize keywords in the ambient sound without user interaction, providing a better user experience.
In some embodiments, the target condition may also be that the current time is within a predetermined period before the time information; when it is, the collected ambient sound is recognized according to the updated speech recognition model. Taking the travel information of the above example again, the time information is "7:45 a.m., June 2, 2021". For example, when the current time is within half an hour, one hour or some other period before "7:45 a.m., June 2, 2021", the updated speech recognition model is used to recognize the collected ambient sound, since the target keyword the user wants to listen for is usually broadcast in the airport or station during such periods. In this way, when the time condition is satisfied, the updated speech recognition model can be used automatically to recognize keywords in the ambient sound without user interaction, providing a better user experience.
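Purely as an illustration, the sketch below checks the two target conditions described above: whether the user is within an assumed distance of the departure station and whether the current time falls within an assumed window before departure. The distance threshold, the one-hour window and the coordinates are example values, not values prescribed by this disclosure.

```python
# Illustrative sketch (assumed coordinates and thresholds): deciding whether to
# start listening based on the location and time target conditions.
from datetime import datetime, timedelta
import math

def within_distance(user, station, max_km=1.0):
    """Rough great-circle distance check between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*user, *station))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a)) <= max_km

def should_listen(now, departure, user_pos, station_pos, window=timedelta(hours=1)):
    time_ok = departure - window <= now <= departure
    place_ok = within_distance(user_pos, station_pos)
    return time_ok or place_ok      # the conditions may be used alone or combined

departure = datetime(2021, 6, 2, 7, 45)
print(should_listen(datetime(2021, 6, 2, 7, 0), departure,
                    (39.865, 116.378), (39.865, 116.378)))   # -> True
```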
The user's location information and time information may be provided by the user terminal 110 itself, or obtained from another device, for example from another user terminal 120 connected to the user terminal 110. In addition, execution of the speech recognition model may be triggered by the user terminal 110 itself or by another terminal, for example the user terminal 120 (for example, by sending a trigger signal over the Bluetooth connection). In some embodiments, the above target conditions for triggering the speech recognition model can be used individually or in combination.
Alternatively, execution of the speech recognition model may be triggered manually by the user, for example via a button. In particular, when monitoring is triggered manually, the button may be provided on the user terminal 110 acting as the intelligent listening device, on the other user terminal 120, or in an application on the user terminal 110 or 120.
In some embodiments, the speech recognition model 114 of the user terminal 110 can recognize multiple keywords. In this case, the user may select some or all of them for recognition, or the most recently updated target keyword may be selected automatically.
At block 340, the collected ambient sound is recognized according to the updated speech recognition model to obtain a recognition result. To recognize the collected ambient sound, the microphone of the user terminal 110 is first turned on to start collecting the external ambient sound. The collected ambient sound is then recognized locally on the user terminal 110 by the speech recognition model, in real time or near real time. The collected ambient sound can be fed directly into the speech recognition model, which determines whether it corresponds to the text of the target keyword, for example via the decoding paths of the decoding graph. The collected ambient sound can also be buffered on the user terminal 110 and then read by the speech recognition model; the buffered sound may cover, for example, about 10, 20, 30 or more seconds. Over time, if the target keyword is not recognized, the buffered ambient sound can be gradually removed or overwritten.
The initial value of the recognition result can be set to "no". According to embodiments of the present disclosure, the ambient sound can be fed into the speech recognition model frame by frame in chronological order. The speech recognition model determines whether these speech frames correspond to the target keyword: if they match completely, it determines that the target keyword has been recognized; otherwise it determines that the target keyword has not been recognized and restarts monitoring. For example, when the target keyword is "G109", if the speech in the ambient sound contains "G107", then "G", "1", "0", "7" will be recognized in sequence. Before "7" is recognized, the speech recognition model determines, step by step, that the ambient sound matches the leading part of the target keyword (because "G", "1", "0" agree with the target keyword). However, as soon as "7", which does not match the "9" in the target keyword, is recognized, the speech recognition model immediately restarts monitoring and discards the already recognized content "G", "1", "0". In some embodiments, once speech that does not match the keyword is recognized, the associated buffered data can be deleted and monitoring restarted. In fact, whenever the first character of the speech in the ambient sound is not the first character of the target keyword, monitoring is restarted. According to embodiments of the present disclosure, when the complete target keyword is detected, the recognition result can be set to "yes".
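The matching behaviour described above can be sketched as follows, under the simplifying assumption that the model emits one recognized character at a time: progress is kept while the decoded characters match a prefix of the target keyword, the state is reset on any mismatch, and the result becomes "yes" only when the complete keyword has been matched.

```python
# Minimal sketch (simplified: one recognized character per step) of the matching
# behaviour described above.
def listen_for_keyword(decoded_chars, keyword="G109"):
    matched = 0                           # how many keyword characters matched so far
    for ch in decoded_chars:
        if ch == keyword[matched]:
            matched += 1
            if matched == len(keyword):
                return True               # complete keyword detected -> result "yes"
        else:
            # restart; the current character may itself start a new match
            matched = 1 if ch == keyword[0] else 0
    return False

print(listen_for_keyword(list("xxG107xxG109")))   # -> True once "G109" appears
```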
At block 350, it is determined whether speech of the target keyword is present. If the recognition result is "no", it is determined that no speech of the target keyword is present, and monitoring of the ambient sound continues. If the recognition result is "yes", the method proceeds to block 360.
At block 360, the user device 110 prompts the user. The form of the prompt may depend on the capabilities of the user terminal and the user configuration. In some embodiments, the prompt may include, but is not limited to, one or more of text, image, audio and video. For example, when the user terminal 110 is a device with a speaker, in response to detecting that the ambient sound contains the target keyword, the prompt may be playing a designated alert tone, a specific recording, or speech corresponding to the target keyword. When the user terminal is a device with a screen, the prompt may be a pop-up card, a banner, or the like. When the user terminal 110 has both a speaker and a screen, the notification may be any one or a combination of the above. Through these various types of reminders, intelligent listening substitution on the user terminal is realized.
In some embodiments, the user terminal 110 can also provide the prompt to another connected user terminal 120, for example via the Bluetooth communication protocol between the user terminal 110 and the user terminal 120. In this way, the notification can be presented on the user terminal where the speech recognition model is deployed or on another device, achieving a better notification effect.
The above describes the user terminal 110 as the intelligent listening device, but it should be understood that the intelligent listening function can also be implemented on another user terminal (for example, the user terminal 120). In this case, the user terminal 120 sends the target keyword to the server 130, receives the speech recognition model from the server 130, and uses the speech recognition model to monitor the speech content in the environment, without forwarding the speech recognition model to the user terminal 110.
Through the embodiments described above, the target keyword of the travel information can be detected in the ambient sound of a public transportation venue and the user can be reminded, thereby realizing an intelligent listening function in which the device takes the place of the human ear.
Generation and Deployment of the Speech Recognition Model
As described above, the speech recognition model according to embodiments of the present disclosure is a lightweight model deployed on user terminals with limited computing resources. Moreover, this speech recognition model is customized by the user and targeted at a specific target keyword. The process of generating and deploying the speech recognition model according to embodiments of the present disclosure is further described below with reference to Fig. 4.
According to embodiments of the present disclosure, the server 130 constructs the speech recognition model for recognizing the target keyword and deploys it on either of the user terminals 110 and 120, such as a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer or a laptop computer. The user terminals 110 and 120 can use the speech recognition model to recognize whether speech containing the keyword is played in the surrounding environment, especially in a high-noise environment.
Fig. 4 shows a schematic diagram of an example process 400 of building and deploying a speech recognition model according to an embodiment of the present disclosure. Fig. 4 shows an example of deploying the speech recognition model on the first user terminal 110 shown in Figs. 1 and 2, where the first user terminal 110 is connected to the second user terminal 120 via short-range communication such as Bluetooth. It should be understood that the speech recognition model may be deployed on the second user terminal 120, or, when there is only one user terminal, on that single terminal, without departing from the scope of the embodiments of the present disclosure.
When establishing a connection with the second user terminal, the first user terminal 110 can send its own identification information to the second user terminal 120. The second user terminal 120 can store this identification information locally so as to subsequently transmit data to the first user terminal 110, such as the target speech recognition model or other information.
As shown in the figure, the second user terminal 120 can obtain (410) the target keyword that the user wants to recognize. The target keyword text can be a keyword in the user's travel information, such as the flight number or train number of the flight or train the user is going to take. For example, the travel information may contain: "7:45 a.m., June 2, 2021, G109, Beijing South to Shanghai Hongqiao"; correspondingly, the target keyword is "G109". In some embodiments, the keyword in the travel information can be extracted through a compliant scheme (for example, a designated application authorized by the user, such as a travel application or the short message service), or the target keyword can be obtained by accessing short messages from a designated sender (for example, an airline or a train operator).
According to embodiments of the present disclosure, the target keyword can be obtained automatically without being entered manually by the user. For example, if the second user terminal 120 is a smartphone, then after authorization, the target keyword can be extracted from short messages or messages of a designated sender (for example, a transport operator) by accessing the smartphone's short messages or the messages of a designated application. It should be understood that a short message or message containing a flight number or train number may also contain departure time information. In some embodiments, the keyword text can also be obtained on the basis of such time information. For example, the nearest upcoming flight number or train number can be taken as the target keyword. Alternatively, the flight numbers or train numbers within a preset time period from the current moment (for example, one day) can be taken as the keyword text.
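As a small illustration with hypothetical trip records, the sketch below selects either the nearest upcoming flight or train number, or all numbers departing within a preset window such as one day, as candidate target keywords.

```python
# Small sketch (hypothetical trip records): choosing which trip's number becomes
# the target keyword.
from datetime import datetime, timedelta

trips = [  # (keyword, departure time) parsed earlier from messages
    ("G109", datetime(2021, 6, 2, 7, 45)),
    ("CA1833", datetime(2021, 6, 3, 13, 20)),
]

def nearest_keyword(trips, now):
    upcoming = [t for t in trips if t[1] >= now]
    return min(upcoming, key=lambda t: t[1])[0] if upcoming else None

def keywords_within(trips, now, window=timedelta(days=1)):
    return [k for k, dep in trips if now <= dep <= now + window]

now = datetime(2021, 6, 2, 6, 0)
print(nearest_keyword(trips, now))    # -> "G109"
print(keywords_within(trips, now))    # -> ["G109"]
```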
然后,第二用户终端120可以向服务器130请求420针对目标关键词的语音识别模型。第二用户终端120可以通过蜂窝网络或无线局域网(例如WiFi)等向服务器130发送包括目标关键词的请求。Then, the second user terminal 120 may request 420 the speech recognition model for the target keyword from the server 130 . The second user terminal 120 may send a request including the target keyword to the server 130 through a cellular network or a wireless local area network (such as WiFi).
在一些实施例中,请求还可以包括第二用户终端120的标识符(例如,IMSI、IMEI或其他唯一标识符)以及第二用户终端的当前连接信息,包括但不限于蓝牙连接信息(例如蓝牙地址、设备标识等)、无线局域网连接信息(例如,无线接入点地址、设备标识等)等。这些信息可以用于建立服务器130与第二用户终端120或第一用户终端110之间的点对点连接。In some embodiments, the request may also include an identifier of the second user terminal 120 (e.g., IMSI, IMEI, or other unique identifier) and current connection information of the second user terminal, including but not limited to Bluetooth connection information (e.g., Bluetooth address, device identification, etc.), wireless local area network connection information (for example, wireless access point address, device identification, etc.), etc. Such information may be used to establish a point-to-point connection between the server 130 and the second user terminal 120 or the first user terminal 110 .
备选地,请求还包括可以唯一标识该请求的请求标识。请求标识可以由第二用户终端使用任何合适的方式来生成,例如,可以根据第二设备的设备标识(例如,IMSI、IMEI等)或其他唯一性标识、与第二用户终端120连接的第一用户终端110的连接信息、时间戳等中一项或多项来生成该请求标识。请求标识可以用于服务器130以广播方式传输构建的语音识别模型。为此,第二用户终端120可以在本地创建和维护一个映射表。映射表中包括相关联存储的第二用户终端120的设备标识、与第二用户终端120连接的第一用户终端110的连接信息、以及所生成的请求标识。Alternatively, the request further includes a request identifier that can uniquely identify the request. The request identifier may be generated by the second user terminal in any suitable manner, for example, based on one or more of the device identifier of the second user terminal 120 (for example, IMSI, IMEI, etc.) or another unique identifier, the connection information of the first user terminal 110 connected to the second user terminal 120, a timestamp, and the like. The request identifier can be used by the server 130 to transmit the constructed speech recognition model in a broadcast manner. To this end, the second user terminal 120 may locally create and maintain a mapping table. The mapping table stores, in association, the device identifier of the second user terminal 120, the connection information of the first user terminal 110 connected to the second user terminal 120, and the generated request identifier.
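A minimal sketch of how such a request identifier and the local mapping table might be realized is given below; the hashing scheme, field names and example values are illustrative assumptions rather than the specific implementation of the embodiments.

```python
import hashlib
import time

def make_request_id(device_id: str, peer_connection_info: str) -> str:
    """Hash the device identifier, peer connection info and a timestamp into a request id."""
    raw = f"{device_id}|{peer_connection_info}|{time.time_ns()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Local mapping table kept on the second user terminal:
# request id -> (own device identifier, connection info of the first user terminal).
mapping_table: dict = {}

request_id = make_request_id("866123456789012", "AA:BB:CC:DD:EE:FF")
mapping_table[request_id] = ("866123456789012", "AA:BB:CC:DD:EE:FF")
```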
服务器130接收第二用户终端120的请求,并且基于请求中的目标关键词来构建430针对该目标关键词的语音识别模型。根据本公开的实施例,所构建的语音识别模型是轻量级解码图,解码图是目标关键词确定的语法约束规则的解码路径集合。服务器130例如基于HCLG(HMM+Context+Lexicon+Grammar)解码图构建过程来生成解码图。The server 130 receives the request of the second user terminal 120, and builds 430 a speech recognition model for the target keyword based on the target keyword in the request. According to the embodiments of the present disclosure, the constructed speech recognition model is a lightweight decoding graph, and the decoding graph is a set of decoding paths under the grammar constraint rules determined by the target keyword. The server 130 generates the decoding graph based on, for example, an HCLG (HMM+Context+Lexicon+Grammar) decoding graph construction process.
在一些实施例中,服务器130基于语法规则和文法规则(例如,JSpeech Grammar Format,简称为"JSGF"文法规则)、n-gram统计规则等构建针对该关键词的特定轻量级语言模型,即目标语言模型(G.fst)。区别于传统语言模型构建,依赖大规模海量数据的训练文本,让机器尽可能充分地学习所有满足自然语言逻辑的字、词、句子、段落间的关系,从而让语言模型近乎全覆盖的包含所有学习单元(字、词、句子、段落)间的转移概率和连接权重,服务器130仅根据目标关键词来约束目标关键词的字与字之间的转移概率和连接权重,而忽略其他学习单元间的关系和连接,进而将目标语言模型定制为只符合该目标关键词文法约束规范的参数集合,以保证对该目标关键词有识别能力。例如,将目标关键词的字组合确定为具有更高的出现概率,而将其他非目标关键词的组合出现概率置为0。In some embodiments, the server 130 builds a specific lightweight language model for the keyword, namely the target language model (G.fst), based on syntax rules and grammar rules (for example, the JSpeech Grammar Format, or "JSGF", grammar rules), n-gram statistical rules, and so on. Unlike conventional language model construction, which relies on training text from large-scale massive data so that the machine learns, as fully as possible, the relationships among all characters, words, sentences and paragraphs that satisfy natural-language logic, and thus makes the language model cover nearly all transition probabilities and connection weights among the learning units (characters, words, sentences, paragraphs), the server 130 only constrains the character-to-character transition probabilities and connection weights of the target keyword according to the target keyword, and ignores the relationships and connections among other learning units, thereby customizing the target language model into a parameter set that conforms only to the grammar constraint specification of the target keyword, so as to guarantee the ability to recognize the target keyword. For example, the character combinations of the target keyword are determined to have a higher occurrence probability, while the occurrence probability of other, non-target-keyword combinations is set to 0.
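To make the constraint concrete, a keyword-only grammar in JSGF syntax might be generated roughly as follows; this is only a sketch of the idea of restricting the language model to the target keyword's character sequence, and the file name and rule layout are assumptions for illustration.

```python
# Build a keyword-only grammar in JSGF syntax; any word sequence outside this
# single rule is simply absent from the grammar and therefore carries no weight.
target_keyword = "G109"

jsgf_grammar = (
    "#JSGF V1.0;\n"
    "grammar keyword;\n"
    f"public <keyword> = {' '.join(target_keyword)};\n"
)
# For "G109" the rule expands to the character sequence "G 1 0 9",
# mirroring the character-level transition constraint described above.
with open("keyword.gram", "w", encoding="utf-8") as f:
    f.write(jsgf_grammar)
```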
然后,根据该目标关键词从存储于模型库138的发音词典里选择出特定发音序列,结合发音词典里的音素描述文件构建目标发音词典模型(L.fst),由于该发音序列是根据目标关键词检索得到,相比于原始发音词典,规模也大大减小。另外,服务器130还通过离线训练得到声学模型,例如HMM模型(H.fst)。Then, a specific pronunciation sequence is selected, according to the target keyword, from the pronunciation dictionary stored in the model library 138, and the target pronunciation dictionary model (L.fst) is constructed in combination with the phoneme description file in the pronunciation dictionary. Since the pronunciation sequence is retrieved according to the target keyword, its scale is also greatly reduced compared with the original pronunciation dictionary. In addition, the server 130 also obtains an acoustic model, such as an HMM model (H.fst), through offline training.
服务器130对目标语言模型、目标发音字典模型和声学模型进行模型合并,以获得语音识别模型。该语音识别模型使用了原始声学模型,根据目标关键词构建的轻量级目标语言模型以及由目标关键词检索到的轻量级发音字典模型,故构建所得语音识别模型具有轻量化结构,相比广义语音识别模型,该模型仅包含针对目标关键词的转移概率和连接权重,参数规模得到了极大的缩减。该语音识别模型可以为如上所述的解码图。具体地,服务器130合并上述构建的目标语言模型(G.fst)和发音字典模型(L.fst),生成合并后的发音词典和语言模型(LG.fst),接着合并由发音词典模型生成的上下文模型(C.fst)以生成CLG.fst,最后合并上述构建的HMM模型(H.fst)即可生成解码图模型(HCLG.fst),作为针对目标关键词的语音识别模型。The server 130 merges the target language model, the target pronunciation dictionary model and the acoustic model to obtain the speech recognition model. The speech recognition model uses the original acoustic model, the lightweight target language model constructed according to the target keyword, and the lightweight pronunciation dictionary model retrieved from the target keyword, so the constructed speech recognition model has a lightweight structure. Compared with a generalized speech recognition model, this model contains only the transition probabilities and connection weights for the target keyword, and the parameter scale is greatly reduced. The speech recognition model may be the decoding graph described above. Specifically, the server 130 merges the target language model (G.fst) and the pronunciation dictionary model (L.fst) constructed above to generate a merged pronunciation dictionary and language model (LG.fst), then merges the context model (C.fst) generated from the pronunciation dictionary model to generate CLG.fst, and finally merges the HMM model (H.fst) constructed above to generate the decoding graph model (HCLG.fst) as the speech recognition model for the target keyword.
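The composition chain itself follows the usual OpenFst-style recipe. The sketch below uses the pywrapfst bindings and assumes that H.fst, C.fst, L.fst and G.fst have already been produced as described above; a production decoding-graph build (for example, a Kaldi-style pipeline) involves additional steps such as disambiguation symbols and weight pushing that are omitted here.

```python
import pywrapfst as fst

# Load the component transducers built for the target keyword.
H = fst.Fst.read("H.fst")   # HMM topology
C = fst.Fst.read("C.fst")   # context dependency
L = fst.Fst.read("L.fst")   # target pronunciation dictionary
G = fst.Fst.read("G.fst")   # keyword-only language model

def compose_det_min(a, b):
    """Compose two transducers, then determinize and minimize the result."""
    a.arcsort(sort_type="olabel")
    b.arcsort(sort_type="ilabel")
    out = fst.determinize(fst.compose(a, b))
    out.minimize()
    return out

LG = compose_det_min(L, G)
CLG = compose_det_min(C, LG)
HCLG = compose_det_min(H, CLG)
HCLG.write("HCLG.fst")      # lightweight decoding graph for the target keyword
```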
本公开的实施例提供了声学模型,其适用于高噪声、强混响的环境下对非人声的远场广播语音进行识别,能够显著提高语音识别的准确度。该声学模型下文将参照图5至图9描述。在一些实施例中,声学模型可以采用离线训练或在线的训练方式。此外,本公开不旨在对于发音字典、目标语言模型的类型或训练过程进行限定。Embodiments of the present disclosure provide an acoustic model, which is suitable for recognizing non-human voice far-field broadcast speech in an environment with high noise and strong reverberation, and can significantly improve the accuracy of speech recognition. The acoustic model will be described below with reference to FIGS. 5 to 9 . In some embodiments, the acoustic model can be trained offline or online. In addition, the present disclosure is not intended to limit pronunciation dictionaries, types of target language models, or training procedures.
然后,服务器130将构建好的目标语音识别模型传输440给第二用户终端120。Then, the server 130 transmits 440 the constructed target speech recognition model to the second user terminal 120 .
如上所述,服务器130可以通过点对点方式传输目标语音识别模型。在一些实施例中,服务器130根据请求420中包括的第二用户终端120的标识符,使用蜂窝或WiFi通信协议建立与第二用户终端之间的点对点连接,并且将目标语音识别模型传输440给第二用户终端。As mentioned above, the server 130 may transmit the target speech recognition model in a point-to-point manner. In some embodiments, the server 130 establishes a point-to-point connection with the second user terminal using a cellular or WiFi communication protocol according to the identifier of the second user terminal 120 included in the request 420, and transmits 440 the target speech recognition model to the second user terminal.
接下来,第二用户终端120根据本地的连接信息来确定450将用于部署语音识别模型的第一用户终端110。然后,第二用户终端120通过与第一用户终端110之间的连接将语音识别模型传输460至第一用户终端110。Next, the second user terminal 120 determines 450 the first user terminal 110 to deploy the speech recognition model according to the local connection information. Then, the second user terminal 120 transmits 460 the speech recognition model to the first user terminal 110 through the connection with the first user terminal 110 .
另外,服务器还可以通过广播方式传输目标语音识别模型。服务器130广播所构建的目标语音识别模型和相关联的请求标识。第二用户终端120可以将广播的请求标识与本地的映射表进行对比,来确定是否要接收语音识别模型。如果在映射表中找不到该请求标识,则不接收目标语音识别模型。如果找到该请求标识,则接收对应的目标语音识别模型。In addition, the server can also transmit the target speech recognition model through broadcasting. The server 130 broadcasts the constructed target speech recognition model and the associated request identifier. The second user terminal 120 may compare the broadcast request identifier with the local mapping table to determine whether to receive the speech recognition model. If the request identifier cannot be found in the mapping table, the target speech recognition model is not received. If the request identifier is found, the corresponding target speech recognition model is received.
第二用户终端120还可以根据请求标识来确定连接的第一用户终端110。第二用户终端120可以使用请求标识在映射表中查找与该请求标识对应的第一用户终端110的连接信息,例如第一用户终端110的标识信息等,从而确定450要接收目标语音识别模型的第一用户终端110。然后,第二用户终端120向所确定的第一用户终端110发送460目标语音识别模型。The second user terminal 120 may also determine the connected first user terminal 110 according to the request identifier. The second user terminal 120 can use the request identifier to look up, in the mapping table, the connection information of the first user terminal 110 corresponding to the request identifier, such as the identification information of the first user terminal 110, so as to determine 450 the first user terminal 110 that is to receive the target speech recognition model. Then, the second user terminal 120 sends 460 the target speech recognition model to the determined first user terminal 110.
在接收到语音识别模型后,第一用户终端110可以部署目标语音识别模型或者基于目标语音识别模型来更新本地的语音识别模型,在满足目标条件时开始执行470更新后的语音识别模型以监听环境声音中是否存在目标关键词,如上文参照图3描述的过程300。After receiving the speech recognition model, the first user terminal 110 may deploy the target speech recognition model or update the local speech recognition model based on the target speech recognition model, and, when the target condition is satisfied, start running 470 the updated speech recognition model to monitor whether the target keyword exists in the ambient sound, as in the process 300 described above with reference to FIG. 3.
图4描述了从服务器130经由第二用户终端120向第一用户终端110传输目标语音识别模型的过程。在一些实施例中,第一用户终端110可以具有与服务器130直接通信的能力。因此,还可以从服务器130直接向第一用户终端110传输目标语音识别模型。服务器130可以使用第二用户终端120上报的第一用户终端110的信息(例如,蓝牙连接信息、无线局域网连接信息等)来定位第一用户终端110,直接将目标语音识别模型传输到第一用户终端110。FIG. 4 describes the process of transmitting the target speech recognition model from the server 130 to the first user terminal 110 via the second user terminal 120. In some embodiments, the first user terminal 110 may have the capability of communicating directly with the server 130. Therefore, the target speech recognition model may also be transmitted directly from the server 130 to the first user terminal 110. The server 130 can locate the first user terminal 110 by using the information of the first user terminal 110 reported by the second user terminal 120 (for example, Bluetooth connection information, wireless local area network connection information, etc.), and transmit the target speech recognition model directly to the first user terminal 110.
此外,第二用户终端120也可以不向第一用户终端110传输接收到的语音识别模型,而是由自己来执行语音识别模型以实现语音代听功能。In addition, the second user terminal 120 may also not transmit the received speech recognition model to the first user terminal 110, but execute the speech recognition model by itself to realize the voice listening function.
声学模型 Acoustic model
根据本公开的实施例的针对目标关键词的语音识别模型被用于识别机场或火车站的环境声音中的广播语音。然而,识别这种环境声音是有挑战的。首先,机场广播通常距离用户的拾音设备过远,有较强混响干扰。其次,广播音基本都是根据固定模板合成的,与标准人声普通话有较大区别。最后,大厅中有其他旅客的交谈声等各式噪声,环境异常复杂。因此,希望提供一种利用用户终端在噪声环境下准确识别复杂背景噪声环境中的广播语音内容的方案。The speech recognition model for target keywords according to an embodiment of the present disclosure is used to recognize broadcast speech in the ambient sound of an airport or a train station. However, recognizing such ambient sound is challenging. First, the airport broadcast is usually too far away from the user's sound pickup device, resulting in strong reverberation interference. Second, broadcast sounds are basically synthesized according to fixed templates and differ considerably from standard spoken Mandarin. Finally, there are various noises such as the conversations of other passengers in the hall, and the environment is extremely complicated. Therefore, it is desirable to provide a solution in which a user terminal accurately recognizes broadcast speech content in a complex background noise environment.
本公开利用深度学习技术,通过离线训练来获得能够在诸如机场、火车站等高噪声、强混响环境下识别广播内容的声学模型。图5示出了根据本公开的实施例的用于生成声学模型的方法500的示意流程图。The present disclosure utilizes deep learning technology to obtain, through offline training, an acoustic model capable of recognizing broadcast content in environments with high noise and strong reverberation, such as airports and railway stations. FIG. 5 shows a schematic flowchart of a method 500 for generating an acoustic model according to an embodiment of the present disclosure.
方法500包括,在框510,在噪声场所采集声音数据。为了使声音适于检测噪声环境下的语音,从噪声环境采集声音数据以产生用于训练和构建声学模型的训练数据。 Method 500 includes, at block 510, collecting sound data in a noisy location. In order to adapt the sound to detect speech in a noisy environment, sound data is collected from the noisy environment to generate training data for training and building an acoustic model.
例如,可以使用各种类型的手机、具有录音功能的耳机、录音笔等设备在机场、火车站的多个位置处采集环境声音。声音采集地点可以包括但不限于柜台大厅、安检通道、候机厅、便利店、餐饮区域、公共卫生间等位置,以便覆盖用户能够到达的区域。具体地,可以根据采集位置所在区域(如航站楼)的大小,以一个位置覆盖半径为R(R>0)米的圆形面积为标准,设置若干个采集位置。声音采集方式可以是关闭录音设备的前端增益,连续不间断录音(例如,持续二十四小时),确保能在各个位置将不含广播音的背景噪声录制。在一些实施例中,可以采用静态录音,将声音采集设备固定并连续不间断录音。备选地,还可以采用动态录音,由人或机器持采集设备在噪声场所内移动,并连续不间断录音。此外,录音格式可以是例如wav格式、16kHz、16bit、多通道等,但不限于此。For example, various types of mobile phones, earphones with recording functions, recording pens and other equipment can be used to collect ambient sounds at multiple locations in airports and train stations. Sound collection locations may include, but are not limited to, counter halls, security check passages, waiting halls, convenience stores, dining areas, public restrooms, etc., so as to cover areas that users can reach. Specifically, according to the size of the area where the collection location is located (such as a terminal building), several collection locations can be set based on a circular area with a location coverage radius of R (R>0) meters as a standard. The way of sound collection can be to turn off the front-end gain of the recording device, and record continuously (for example, for 24 hours), so as to ensure that background noise without broadcast sound can be recorded at various locations. In some embodiments, static recording can be adopted, and the sound collection device is fixed and continuously and uninterruptedly recorded. Alternatively, dynamic recording can also be used, where a person or machine moves the acquisition device in a noisy place and records continuously and uninterruptedly. In addition, the recording format may be, for example, wav format, 16kHz, 16bit, multi-channel, etc., but is not limited thereto.
以上描述了获取语音数据和噪声数据的声学特征的示例性过程。声学特征可以按照上述方式来获取,也可以通过其他方式来获取,例如,访问已有的带噪语音特征或各种类型的已有噪声特征,而不需要专门现场采集。An exemplary process of acquiring acoustic features of voice data and noise data is described above. Acoustic features can be acquired in the manner described above, or in other ways, for example, by accessing existing noisy speech features or various types of existing noise features, without the need for dedicated on-site collection.
在框520,预处理声音数据,得到目标语音数据和噪声数据。根据本公开的实施例,由于连续不间断录音,采集到的原始声音数据在一部分时间段内包括广播语音,而其他时间段不包括广播语音。预处理可以包括手动或通过机器将原始声音数据划分为包括目标语音内容的音频数据和不包括目标语音内容的音频数据,并分别进行标注。在一些实施例中,目标语音数据被标注了该数据来自的位置信息以及该目标语音数据的文本,例如,包括航班号或车次。对于噪声数据,仅标注噪声数据的位置信息。In block 520, the voice data is preprocessed to obtain target voice data and noise data. According to the embodiments of the present disclosure, due to the continuous and uninterrupted recording, the collected original sound data includes broadcast voices in a part of time periods, while other time periods do not include broadcast voices. The preprocessing may include manually or by machine dividing the original sound data into audio data including the target speech content and audio data not including the target speech content, and marking them respectively. In some embodiments, the target voice data is marked with the location information from which the data comes and the text of the target voice data, for example, including flight number or train number. For noisy data, only the location information of the noisy data is marked.
在框530,提取语音数据和噪声数据的声学特征。可以通过对标注后的语音数据和噪声数据进行分帧、加窗、FFT等处理,来提取声学特征。在一些实施例中,声学特征可以通过例如梅尔频率倒谱系数(MFCC)来表示,但不限于此,其以10ms为一帧,每一帧具有对应的一组参数,每个参数具有0至1之间的值。也就是说,目标语音数据和噪声数据均可以被表示为持续一段时间的一系列帧,每一帧由一组值在0至1之间的参数来表征。At block 530, acoustic features of the speech data and the noise data are extracted. The acoustic features can be extracted by performing framing, windowing, FFT and other processing on the labeled speech data and noise data. In some embodiments, the acoustic features can be represented by, for example, Mel-frequency cepstral coefficients (MFCC), but are not limited thereto; each frame is 10 ms long, each frame has a corresponding set of parameters, and each parameter has a value between 0 and 1. That is, both the target speech data and the noise data can be represented as a series of frames lasting for a period of time, and each frame is characterized by a set of parameters with values between 0 and 1.
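As one possible illustration of this feature extraction step, the sketch below computes MFCC frame features with librosa and scales them into the 0 to 1 range; the 10 ms hop, the 40-dimensional feature size and the min-max normalization are assumptions chosen to match the description above rather than a prescribed configuration.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of MFCC features scaled into [0, 1]."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis window with a 10 ms hop (160 samples at 16 kHz).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160).T
    # Min-max normalize each coefficient into [0, 1], as assumed by the fusion step.
    mn, mx = mfcc.min(axis=0), mfcc.max(axis=0)
    return (mfcc - mn) / (mx - mn + 1e-8)
```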
目标语音数据经过分帧、加窗、FFT等处理提取到的声学特征是带噪声学特征。带噪声学特征可以被增强,得到尽可能纯净的语音声学特征,从而减少噪声给识别带来的不利影响。参照图6,其示出了根据本公开的实施例的用于增强语音声学特征的方法600的示意流程图。The acoustic features extracted from the target speech data through framing, windowing, FFT and other processing are noisy acoustic features. The noisy acoustic features can be enhanced to obtain speech acoustic features that are as clean as possible, thereby reducing the adverse effect of noise on recognition. Referring to FIG. 6, it shows a schematic flowchart of a method 600 for enhancing speech acoustic features according to an embodiment of the present disclosure.
在框610,对输入的带噪语音声学特征进行LASSO变换,以对声学特征进行混响抑制。混响是指,当声波在室内传播时被墙壁、天花板、地板等障碍物反射和吸收,在声源停止发射声波后,声波在室内经过多次反射和吸收,最后才会消失,这种声源停止发声后的声音仍然存在的现象称为混响。混响不利于准确识别语音中的内容。At block 610, a LASSO transform is performed on the input noisy speech acoustic features to suppress reverberation in the acoustic features. Reverberation refers to the phenomenon that sound waves propagating indoors are reflected and absorbed by obstacles such as walls, ceilings and floors, so that after the sound source stops emitting sound waves, the waves undergo multiple reflections and absorptions in the room before finally dying out, and the sound therefore persists after the source stops sounding. Reverberation is not conducive to accurately recognizing the content of speech.
LASSO变换也称为LASSO回归。通过限制声学特征中的重要变量(也就是系数不为0的变量)与其他变量的相关关系的条件,可以去除与混响有关的声学特征,从而抑制混响带来的不利影响。LASSO transformation is also known as LASSO regression. By limiting the conditions of the correlation between important variables in the acoustic features (that is, variables whose coefficients are not 0) and other variables, the acoustic features related to reverberation can be removed, thereby suppressing the adverse effects of reverberation.
在框620,针对混响抑制后的语音数据的声学特征进行bottleneck网络处理。bottleneck网络是一种神经网络模型,包括bottleneck层。bottleneck层相比于前面的层具有更少的节点数,其可以用于获取维度更少的输入表示。在一些实施例中,经过bottleneck网络处理的声学特征的维度可以减少,从而在训练期间获得更好的损失。bottleneck网络的系数可以是预先计算的,也可以在训练过程中更新。At block 620, bottleneck network processing is performed on the acoustic features of the reverberation-suppressed speech data. A bottleneck network is a neural network model that includes a bottleneck layer. The bottleneck layer has fewer nodes than the preceding layers and can be used to obtain a lower-dimensional representation of the input. In some embodiments, the dimensionality of the acoustic features processed by the bottleneck network can be reduced, resulting in a better loss during training. The coefficients of the bottleneck network can be precomputed or updated during training.
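A minimal sketch of such a bottleneck network, written in PyTorch, is shown below; the layer sizes are arbitrary placeholders, and only the general idea of squeezing the features through a narrower layer to obtain a lower-dimensional representation is taken from the description above.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Squeezes enhanced features through a narrow layer to reduce their dimensionality."""

    def __init__(self, in_dim: int = 40, hidden: int = 256, bottleneck: int = 24):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),  # the narrow bottleneck layer
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, in_dim) -> (num_frames, bottleneck)
        return self.encoder(frames)
```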
通过如图6所示的语音增强600,带有背景噪声的语音声学特征被转换为尽可能纯净的语音特征。进一步地,纯净语音特征可以与来源于多个位置的噪声特征融合以生成融合特征。Through speech enhancement 600 as shown in FIG. 6 , speech acoustic features with background noise are transformed into speech features that are as pure as possible. Furthermore, clean speech features can be fused with noise features from multiple locations to generate fused features.
返回图5,在框540,根据语音声学特征和噪声声学特征来生成融合特征。融合特征能够减少在不同场所或同一场所的不同位置处的背景噪声的类型差别、大小差别等对识别准确率的影响。根据本公开的实施例,通过将语音特征和噪声特征逐帧对齐来生成融合特征。Returning to FIG. 5 , at block 540 , fusion features are generated from the speech acoustic features and the noise acoustic features. Fusion features can reduce the impact of the type and size differences of background noise in different places or different positions in the same place on the recognition accuracy. According to an embodiment of the present disclosure, fusion features are generated by aligning speech features and noise features frame by frame.
图7示出了根据本公开的实施例的用于生成融合特征的方法700的示意概念图。如图所示,从原始数据划分得到的目标语音数据经过特征提取710、语音增强720之后产生增强语音特征。并且,噪声数据经过均匀采样后得到在多个位置(例如位置1至位置N)的采样噪声。类似地,对这些来自多个位置处的采样噪声进行特征提取710,以产生噪声特征。特征提取710可以按照参照框530描述的处理来执行,包括分帧、加窗、FFT等处理。根据本公开的实施例,语音数据的声学特征和噪声数据的声学特征可以具有相同的帧大小,例如均为10ms,以便可以逐帧融合。FIG. 7 shows a schematic conceptual diagram of a method 700 for generating fusion features according to an embodiment of the present disclosure. As shown in the figure, the target speech data divided from the original data undergoes feature extraction 710 and speech enhancement 720 to generate enhanced speech features. Moreover, the noise data is uniformly sampled to obtain sampling noise at multiple positions (for example, position 1 to position N). Similarly, feature extraction 710 is performed on these sampled noises from multiple locations to generate noise features. Feature extraction 710 may be performed as described with reference to block 530, including framing, windowing, FFT, and the like. According to an embodiment of the present disclosure, the acoustic features of the speech data and the acoustic features of the noise data may have the same frame size, for example, both are 10 ms, so that they can be fused frame by frame.
如上所述,增强的语音声学特征和噪声特征具有相同大小的帧,例如10ms,因此语音特征和噪声特征的逐帧对齐可以产生时间对齐的融合特征。具体地,可以逐帧将所有采样得到的噪声特征(例如来源于位置1至N的噪声特征)叠加到增强后的语音特征上来形成融合特征。如上所述,每一帧由一组值在0至1之间的参数,即向量来表征,叠加是指通过向量加法将语音声学特征和噪声特征的对应参数相加。例如,在语音声学特征和噪声声学特征中的每一帧均由40维向量表示的情况下,融合特征中的一个帧同样由对应的40维向量来表示。As mentioned above, the enhanced speech acoustic features and noise features have the same frame size, say 10ms, so the frame-by-frame alignment of speech features and noise features can produce temporally aligned fused features. Specifically, all sampled noise features (for example, noise features from positions 1 to N) can be superimposed on the enhanced speech features frame by frame to form fusion features. As mentioned above, each frame is characterized by a set of parameters with values between 0 and 1, namely vectors, and superposition refers to adding the corresponding parameters of speech acoustic features and noise features through vector addition. For example, in the case that each frame in the speech acoustic feature and the noise acoustic feature is represented by a 40-dimensional vector, a frame in the fusion feature is also represented by a corresponding 40-dimensional vector.
应理解,叠加后的参数的值可能超出了0至1的范围。在这种情况下,可以进行全局归一化处理,以便使得融合特征的参数的值仍然在0至1的范围内。It should be understood that the value of the superimposed parameter may exceed the range of 0 to 1. In this case, a global normalization process can be performed so that the value of the parameter of the fusion feature is still in the range of 0 to 1.
在一些情况下,语音数据的时长可能不同于噪声数据的时长,并且各个位置的噪声数据的时长也可能不同。因此,特征融合还包括语音数据和噪声数据的对齐。In some cases, the duration of speech data may be different from that of noise data, and the duration of noise data may also be different for each location. Therefore, feature fusion also includes the alignment of speech data and noise data.
图8示出了根据本公开的实施例的特征融合过程800的示意图。图8中用于特征融合的增强的语音声学特征810和来源于多个位置的噪声特征820-1、820-2、……820-N(统称为820),按照帧序列被示出。在图8的增强的语音声学特征810包括L个帧。由于语音声学特征810和噪声声学特征820的持续时间可以不同,噪声特征820可以包括与L相同或不同的帧数。例如,噪声特征820-N可以包括例如R个帧。FIG. 8 shows a schematic diagram of a feature fusion process 800 according to an embodiment of the present disclosure. Enhanced speech acoustic features 810 for feature fusion and noise features 820-1, 820-2, ... 820-N (collectively 820) from multiple locations in FIG. 8 are shown in a sequence of frames. The enhanced speech acoustic features 810 in FIG. 8 include L frames. Since the duration of speech acoustic feature 810 and noise acoustic feature 820 may be different, noise feature 820 may include the same or a different number of frames than L. For example, noise signature 820-N may include, for example, R frames.
在一些实施例中,可以根据语音声学特征810的持续时间来调整噪声声学特征820,例如,通过选择噪声声学特征的一部分帧或者扩展噪声声学特征的帧,得到帧数(或持续时间)与语音声学特征相同的经调整的噪声声学特征。在二者对齐之后,叠加语音声学特征和经调整的噪声声学特征。In some embodiments, the noise acoustic feature 820 can be adjusted according to the duration of the speech acoustic feature 810, for example, by selecting a part of the frames of the noise acoustic feature or extending the frames of the noise acoustic feature, to obtain an adjusted noise acoustic feature whose number of frames (or duration) is the same as that of the speech acoustic feature. After the two are aligned, the speech acoustic feature and the adjusted noise acoustic feature are superimposed.
具体地,如果增强的语音声学特征810的帧数和噪声声学特征820的帧数相同(L=R),则逐帧地叠加语音声学特征810和噪声声学特征820。Specifically, if the number of frames of the enhanced speech acoustic feature 810 and the noise acoustic feature 820 are the same (L=R), the speech acoustic feature 810 and the noise acoustic feature 820 are superimposed frame by frame.
如果增强的语音声学特征810的帧数小于噪声声学特征820的帧数(L<R),则可以选择噪声声学特征820的前L帧来与增强的语音声学特征叠加,后R-L帧舍去不做处理。应理解,也可以选择噪声声学特征820中的后L帧、位于中间的L帧、或以任何其他方式选择的L帧来与语音声学特征810叠加。If the number of frames of the enhanced speech acoustic feature 810 is smaller than the number of frames of the noise acoustic feature 820 (L<R), the first L frames of the noise acoustic feature 820 can be selected to be superimposed with the enhanced speech acoustic feature, and the remaining R-L frames are discarded without further processing. It should be understood that the last L frames of the noise acoustic feature 820, the middle L frames, or L frames selected in any other way may also be selected to be superimposed with the speech acoustic feature 810.
如果增强的语音声学特征810的帧数大于噪声声学特征820的帧数(L>R),则可以将噪声声学特征820的第1帧叠加到增强后的语音声学特征的第L-R帧,第2帧叠加到L-R+1帧,以此类推,直到语音声学特征810的所有帧都被叠加噪声特征820的帧。例如,如图8所示,噪声特征820-N的帧数R小于语音声学特征的帧数,因此,其第1帧再一次地被叠加到语音声学特征的相应帧。应理解,图8仅是示意性的,语音声学特征和噪声特征的帧数不一定是图8所示的情况。If the number of frames of the enhanced speech acoustic feature 810 is greater than the number of frames of the noise acoustic feature 820 (L>R), the first frame of the noise acoustic feature 820 can be superimposed onto the (L-R)-th frame of the enhanced speech acoustic feature, the second frame onto the (L-R+1)-th frame, and so on, until all frames of the speech acoustic feature 810 have had frames of the noise feature 820 superimposed on them. For example, as shown in FIG. 8, the number of frames R of the noise feature 820-N is smaller than the number of frames of the speech acoustic feature, so its first frame is superimposed once again onto the corresponding frame of the speech acoustic feature. It should be understood that FIG. 8 is only schematic, and the numbers of frames of the speech acoustic features and the noise features are not necessarily those shown in FIG. 8.
按照上述方式,增强的语音声学特征810的第1帧与噪声特征820-1、820-2、…820-N的第1帧叠加,得到融合特征的第1帧,第2帧与噪声特征1、2、…N的第1帧叠加,得到融合特征830的第2帧,以此类推,生成了帧数为L的融合特征830。融合特征830被用于训练声学模型。In the manner described above, the first frame of the enhanced speech acoustic feature 810 is superimposed with the first frames of the noise features 820-1, 820-2, ..., 820-N to obtain the first frame of the fused feature, the second frame is superimposed with the frames of the noise features 1, 2, ..., N to obtain the second frame of the fused feature 830, and so on, generating a fused feature 830 with L frames. The fused feature 830 is used to train the acoustic model.
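Read literally, the alignment described above amounts to reusing the noise frames until every speech frame has been covered and then renormalizing the result. The numpy sketch below implements one plausible reading of FIGS. 7 and 8, in which shorter noise features are reused cyclically; the exact offset used in the embodiments may differ.

```python
import numpy as np

def fuse(speech: np.ndarray, noise_list: list) -> np.ndarray:
    """Superimpose every sampled noise feature onto the enhanced speech feature.

    speech: (L, D) enhanced speech frames; each entry of noise_list: (R_i, D) frames.
    """
    L = speech.shape[0]
    fused = speech.copy()
    for noise in noise_list:
        R = noise.shape[0]
        if R >= L:
            fused += noise[:L]                 # keep the first L frames, drop the rest
        else:
            idx = np.arange(L) % R             # reuse the shorter noise feature cyclically
            fused += noise[idx]
    # Global normalization so that the fused parameters stay within [0, 1].
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

fused = fuse(np.random.rand(500, 40), [np.random.rand(500, 40), np.random.rand(300, 40)])
```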
借助语音声学特征和噪声声学特征的这种融合方式,可以生成大量融合特征来作为声学模型的训练数据,并且所生成的融合特征能够真实地模拟特定真实噪声场所的环境声音,使得经过其训练的声学模型具有更高的准确率。With this way of fusing speech acoustic features and noise acoustic features, a large number of fused features can be generated as training data for the acoustic model, and the generated fused features can realistically simulate the ambient sound of a specific real noisy place, so that the acoustic model trained on them achieves higher accuracy.
以上描述了通过叠加目标语音数据的声学特征和噪声数据的声学特征来得到融合特征的过程。在另一些实施例中,可以对在框520得到的目标语音数据和噪声数据进行叠加来获取叠加后的音频数据;然后基于叠加后的音频数据获取融合声学特征。在这种情况下,针对目标语音数据和噪声数据的叠加同样可以基于帧数对齐的方式进行,并且提取融合声学特征也可以类似地进行。The foregoing describes a process of obtaining fused features by superimposing the acoustic features of the target speech data and the acoustic features of the noise data. In some other embodiments, the target speech data and the noise data obtained at block 520 may be superimposed to obtain superimposed audio data, and the fused acoustic features are then obtained based on the superimposed audio data. In this case, the superposition of the target speech data and the noise data can likewise be performed based on frame-number alignment, and the extraction of the fused acoustic features can be performed similarly.
返回图5,在框550,使用融合特征和语音数据的文本来训练声学模型。根据本公开的实施例,声学模型可以基于深度神经网络(DNN)架构。语音数据的文本是在步骤520标注的文本,例如,包括航班号或车次。在训练时,融合特征是声学模型的输入,而文本或者对应于文本的音素是对应于融合特征的标注数据。为了更好拾取机场/火车站等高噪声、强混响环境中的非人声广播音,声学模型使用多任务架构,包含声源标签的声源识别任务和语音标签的语音识别任务。Returning to FIG. 5, at block 550, the acoustic model is trained using the fused features and the text of the speech data. According to an embodiment of the present disclosure, the acoustic model may be based on a deep neural network (DNN) architecture. The text of the speech data is the text labeled in step 520, for example, including the flight number or train number. During training, the fused features are the input of the acoustic model, and the text, or the phonemes corresponding to the text, is the labeled data corresponding to the fused features. In order to better pick up non-human-voice broadcast sound in high-noise, strongly reverberant environments such as airports and train stations, the acoustic model uses a multi-task architecture, including a sound source recognition task with sound source labels and a speech recognition task with speech labels.
图9示出了根据本公开的实施例的用于训练声学模型的架构图。架构900包括深度神经网络910,深度神经网络910可以包括多个隐层912、914、916以及输入层和输出层(未示出)。深度神经网络910还可以包括更多或更少的隐层。FIG. 9 shows an architecture diagram for training an acoustic model according to an embodiment of the present disclosure. The architecture 900 includes a deep neural network 910, and the deep neural network 910 may include a plurality of hidden layers 912, 914, 916 as well as an input layer and an output layer (not shown). The deep neural network 910 may also include more or fewer hidden layers.
根据本公开的实施例,可以对深度神经网络910进行多任务训练,具体地,修改深度神经网络910的训练目标,在语音识别标签的基础上增加另一个声纹识别标签作为训练目标。如图所示,可以从深度神经网络910的最后一个隐层916得到输出作为声源特征。然后,将融合特征与声源特征拼接在一起作为深度神经网络910的输入。例如,可以将Y维声源特征与X维融合特征拼接,形成X+Y维的训练特征,作为深度神经网络的输入。在训练过程中,每轮迭代都用前一轮生成的声源特征更新输入特征,直至最终训练结束。在一些实施例中,首轮迭代输入的声源特征可以被全部置0。According to an embodiment of the present disclosure, multi-task training can be performed on the deep neural network 910; specifically, the training target of the deep neural network 910 is modified by adding another voiceprint recognition label as a training target on top of the speech recognition label. As shown in the figure, the output of the last hidden layer 916 of the deep neural network 910 can be taken as the sound source feature. Then, the fused feature and the sound source feature are spliced together as the input of the deep neural network 910. For example, the Y-dimensional sound source feature can be spliced with the X-dimensional fused feature to form an (X+Y)-dimensional training feature as the input of the deep neural network. During training, each iteration updates the input features with the sound source features generated in the previous iteration, until training ends. In some embodiments, the sound source features input in the first iteration may all be set to 0.
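The splicing of the Y-dimensional sound source feature onto the X-dimensional fused feature can be pictured with the following PyTorch sketch, which only shows the input construction and the two task heads; all dimensions and layer sizes are arbitrary assumptions, and the real network in FIG. 9 may be organized differently.

```python
import torch
import torch.nn as nn

X, Y = 40, 32            # fused-feature dimension and assumed sound-source-feature dimension

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, num_phones: int = 200, num_sources: int = 10):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(X + Y, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, Y), nn.ReLU(),             # last hidden layer: sound source feature
        )
        self.speech_head = nn.Linear(Y, num_phones)   # speech recognition labels
        self.source_head = nn.Linear(Y, num_sources)  # voiceprint / sound source labels

    def forward(self, fused: torch.Tensor, source_feat: torch.Tensor):
        h = self.hidden(torch.cat([fused, source_feat], dim=-1))
        return self.speech_head(h), self.source_head(h), h  # h feeds the next iteration

model = MultiTaskAcousticModel()
fused = torch.rand(8, X)
# First training round: the sound source features are initialized to all zeros.
speech_logits, source_logits, next_source_feat = model(fused, torch.zeros(8, Y))
```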
由此,利用本公开的结合声纹特征的多任务学习,可以从深度神经网络中提取到广播语音的声源特征作为声学模型学习的补偿,从而更精准地拾取到非人声广播音。Thus, using the multi-task learning combined with voiceprint features of the present disclosure, the sound source features of broadcast voices can be extracted from the deep neural network as compensation for acoustic model learning, thereby more accurately picking up non-human voice broadcast voices.
返回图5,在框560,根据声学模型、发音字典和语言模型构建语音识别模型。在一些实施例中,构建语音识别模型的过程可以包括接收来自用户终端的目标关键词,生成针对目标关键词的目标语言模型和目标发音字典模型,通过合并目标语言模型、目标发音字典模型、以及声学模型来构建所述语音识别模型,更具体地,可以参照上文关于图4的描述。Returning to FIG. 5, at block 560, a speech recognition model is constructed based on the acoustic model, the pronunciation dictionary and the language model. In some embodiments, the process of constructing the speech recognition model may include receiving the target keyword from the user terminal, generating a target language model and a target pronunciation dictionary model for the target keyword, and constructing the speech recognition model by merging the target language model, the target pronunciation dictionary model and the acoustic model; more specifically, reference may be made to the above description of FIG. 4.
根据本公开的实施例,经过离线训练的声学模型可以被存储到服务器的模型库中。当服务器从用户终端接收到目标关键词时,可以利用该声学模型、以及模型库中的其他模型(例如发音字典、语言模型)来构建用于识别该目标关键词的语音识别模型。这种专用于特定关键词的语音识别模型是轻量级的,适合部署到用户设备或智能代听设备。According to an embodiment of the present disclosure, the offline trained acoustic model may be stored in the model library of the server. When the server receives the target keyword from the user terminal, the acoustic model and other models in the model library (such as pronunciation dictionary and language model) can be used to construct a speech recognition model for recognizing the target keyword. This speech recognition model dedicated to specific keywords is lightweight and suitable for deployment to user equipment or smart listening devices.
示例装置和设备 Example Apparatus and Equipment
图10示出了根据本公开的实施例的用于语音代听的装置1000的示意框图。装置1000可以应用于用户终端,例如第一用户终端110或第二用户装置120。装置1000包括模型获取单元1010,用于获取目标关键词对应的目标语音识别模型。目标语音识别模型为根据目标关键词构建的,目标关键词为根据用户的出行信息获取。装置1000还包括更新单元1020。更新单元用于根据目标语音识别模型对本地语音识别模型进行更新,获得更新后的语音识别模型,本地语音识别模型为用户终端中存储的语音识别模型。装置1000还包括声音识别单元1020。声音识别单元1020用于当满足目标条件时,根据更新后的语音识别模型对采集到的环境声音进行识别,获得识别结果,环境声音为在用户终端所处的环境中采集到的声音信息。装置1000还包括提示单元1030。提示单元1030用于当识别结果指示环境声音中存在目标关键词对应的语音时,对用户进行提示。Fig. 10 shows a schematic block diagram of an apparatus 1000 for voice listening substitution according to an embodiment of the present disclosure. The apparatus 1000 may be applied to a user terminal, such as the first user terminal 110 or the second user device 120. The apparatus 1000 includes a model acquiring unit 1010, configured to acquire a target speech recognition model corresponding to a target keyword. The target speech recognition model is constructed based on the target keyword, and the target keyword is obtained based on the user's travel information. The apparatus 1000 also includes an updating unit 1020. The updating unit is configured to update the local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, where the local speech recognition model is the speech recognition model stored in the user terminal. The apparatus 1000 also includes a sound recognition unit 1020. The sound recognition unit 1020 is configured to recognize the collected ambient sound according to the updated speech recognition model when the target condition is met, and obtain a recognition result, where the ambient sound is sound information collected in the environment where the user terminal is located. The apparatus 1000 also includes a prompting unit 1030. The prompting unit 1030 is configured to prompt the user when the recognition result indicates that there is speech corresponding to the target keyword in the ambient sound.
在一些实施例中,装置1000还包括目标关键词获取单元。目标关键词获取单元用于获取用户的出行信息中的目标关键词。装置1000还包括发送单元。发送单元用于向服务器发送出行信息中的目标关键词,以用于服务器根据目标关键词构建目标语音识别模型。模型获取单元1010还用于从服务器获取目标语音识别模型。In some embodiments, the device 1000 further includes a target keyword acquiring unit. The target keyword acquisition unit is used to acquire target keywords in the user's travel information. The device 1000 also includes a sending unit. The sending unit is used to send the target keywords in the travel information to the server, so that the server can construct a target speech recognition model according to the target keywords. The model obtaining unit 1010 is also used to obtain the target speech recognition model from the server.
在一些实施例中,用户终端是第一用户终端并且连接到第二用户终端,所述方法包括:向所述第二用户终端发送标识信息,所述标识信息用于标识所述第一用户终端;其中所述获取目标关键词对应的目标语音识别模型,具体为:基于所述标识信息从所述第二用户终端接收所述目标语音识别模型,所述目标语音识别模型为所述第二用户终端根据所述目标关键词从所述服务器获取;其中所述第一用户终端是音频播放设备。In some embodiments, the user terminal is a first user terminal and is connected to a second user terminal, and the method includes: sending identification information to the second user terminal, the identification information being used to identify the first user terminal ; Wherein the acquiring the target speech recognition model corresponding to the target keyword is specifically: receiving the target speech recognition model from the second user terminal based on the identification information, the target speech recognition model being the second user The terminal obtains from the server according to the target keyword; wherein the first user terminal is an audio playback device.
在一些实施例中,目标语音识别模型是基于声学模型、目标发音字典模型和目标语言模型而生成的解码图。解码图是由目标关键词确定的语法约束规则的解码路径集合。所述目标发音字典模型是基于所述目标关键词的发音序列而获取的,并且所述目标语言模型是基于所述目标关键词的字之间的关系而获取的。In some embodiments, the target speech recognition model is a decoding map generated based on an acoustic model, a target pronunciation dictionary model, and a target language model. A decoding graph is a collection of decoding paths of grammatically constrained rules determined by target keywords. The target pronunciation dictionary model is acquired based on the pronunciation sequence of the target keyword, and the target language model is acquired based on the relationship between words of the target keyword.
在一些实施例中,声学模型通过如下方式生成:基于目标语音数据和噪声数据生成融合声学特征,目标语音数据为包括目标语音内容的音频数据,噪声数据为不包括所述目标语音内容的音频数据;通过融合特征和语音数据的文本信息进行训练来生成声学模型。In some embodiments, the acoustic model is generated in the following manner: fused acoustic features are generated based on target speech data and noise data, where the target speech data is audio data that includes the target speech content and the noise data is audio data that does not include the target speech content; and the acoustic model is generated by training with the fused features and the text information of the speech data.
在一些实施例中,出行信息具有关联的位置信息,其中声音识别单元1020还用于当用户的位置与出行信息的位置信息匹配时,根据更新后的语音识别模型对采集到的环境声音进行识别。In some embodiments, the travel information has associated location information, and the sound recognition unit 1020 is further configured to recognize the collected ambient sound according to the updated speech recognition model when the user's location matches the location information of the travel information.
在一些实施例中,出行信息还具有关联的时间信息,声音识别单元1020还用于当当前时间在时间信息之前的预定时间段内时,根据更新后的语音识别模型对采集到的环境声音进行识别。In some embodiments, the travel information also has associated time information, and the sound recognition unit 1020 is further configured to recognize the collected ambient sound according to the updated speech recognition model when the current time is within a predetermined period of time before the time information.
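As a purely illustrative sketch of such a target condition (a location match combined with a time window before departure), the following could be used; the two-hour window, the string comparison of locations and the helper name are assumptions for this example.

```python
from datetime import datetime, timedelta

def target_condition_met(now: datetime, departure: datetime,
                         user_location: str, travel_location: str,
                         window: timedelta = timedelta(hours=2)) -> bool:
    """Start listening only close to the departure time and at the matching place."""
    in_time_window = departure - window <= now <= departure
    at_location = user_location == travel_location
    return in_time_window and at_location

print(target_condition_met(datetime(2021, 6, 2, 6, 30), datetime(2021, 6, 2, 7, 45),
                           "Beijing South Railway Station", "Beijing South Railway Station"))
```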
在一些实施例中,提示单元1030还用于在用户终端上播放与目标关键词对应的语音。In some embodiments, the prompting unit 1030 is further configured to play the voice corresponding to the target keyword on the user terminal.
在一些实施例中,目标关键词是列车车次或航班号。In some embodiments, the target keyword is train number or flight number.
在一些实施例中,用户终端是智能手机、智能家电、可穿戴设备、音频播放设备、平板电脑和笔记本电脑之一。In some embodiments, the user terminal is one of a smart phone, a smart home appliance, a wearable device, an audio playback device, a tablet computer and a notebook computer.
图11示出了根据本公开的实施例的用于生成语音识别模型的装置1100的示意框图。装置1100可以用于例如服务器130。装置1100包括融合单元1110、训练单元1120和语音识别模型构建单元1130。融合单元1110用于基于目标语音数据和噪声数据生成融合声学特征。目标语音数据为包括目标语音内容的音频数据,噪声数据为不包括目标语音内容的音频数据。训练单元1120用于通过所述融合特征和所述语音数据的文本信息进行训练来生成声学模型。语音识别模型构建单元1130用于根据所述声学模型、发音字典和语言模型构建所述语音识别模型。Fig. 11 shows a schematic block diagram of an apparatus 1100 for generating a speech recognition model according to an embodiment of the present disclosure. Apparatus 1100 may be used in server 130, for example. The device 1100 includes a fusion unit 1110 , a training unit 1120 and a speech recognition model construction unit 1130 . The fusion unit 1110 is used to generate fusion acoustic features based on the target speech data and noise data. The target speech data is audio data including the target speech content, and the noise data is audio data not including the target speech content. The training unit 1120 is used for generating an acoustic model by performing training through the fusion feature and the text information of the speech data. The speech recognition model construction unit 1130 is configured to construct the speech recognition model according to the acoustic model, pronunciation dictionary and language model.
在一些实施例中,融合单元1110还用于对目标语音数据和噪声数据进行叠加来获取叠加后的音频数据;以及基于叠加后的音频数据获取融合声学特征。In some embodiments, the fusion unit 1110 is further configured to superimpose the target speech data and noise data to obtain superimposed audio data; and obtain fused acoustic features based on the superimposed audio data.
在一些实施例中,融合单元1110还用于基于目标语音数据获取第一声学特征,基于噪声数据获取第二声学特征;基于第一声学特征和第二声学特征获取融合声学特征。In some embodiments, the fusion unit 1110 is also used to obtain the first acoustic feature based on the target speech data, and obtain the second acoustic feature based on the noise data; obtain the fusion acoustic feature based on the first acoustic feature and the second acoustic feature.
在一些实施例中,融合单元1110还用于从目标语音数据生成带噪声学特征;通过增强所述带噪声学特征来生成第一声学特征。In some embodiments, the fusion unit 1110 is further configured to generate noisy acoustic features from the target speech data; and generate first acoustic features by enhancing the noisy acoustic features.
在一些实施例中,融合单元1110还用于对带噪声学特征进行LASSO变换,以及对经LASSO变换的声学特征进行bottleneck网络处理,以获取第一声学特征。In some embodiments, the fusion unit 1110 is further configured to perform LASSO transformation on the acoustic features with noise, and perform bottleneck network processing on the acoustic features transformed by LASSO, so as to obtain the first acoustic features.
在一些实施例中,融合单元1110还用于叠加第一声学特征和第二声学特征,以得到叠加的声学特征;以及通过对叠加的声学特征进行归一化处理,生成融合声学特征。In some embodiments, the fusion unit 1110 is further configured to superimpose the first acoustic feature and the second acoustic feature to obtain superimposed acoustic features; and generate a fusion acoustic feature by normalizing the superimposed acoustic features.
在一些实施例中,融合单元1110还用于获取所述第一声学特征的帧数,第一声学特征的帧数根据目标语音数据的持续时间确定;根据第一声学特征的帧数基于第二声学特征构建第三声学特征;以及叠加第一声学特征和第三声学特征获取融合声学特征。In some embodiments, the fusion unit 1110 is further configured to obtain the number of frames of the first acoustic feature, where the number of frames of the first acoustic feature is determined according to the duration of the target speech data; construct a third acoustic feature based on the second acoustic feature according to the number of frames of the first acoustic feature; and superimpose the first acoustic feature and the third acoustic feature to obtain the fused acoustic feature.
在一些实施例中,所述声学模型是神经网络模型,并且训练单元1120用于从声学模型的隐藏层提取声源特征;以及将声源特征和融合声学特征作为声学模型的输入特征来训练声学模型。In some embodiments, the acoustic model is a neural network model, and the training unit 1120 is configured to extract sound source features from a hidden layer of the acoustic model, and to train the acoustic model using the sound source features and the fused acoustic features as input features of the acoustic model.
在一些实施例中,语音识别模型构建单元1130还用于接收来自用户终端的目标关键词;根据所述目标关键词的发音序列从所述发音字典获取目标发音字典模型;根据所述目标关键词的字之间的关系从所述语言模型获取目标语言模型;以及通过合并所述声学模型、所述目标发音字典模型和所述目标语言模型来构建所述语音识别模型。In some embodiments, the speech recognition model construction unit 1130 is further configured to receive the target keyword from the user terminal; obtain a target pronunciation dictionary model from the pronunciation dictionary according to the pronunciation sequence of the target keyword; obtain a target language model from the language model according to the relationships between the characters of the target keyword; and construct the speech recognition model by merging the acoustic model, the target pronunciation dictionary model and the target language model.
图12示出了根据本公开的另一实施例的用于语音代听的装置1200。装置1200可以应用于服务器130。装置1200包括目标关键词获取单元1210、语音识别模型构建单元1220和发送单元1230。目标关键词获取单元1210用于获取用户出行信息中的与用户出行方式相关的目标关键词。语音识别模型构建单元1220用于构建与所述目标关键词对应的目标语音识别模型。发送单元1230用于向用户终端发送所述语音识别模型,所述语音识别模型用于当满足目标条件时对用户终端处的环境声音进行识别,以用于确定所述环境声音中是否存在所述目标关键词。Fig. 12 shows an apparatus 1200 for voice listening substitution according to another embodiment of the present disclosure. The apparatus 1200 can be applied to the server 130. The apparatus 1200 includes a target keyword acquisition unit 1210, a speech recognition model construction unit 1220 and a sending unit 1230. The target keyword acquisition unit 1210 is configured to acquire, from the user's travel information, a target keyword related to the user's travel mode. The speech recognition model construction unit 1220 is configured to construct a target speech recognition model corresponding to the target keyword. The sending unit 1230 is configured to send the speech recognition model to the user terminal, where the speech recognition model is used to recognize the ambient sound at the user terminal when the target condition is met, so as to determine whether the target keyword exists in the ambient sound.
图13示出了可以用来实施本公开的实施例的示例设备1300的示意性框图。如图所示,设备1300包括中央处理单元(CPU)1301,其可以根据存储在只读存储器(ROM)1302中的计算机程序指令或者从存储单元1308加载到随机访问存储器(RAM)1303中的计算机程序指令,来执行各种适当的动作和处理。在RAM 1303中,还可存储设备1300操作所需的各种程序和数据。CPU 1301、ROM 1302以及RAM 1303通过总线1304彼此相连。输入/输出(I/O)接口1305也连接至总线1304。FIG. 13 shows a schematic block diagram of an example device 1300 that may be used to implement embodiments of the present disclosure. As shown, the device 1300 includes a central processing unit (CPU) 1301, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 1302 or computer program instructions loaded from a storage unit 1308 into a random access memory (RAM) 1303. Various programs and data required for the operation of the device 1300 can also be stored in the RAM 1303. The CPU 1301, the ROM 1302 and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
设备1300中的多个部件连接至I/O接口1305,包括:输入单元1306,例如键盘、鼠标等;输出单元1307,例如各种类型的显示器、扬声器等;存储单元1308,例如磁盘、光盘等;以及通信单元1309,例如网卡、调制解调器、无线通信收发机等。通信单元1309允许设备1300通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。A plurality of components in the device 1300 are connected to the I/O interface 1305, including: an input unit 1306, such as a keyboard, a mouse, etc.; an output unit 1307, such as various types of displays, speakers, etc.; a storage unit 1308, such as a magnetic disk, an optical disk, etc.; and a communication unit 1309, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
上文所描述的各个过程和处理可由处理单元1301执行。例如,在一些实施例中,上述各个过程和处理可被实现为计算机软件程序,其被有形地包含于机器可读介质,例如存储单元1308。在一些实施例中,计算机程序的部分或者全部可以经由ROM 1302和/或通信单元1309而被载入和/或安装到设备1300上。当计算机程序被加载到RAM 1303并由CPU 1301执行时,可以执行上文描述的过程和处理的一个或多个动作。The various procedures and processes described above can be executed by the processing unit 1301. For example, in some embodiments, the procedures and processes described above may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the CPU 1301, one or more actions of the procedures and processes described above may be performed.
本公开可以是方法、装置、系统和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于执行本公开的各个方面的计算机可读程序指令。The present disclosure may be a method, apparatus, system and/or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。A computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or flash memory), static random access memory (SRAM), compact disc read only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanically encoded device, such as a printer with instructions stored thereon A hole card or a raised structure in a groove, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or Source or object code written in any combination, including object-oriented programming languages—such as Smalltalk, C++, etc., and conventional procedural programming languages—such as “C” or similar programming languages. Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement. In cases involving a remote computer, the remote computer can be connected to the user computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as via the Internet using an Internet service provider). connect). In some embodiments, an electronic circuit, such as a programmable logic circuit, field programmable gate array (FPGA), or programmable logic array (PLA), can be customized by utilizing state information of computer-readable program instructions, which can Various aspects of the present disclosure are implemented by executing computer readable program instructions.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium and cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two successive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present disclosure have been described above. The foregoing description is illustrative rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (33)

1. A method for voice listening substitution, applied to a user terminal, the method comprising:
    acquiring a target speech recognition model corresponding to a target keyword, wherein the target speech recognition model is constructed according to the target keyword, and the target keyword is obtained according to travel information of a user;
    updating a local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, wherein the local speech recognition model is a speech recognition model stored in the user terminal;
    when a target condition is satisfied, recognizing collected ambient sound according to the updated speech recognition model to obtain a recognition result, wherein the ambient sound is sound information collected in the environment in which the user terminal is located; and
    when the recognition result indicates that the target keyword is present in the ambient sound, prompting the user.
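Illustrative sketch (not part of the claims): a minimal Python rendering of the claim 1 flow on a user terminal. The recognizer interface (`recognize()`), the microphone frame queue, and the prompt string are assumed placeholders, not interfaces defined by this disclosure.

```python
import queue

def listen_for_keyword(target_model, local_store, mic_frames, target_keyword,
                       condition_met=lambda: True):
    """Update the locally stored model with the keyword-specific one, then scan ambient sound."""
    local_store["model"] = target_model              # the "updated speech recognition model"
    while condition_met():                           # the target condition (e.g. a time/place gate)
        try:
            frame = mic_frames.get(timeout=1.0)      # ambient sound around the user terminal
        except queue.Empty:
            continue
        result = local_store["model"].recognize(frame)   # assumed recognizer interface
        if target_keyword in result:
            return f"prompt: detected '{target_keyword}'"  # e.g. play a voice on the terminal
    return None
```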
2. The method according to claim 1, wherein acquiring the target speech recognition model corresponding to the target keyword comprises:
    acquiring the travel information of the user;
    extracting, according to the travel information, a target keyword related to the user's travel mode;
    sending the target keyword to a server, so that the server constructs the target speech recognition model according to the target keyword; and
    acquiring the target speech recognition model from the server.
3. The method according to claim 1, wherein the user terminal is a first user terminal connected to a second user terminal, and the method comprises:
    sending identification information to the second user terminal, wherein the identification information is used to identify the first user terminal;
    wherein acquiring the target speech recognition model corresponding to the target keyword specifically comprises:
    receiving the target speech recognition model from the second user terminal based on the identification information, wherein the target speech recognition model is obtained by the second user terminal from the server according to the target keyword;
    wherein the first user terminal is an audio playback device.
4. The method according to claim 1, wherein the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model, the decoding graph is a set of decoding paths under grammar constraint rules determined by the target keyword, the target pronunciation dictionary model is obtained based on a pronunciation sequence of the target keyword, and the target language model is obtained based on relationships between the characters of the target keyword.
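Illustrative sketch (not part of the claims): one way to realize the grammar-constrained decoding paths of claim 4 is to expand the target keyword through a pronunciation dictionary into the only phone sequence the decoder may follow. The dictionary entries below are invented for illustration and are not taken from the disclosure.

```python
# Assumed toy lexicon mapping characters of a train number to phone sequences.
PRONUNCIATION_DICT = {
    "G": ["g", "e"],
    "1": ["y", "i"],
    "2": ["e", "r"],
    "3": ["s", "a", "n"],
}

def keyword_decoding_paths(keyword):
    """Expand the target keyword into the only phone path the decoder may follow."""
    path = []
    for ch in keyword:
        if ch not in PRONUNCIATION_DICT:
            raise KeyError(f"no pronunciation for '{ch}'")
        path.extend(PRONUNCIATION_DICT[ch])
    # The grammar constraint admits just this keyword, with optional silence at both ends.
    return [["sil"] + path + ["sil"]]

print(keyword_decoding_paths("G123"))
```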
5. The method according to claim 4, wherein the acoustic model is generated by training with a fused feature and text information of target speech data, the fused feature is generated based on the target speech data and noise data, the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content.
6. The method according to claim 1, further comprising:
    acquiring, according to the travel information, location information associated with the user's travel mode;
    wherein, when the target condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model comprises:
    when the location of the user matches the location information, recognizing the collected ambient sound according to the updated speech recognition model.
7. The method according to claim 1, further comprising:
    acquiring, according to the travel information, time information associated with the user's travel mode;
    wherein, when the target condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model comprises:
    when a time condition is satisfied, recognizing the collected ambient sound according to the updated speech recognition model, wherein the time condition is determined according to the time information.
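Illustrative sketch (not part of the claims): the target conditions of claims 6 and 7 can be read together as a joint place/time gate derived from the travel information. The 2 km radius and 3 hour lead time below are assumptions, not values given in the disclosure.

```python
import math
from datetime import datetime, timedelta

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def should_listen(now, user_pos, station_pos, departure_time,
                  radius_km=2.0, lead=timedelta(hours=3)):
    """Enable ambient-sound recognition only near the departure time and place."""
    time_ok = departure_time - lead <= now <= departure_time
    return time_ok and haversine_km(user_pos, station_pos) <= radius_km

# Example: listening is active inside the station shortly before departure.
print(should_listen(datetime(2021, 7, 16, 9, 0), (31.249, 121.456), (31.249, 121.455),
                    datetime(2021, 7, 16, 10, 30)))
```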
8. The method according to claim 1, wherein prompting the user comprises playing, on the user terminal, a voice corresponding to the target keyword.
9. The method according to any one of claims 1 to 8, wherein the target keyword is a train number or a flight number.
10. The method according to claim 1, wherein the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a laptop computer.
11. An apparatus for voice listening substitution, for use in a user terminal, comprising:
    a model acquisition unit, configured to acquire a target speech recognition model corresponding to a target keyword, wherein the target speech recognition model is constructed according to the target keyword, and the target keyword is obtained according to travel information of a user;
    an updating unit, configured to update a local speech recognition model according to the target speech recognition model to obtain an updated speech recognition model, wherein the local speech recognition model is a speech recognition model stored in the user terminal;
    a sound recognition unit, configured to, when a target condition is satisfied, recognize collected ambient sound according to the updated speech recognition model to obtain a recognition result, wherein the ambient sound is sound information collected in the environment in which the user terminal is located; and
    a prompting unit, configured to prompt the user when the recognition result indicates that the target keyword is present in the ambient sound.
12. The apparatus according to claim 11, further comprising:
    a travel information acquisition unit, configured to acquire the travel information of the user;
    a target keyword acquisition unit, configured to extract, according to the travel information, a target keyword related to the user's travel mode; and
    a sending unit, configured to send the target keyword to a server, so that the server constructs the target speech recognition model according to the target keyword;
    wherein the model acquisition unit is further configured to acquire the target speech recognition model from the server.
13. The apparatus according to claim 11, wherein the user terminal is a first user terminal connected to a second user terminal, the apparatus further comprising:
    a sending unit, configured to send identification information to the second user terminal, wherein the identification information is used to identify the first user terminal;
    wherein the model acquisition unit is further configured to receive the target speech recognition model from the second user terminal based on the identification information, and the target speech recognition model is obtained by the second user terminal from the server according to the target keyword;
    wherein the first user terminal is an audio playback device.
14. The apparatus according to claim 11, wherein the target speech recognition model is a decoding graph generated based on an acoustic model, a target pronunciation dictionary model, and a target language model, the decoding graph is a set of decoding paths under grammar constraint rules determined by the target keyword, the target pronunciation dictionary model is obtained based on a pronunciation sequence of the target keyword, and the target language model is obtained based on relationships between the characters of the target keyword.
15. The apparatus according to claim 14, wherein the acoustic model is generated by training with a fused feature and text information of target speech data, the fused feature is generated based on the target speech data and noise data, the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content.
16. The apparatus according to claim 11, further comprising:
    a travel location information acquisition unit, configured to acquire, according to the travel information, location information associated with the user's travel mode;
    wherein the sound recognition unit is further configured to: when the location of the user matches the location information, recognize the collected ambient sound according to the updated speech recognition model.
17. The apparatus according to claim 11, further comprising:
    a travel time information acquisition unit, configured to acquire, according to the travel information, time information associated with the user's travel mode;
    wherein the sound recognition unit is further configured to: when a time condition is satisfied, recognize the collected ambient sound according to the updated speech recognition model, wherein the time condition is determined according to the time information.
18. The apparatus according to claim 11, wherein the prompting unit is further configured to play, on the user terminal, a voice corresponding to the target keyword.
19. The apparatus according to any one of claims 11 to 18, wherein the target keyword is a train number or a flight number.
20. The apparatus according to claim 11, wherein the user terminal is one of a smartphone, a smart home appliance, a wearable device, an audio playback device, a tablet computer, and a laptop computer.
21. A method for generating a speech recognition model, comprising:
    generating a fused acoustic feature based on target speech data and noise data, wherein the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content;
    generating an acoustic model by training with the fused acoustic feature and text information of the target speech data; and
    constructing the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
22. The method according to claim 21, wherein generating the fused acoustic feature comprises:
    superimposing the target speech data and the noise data to obtain superimposed audio data; and
    acquiring the fused acoustic feature based on the superimposed audio data.
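Illustrative sketch (not part of the claims): claim 22 superimposes the waveforms first and then extracts features from the mixture. The SNR-based scaling and the crude band-energy feature extractor below are assumptions standing in for the front end, which the claims do not specify.

```python
import numpy as np

def superimpose(target_wave, noise_wave, snr_db=10.0):
    """Mix noise (e.g. station announcements) into the target speech at a chosen SNR."""
    noise = np.resize(noise_wave, target_wave.shape)           # loop/trim noise to same length
    p_t = np.mean(target_wave ** 2) + 1e-12
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_n * 10 ** (snr_db / 10)))
    return target_wave + scale * noise

def band_energy_features(wave, frame_len=400, hop=160, n_bins=40):
    """A crude stand-in for acoustic feature extraction: log power in evenly split bands."""
    frames = [wave[i:i + frame_len] for i in range(0, len(wave) - frame_len, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame_len), axis=1)) ** 2
    bands = np.array_split(spec, n_bins, axis=1)
    return np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-8)

rng = np.random.default_rng(0)
mixed = superimpose(rng.standard_normal(16000), rng.standard_normal(8000))
fused_feature = band_energy_features(mixed)    # feature of the superimposed audio (claim 22)
print(fused_feature.shape)
```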
23. The method according to claim 21, wherein generating the fused acoustic feature comprises:
    acquiring a first acoustic feature based on the target speech data;
    acquiring a second acoustic feature based on the noise data; and
    acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature.
24. The method according to claim 23, wherein acquiring the first acoustic feature based on the target speech data comprises:
    generating a noisy acoustic feature from the target speech data; and
    generating the first acoustic feature by enhancing the noisy acoustic feature.
25. The method according to claim 24, wherein enhancing the noisy acoustic feature comprises:
    performing a LASSO transform on the noisy acoustic feature; and
    performing bottleneck network processing on the LASSO-transformed acoustic feature to obtain the first acoustic feature.
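Illustrative sketch (not part of the claims): claim 25 names a LASSO transform followed by bottleneck network processing but does not fix their structure. The sketch below approximates them with an L1-regularised linear layer and a small bottleneck network; the layer sizes and the L1 penalty are assumptions about the intended shape, not the disclosed design.

```python
import torch
import torch.nn as nn

class EnhanceNoisyFeatures(nn.Module):
    """Assumed enhancement path: sparse linear transform, then a bottleneck network."""
    def __init__(self, feat_dim=40, bottleneck_dim=16):
        super().__init__()
        self.lasso = nn.Linear(feat_dim, feat_dim)             # weights kept sparse via L1
        self.bottleneck = nn.Sequential(
            nn.Linear(feat_dim, bottleneck_dim), nn.ReLU(),
            nn.Linear(bottleneck_dim, feat_dim),
        )

    def forward(self, noisy_feats):
        return self.bottleneck(torch.relu(self.lasso(noisy_feats)))

    def l1_penalty(self):
        return self.lasso.weight.abs().sum()                   # add to the training loss

model = EnhanceNoisyFeatures()
first_acoustic_feature = model(torch.randn(100, 40))           # 100 frames of 40-dim features
print(first_acoustic_feature.shape, float(model.l1_penalty()))
```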
26. The method according to claim 23, wherein acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature comprises:
    superimposing the first acoustic feature and the second acoustic feature to obtain a superimposed acoustic feature; and
    generating the fused acoustic feature by normalizing the superimposed acoustic feature.
27. The method according to claim 23, wherein acquiring the fused acoustic feature based on the first acoustic feature and the second acoustic feature comprises:
    acquiring a frame count of the first acoustic feature, wherein the frame count of the first acoustic feature is determined according to the duration of the target speech data;
    constructing a third acoustic feature based on the second acoustic feature according to the frame count of the first acoustic feature; and
    superimposing the first acoustic feature and the third acoustic feature to acquire the fused acoustic feature.
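Illustrative sketch (not part of the claims): claims 23, 26, and 27 fuse the two feature streams by matching the noise-side features to the frame count of the speech-side features, superimposing them, and normalizing the result. The wrap-around tiling and per-dimension normalization below are assumptions about the unspecified details.

```python
import numpy as np

def fuse_features(first_feats, second_feats):
    """Tile/trim the noise-side features to the speech frame count, superimpose, normalise."""
    n_frames = first_feats.shape[0]                       # frame count follows the target speech
    third_feats = np.resize(second_feats, (n_frames, second_feats.shape[1]))
    stacked = first_feats + third_feats                   # superimposed acoustic feature
    mean = stacked.mean(axis=0, keepdims=True)
    std = stacked.std(axis=0, keepdims=True) + 1e-8
    return (stacked - mean) / std                         # fused acoustic feature

rng = np.random.default_rng(1)
fused = fuse_features(rng.standard_normal((120, 40)), rng.standard_normal((75, 40)))
print(fused.shape)   # (120, 40): same frame count as the target speech features
```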
28. The method according to claim 21, wherein the acoustic model is a neural network model, and the training comprises:
    extracting a sound source feature from a hidden layer of the acoustic model; and
    training the acoustic model with the sound source feature and the fused acoustic feature as input features of the acoustic model.
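Illustrative sketch (not part of the claims): claim 28 reuses a hidden-layer activation as a sound source feature and feeds it back as an additional input. Reading the sound source feature as a mean-pooled utterance embedding, and the layer sizes, are assumptions; the disclosure does not fix these choices.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Assumed structure: a hidden layer doubles as a sound-source embedding that is
    concatenated with the fused features before the output classifier."""
    def __init__(self, feat_dim=40, hidden_dim=128, n_states=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim + hidden_dim, n_states)

    def forward(self, fused_feats):
        hidden = self.encoder(fused_feats)                    # per-frame hidden activations
        source_feat = hidden.mean(dim=0, keepdim=True)        # utterance-level source feature
        source_feat = source_feat.expand(fused_feats.shape[0], -1)
        return self.classifier(torch.cat([hidden, source_feat], dim=-1))

logits = AcousticModel()(torch.randn(120, 40))
print(logits.shape)   # (120, 200): per-frame acoustic-state scores, trained against text labels
```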
29. The method according to claim 21, wherein constructing the speech recognition model according to the acoustic model, the pronunciation dictionary, and the language model comprises:
    receiving a target keyword from a user terminal;
    acquiring a target pronunciation dictionary model from the pronunciation dictionary according to a pronunciation sequence of the target keyword;
    acquiring a target language model from the language model according to relationships between the characters of the target keyword; and
    constructing the speech recognition model by combining the acoustic model, the target pronunciation dictionary model, and the target language model.
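Illustrative sketch (not part of the claims): claim 29 builds the keyword recognizer on the server by extracting a keyword-specific pronunciation dictionary and language model and combining them with the acoustic model. The class and function names, the toy lexicon, and the degenerate bigram language model below are illustrative assumptions only.

```python
class KeywordRecognizer:
    """Container for the combined acoustic model, target lexicon, and target language model."""
    def __init__(self, acoustic_model, pron_dict, language_model):
        self.acoustic_model = acoustic_model
        self.pron_dict = pron_dict
        self.language_model = language_model

def build_recognizer(acoustic_model, full_lexicon, keyword):
    target_pron = {ch: full_lexicon[ch] for ch in keyword}     # target pronunciation dictionary
    bigrams = list(zip(keyword, keyword[1:]))                  # character-to-character relations
    target_lm = {pair: 1.0 for pair in bigrams}                # degenerate target language model
    return KeywordRecognizer(acoustic_model, target_pron, target_lm)

recognizer = build_recognizer(acoustic_model=None,
                              full_lexicon={"G": ["g", "e"], "1": ["y", "i"],
                                            "2": ["e", "r"], "3": ["s", "a", "n"]},
                              keyword="G123")
print(recognizer.pron_dict, recognizer.language_model)
```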
30. An apparatus for generating a speech recognition model, comprising:
    a fusion unit, configured to generate a fused acoustic feature based on target speech data and noise data, wherein the target speech data is audio data that includes target speech content, and the noise data is audio data that does not include the target speech content;
    a training unit, configured to generate an acoustic model by training with the fused acoustic feature and text information of the target speech data; and
    a speech recognition model construction unit, configured to construct the speech recognition model according to the acoustic model, a pronunciation dictionary, and a language model.
31. An electronic device, comprising:
    at least one computing unit; and
    at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, wherein the instructions, when executed by the at least one computing unit, cause the device to perform the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
32. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
33. A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, perform the method according to any one of claims 1 to 10, or the method according to any one of claims 21 to 29.
PCT/CN2021/106942 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium WO2023283965A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180093163.7A CN117178320A (en) 2021-07-16 2021-07-16 Method, apparatus, electronic device and medium for voice hearing and generating voice recognition model
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Publications (1)

Publication Number Publication Date
WO2023283965A1 true WO2023283965A1 (en) 2023-01-19

Family

ID=84918923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106942 WO2023283965A1 (en) 2021-07-16 2021-07-16 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN117178320A (en)
WO (1) WO2023283965A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531155A (en) * 2015-09-10 2017-03-22 三星电子株式会社 Apparatus and method for generating acoustic model, and apparatus and method for speech recognition
CN109087631A (en) * 2018-08-08 2018-12-25 北京航空航天大学 A kind of Vehicular intelligent speech control system and its construction method suitable for complex environment
CN109599093A (en) * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection
US10365887B1 (en) * 2016-03-25 2019-07-30 Amazon Technologies, Inc. Generating commands based on location and wakeword
CN110232916A (en) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110600014A (en) * 2019-09-19 2019-12-20 深圳酷派技术有限公司 Model training method and device, storage medium and electronic equipment
CN110708630A (en) * 2019-11-12 2020-01-17 广州酷狗计算机科技有限公司 Method, device and equipment for controlling earphone and storage medium
CN111601215A (en) * 2020-04-20 2020-08-28 南京西觉硕信息科技有限公司 Scene-based key information reminding method, system and device

Also Published As

Publication number Publication date
CN117178320A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
JP7114660B2 (en) Hot word trigger suppression for recording media
US11195531B1 (en) Accessory for a voice-controlled device
US9602938B2 (en) Sound library and method
US10311872B2 (en) Utterance classifier
US10869154B2 (en) Location-based personal audio
US11039240B2 (en) Adaptive headphone system
US20170213552A1 (en) Detection of audio public announcements by a mobile device
CN108351872A (en) Equipment selection for providing response
JP6987124B2 (en) Interpreters and methods (DEVICE AND METHOD OF TRANSLATING A LANGUAGE)
CN109643548A (en) System and method for content to be routed to associated output equipment
WO2022126040A1 (en) User speech profile management
US10002611B1 (en) Asynchronous audio messaging
WO2023283965A1 (en) Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium
WO2023040658A1 (en) Speech interaction method and electronic device
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
US20220157305A1 (en) Information processing apparatus, information processing method, and program
WO2020208972A1 (en) Response generation device and response generation method
JP6772468B2 (en) Management device, information processing device, information provision system, language information management method, information provision method, and operation method of information processing device
US20240087597A1 (en) Source speech modification based on an input speech characteristic
CN118095301A (en) Synchronous translation method, electronic device and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949746

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21949746

Country of ref document: EP

Kind code of ref document: A1