CN117275467A - Voice instruction recognition method and device in noise environment

Voice instruction recognition method and device in noise environment

Info

Publication number
CN117275467A
CN117275467A (application CN202311277131.XA)
Authority
CN
China
Prior art keywords
syllable
voice
syllables
determining
unrecognizable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311277131.XA
Other languages
Chinese (zh)
Inventor
于红超
徐开庭
万为侗
李洋全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Seres New Energy Automobile Design Institute Co Ltd
Original Assignee
Chongqing Seres New Energy Automobile Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Seres New Energy Automobile Design Institute Co Ltd filed Critical Chongqing Seres New Energy Automobile Design Institute Co Ltd
Priority to CN202311277131.XA
Publication of CN117275467A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 Syllables being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice instruction recognition method and device for use in a noise environment. The method comprises the following steps: collecting a voice to be recognized in a noise environment, and determining recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique; determining fuzzy sound information in the unrecognizable syllables; determining, based on a semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables by using the recognizable syllables and the fuzzy sound information; determining, according to scene information corresponding to the voice to be recognized, a target syllable corresponding to the unrecognizable syllables from the alternative syllables; and determining a voice instruction corresponding to the voice to be recognized according to the target syllable. By integrating voice recognition and semantic analysis techniques, the method and device provide an efficient and accurate solution to the problem of voice instruction recognition in noisy environments, thereby improving the performance and user experience of vehicle-mounted voice assistant systems.

Description

Voice instruction recognition method and device in noise environment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for recognizing a voice command in a noisy environment.
Background
The vehicle-mounted voice assistant is an intelligent assistant installed in a car that can be operated by voice instructions to complete many tasks, including navigation, entertainment, information queries, and communication. It not only simplifies the driver's operations while driving but also improves driving safety: the driver no longer needs to be distracted by operating a mobile phone or in-vehicle buttons and can concentrate on driving. At the same time, the vehicle-mounted voice assistant is convenient and efficient to use, making the driving experience more comfortable.
Although convenient, the recognition capability of the vehicle-mounted voice assistant is degraded in a noisy environment. When the vehicle travels through a busy city or on a highway, external noise interferes with the voice assistant's recognition. In such noisy environments, the voice assistant often cannot recognize the user's complete voice instruction, and its ability to recognize and parse instructions is greatly compromised. The user then needs to repeat or restate the sentence, which undoubtedly reduces the user experience.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for recognizing a voice command in a noisy environment, so as to solve the prior-art problem of degraded user experience caused by the limited recognition capability of voice assistants in noisy environments.
In a first aspect of an embodiment of the present application, a method for recognizing a voice command in a noise environment is provided, including:
collecting a voice to be recognized in a noise environment, and determining recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique;
determining fuzzy sound information in the unrecognizable syllables;
determining, based on a semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables by using the recognizable syllables and the fuzzy sound information;
determining, according to scene information corresponding to the voice to be recognized, a target syllable corresponding to the unrecognizable syllables from the alternative syllables;
and determining a voice instruction corresponding to the voice to be recognized according to the target syllable.
In a second aspect of the embodiments of the present application, a voice command recognition apparatus in a noise environment is provided, including:
a syllable determination module configured to collect a voice to be recognized in a noise environment and determine recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique;
a fuzzy sound information determination module configured to determine fuzzy sound information in the unrecognizable syllables;
an alternative syllable determination module configured to determine, based on a semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables by using the recognizable syllables and the fuzzy sound information;
a target syllable determination module configured to determine, according to scene information corresponding to the voice to be recognized, a target syllable corresponding to the unrecognizable syllables from the alternative syllables;
and a voice instruction determination module configured to determine a voice instruction corresponding to the voice to be recognized according to the target syllable.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present application have the following beneficial effects. A voice to be recognized is collected in a noise environment, and recognizable syllables and unrecognizable syllables in the voice to be recognized are determined based on a voice recognition technique; fuzzy sound information in the unrecognizable syllables is determined; at least one alternative syllable corresponding to the unrecognizable syllables is determined, based on a semantic analysis technique, by using the recognizable syllables and the fuzzy sound information; a target syllable corresponding to the unrecognizable syllables is determined from the alternative syllables according to scene information corresponding to the voice to be recognized; and a voice instruction corresponding to the voice to be recognized is determined according to the target syllable. By integrating voice recognition and semantic analysis techniques, the embodiments of the present application provide an efficient and accurate solution to the problem of voice instruction recognition in noisy environments, thereby improving the performance and user experience of vehicle-mounted voice assistant systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following briefly introduces the drawings needed for the embodiments or the description of the prior art. Obviously, the drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a voice command recognition method in a noise environment according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a voice command recognition device in a noise environment according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
A voice command recognition method and apparatus in a noisy environment according to embodiments of the present application will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a voice command recognition method in a noise environment according to an embodiment of the present application. As shown in fig. 1, the voice command recognition method in the noise environment includes:
s101, collecting voice to be recognized in a noise environment, and determining recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technology;
s102, determining fuzzy sound information in unrecognizable syllables;
s103, determining at least one alternative syllable corresponding to the unrecognizable syllable by utilizing the identifiable syllable and the fuzzy sound information based on a semantic analysis technology;
s104, determining target syllables corresponding to unrecognizable syllables from the candidate syllables according to the scene information corresponding to the voice to be recognized;
s105, determining a voice instruction corresponding to the voice to be recognized according to the target syllable.
In particular, the vehicle-mounted voice assistant is an intelligent assistant installed in a car that can be operated by voice instructions to complete many tasks, including navigation, entertainment, information queries, and communication. It not only simplifies the driver's operations while driving but also improves driving safety: the driver no longer needs to be distracted by operating a mobile phone or in-vehicle buttons and can concentrate on driving. At the same time, the vehicle-mounted voice assistant is convenient and efficient to use, making the driving experience more comfortable.
The advent of the vehicle-mounted voice assistant does provide drivers with much convenience and improved driving safety, because it allows drivers to perform various tasks through voice commands without being distracted by a mobile phone or vehicle controls, enabling them to focus on driving. However, noisy environments pose a challenge to the recognition capability of the on-board voice assistant.
The particular difficulty of noise interference is that it often affects only part of a voice instruction rather than making the entire sentence unrecognizable. Noise may obscure some key words or syllables, so the voice assistant can still recognize part of the content but cannot fully understand the user's intent. In this case, the driver may need to repeatedly restate or correct parts of the instruction to ensure that the voice assistant performs the task correctly.
Such partially recognized situations add to the complexity of operation, because the driver must spend more time and effort interacting with the voice assistant to ensure that the task is performed correctly. This not only reduces driver satisfaction but may also distract drivers and increase unsafe driving. Therefore, improving the recognition accuracy of an on-board voice assistant in a noisy environment is important for improving driver experience and safety. The voice command recognition method in a noise environment of the present application is designed to solve this problem: to improve recognition capability in noisy environments, reduce interaction friction between the driver and the voice assistant, and thereby improve driving experience and safety.
Further, the on-board voice assistant is equipped with a microphone or voice sensor for capturing the user's voice input. This process typically occurs inside the vehicle and is triggered by the user speaking a voice command. The voice acquisition device converts the user's voice signal into an electrical signal, which is then processed.
However, various noise sources may be present in the vehicle, including engine noise, road noise while the vehicle is traveling, wind noise, and other mechanical sounds. These noises interfere with the driver's voice commands, complicating the voice signal and even making it unintelligible.
The voice recognition system of the vehicle-mounted voice assistant uses complex algorithms and models to analyze the voice signal and convert it into text. In this process, the system must distinguish recognizable syllables from unrecognizable syllables. Syllables are the basic pronunciation units in speech, typically composed of one or more phones. Syllables are the basis for constructing words and sentences; each word can be decomposed into one or more syllables.
In a Chinese-language scenario, a syllable corresponds to the pronunciation of a specific character. For example, the word "qiche" (car) has two syllables, "qi" and "che", corresponding to the two parts of the pronunciation. This syllable-to-character correspondence is important in speech recognition because it helps recognize the pronunciation of whole words and sentences.
In the context of an on-board voice assistant, understanding and correctly recognizing the pronunciation of individual syllables is critical. It helps the system understand the driver's voice instructions more accurately and perform the correct task. In addition, linking syllables with specific words or contexts helps the system better understand the driver's intent, further improving the performance and user experience of the voice assistant.
Recognizable syllables are syllables that can be accurately recognized and converted to text during speech recognition. These syllables are clearly pronounced and unaffected by noise or other disturbances, so the speech recognition system can convert them to text accurately. During recognition, the system identifies and marks these recognizable syllables to construct an accurate voice command.
In contrast, unrecognizable syllables are syllables that are difficult to accurately recognize or convert to text for various reasons, including noise, unclear pronunciation, or overly fast speech. Unrecognizable syllables are present in the speech signal, but because of their ambiguity or incompleteness, the speech recognition system cannot accurately map them to text. These syllables therefore require additional processing and analysis to determine their likely content.
The speech recognition system typically marks the input speech signal in segments, clearly distinguishing recognizable syllables from unrecognizable syllables and providing the basis for subsequent processing. In general, this process performs a preliminary analysis of the voice commands uttered by the driver in a noisy environment, distinguishes recognizable syllables from unrecognizable ones, and provides key information for the subsequent voice command recognition and execution process. This enables the system to better cope with complex voice inputs and improves recognition accuracy and user experience.
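As a concrete illustration of this segment marking, the sketch below tags per-syllable hypotheses as recognizable or not by a confidence threshold. The Segment structure, the threshold value, and the assumption that the recognizer exposes per-syllable confidences are all hypothetical.

```python
# A minimal sketch of segment marking, assuming the recognizer exposes
# per-syllable confidence scores (real ASR APIs differ).
from dataclasses import dataclass

@dataclass
class Segment:
    text: str          # best syllable hypothesis (may be a fragment)
    confidence: float
    recognizable: bool

def mark_segments(hypotheses, threshold=0.8):
    """Tag each syllable hypothesis as recognizable or unrecognizable."""
    return [Segment(t, c, c >= threshold) for t, c in hypotheses]

segments = mark_segments([("kai", 0.35), ("men", 0.92)])
# -> [Segment(text='kai', confidence=0.35, recognizable=False),
#     Segment(text='men', confidence=0.92, recognizable=True)]
```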
In speech recognition in noisy environments, determining the fuzzy sound information in unrecognizable syllables helps the system better understand the user's voice instructions. In a noisy environment, the voice instructions heard by the voice assistant may contain some unrecognizable syllables or words. Although these portions cannot be identified accurately, they often retain some ambiguous characteristic information. Such syllable or vocabulary fragments that are unrecognizable but retain some features are referred to as fuzzy sound information.
Chinese syllables are characterized by initials, finals, and tones. Initials and finals constitute the basic structure of a syllable, while tone distinguishes word senses. In a noisy environment, the pronunciation of initials, finals, and tones may be disturbed, leaving the speech signal insufficiently clear.
In some cases, the speech recognition system may recognize only part of a syllable, such as only the initial but not the final, or vice versa. In this case, the fuzzy sound information appears as an incomplete pronunciation of part of the syllable. For example, the user says "kaimen", meaning "open the door", but due to pronunciation problems the voice assistant may successfully recognize only "ai men" and fail to recognize the initial. This also yields fuzzy sound information, and without it the system might misrecognize the utterance as other words. Here "men" is a recognizable syllable, and "ai" is fuzzy sound information in an unrecognizable syllable.
Although the fuzzy sound information cannot directly resolve syllable content, it provides valuable linguistic features that offer important clues for judging the possible content of unrecognizable syllables. For example, capturing incomplete initial and final features such as "m" and "eng" suggests that the syllable may be "meng" or a similar syllable such as "ming". This layer of fuzzy sound information helps the speech recognition system correctly resolve the intent of the sentence.
The unrecognizable syllables are analyzed in detail to extract the characteristic information they contain, such as incomplete initial and final fuzzy sound information. Then, the speech information of the recognizable part and the fuzzy sound features extracted from the unrecognizable part are used together, and the alternative syllables that may correspond to the unrecognizable syllables are inferred through a semantic analysis technique.
For example, the user says "kaimen", meaning "open the door", and the speech recognition system, affected by the noisy environment, may not accurately recognize all syllables. In this case, only part of the initial "k" may be recognized, while the final "ai" is hard to recognize due to noise. Thus the initial "k" is determined as fuzzy sound information of the initial-feature type. In this example, "men" is a recognizable syllable and "kai" is an unrecognizable syllable. The system uses the information in the recognizable syllable "men" to help infer the alternative syllables to which the fuzzy sound information "k" may correspond. Based on the recognizable syllable and the extracted fuzzy sound information, the system performs semantic analysis: from the known initial "k" and the semantic information of the recognizable syllable, it may infer that the alternative syllables for "k" include "kai" (open).
Semantic analysis is a high-level natural language processing technique that aims to understand the grammatical and semantic structures in speech in order to accurately grasp the user's intent. In speech recognition, the task of semantic analysis is to analyze the recognizable syllables and the fuzzy sound information and, combining context and semantic rules, determine alternative syllables for the unrecognizable syllables.
Alternative syllables may be generated from known language models and lexicons, producing alternatives that match the recognizable syllables and the fuzzy sound information. Typically more than one alternative syllable is generated, because the fuzzy sound information may point to multiple possible syllables. In this case, the system considers multiple alternative syllables to increase robustness and handle ambiguity.
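A minimal sketch of such lexicon-driven generation is shown below, assuming a toy list of two-syllable words and a fuzzy initial/final fragment; the data, names, and matching rule are illustrative only.

```python
# Sketch of lexicon-driven alternative generation; the word list and the
# prefix-matching rule for fuzzy fragments are assumptions.
LEXICON = [("kai", "men"), ("guan", "men"), ("kai", "deng")]

def alternatives_for(unknown_pos, known_syllable, fuzzy_prefix):
    """Return syllables that co-occur with the known syllable and are
    consistent with the fuzzy initial/final fragment."""
    out = []
    for pair in LEXICON:
        known_ok = pair[1 - unknown_pos] == known_syllable
        fuzzy_ok = pair[unknown_pos].startswith(fuzzy_prefix)
        if known_ok and fuzzy_ok:
            out.append(pair[unknown_pos])
    return out

print(alternatives_for(0, "men", "k"))  # -> ['kai'] ("guan" fails the "k" cue)
```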
In general, determining alternative syllables for unrecognizable syllables based on semantic analysis techniques requires combining acoustic feature analysis, language models, grammar rules, and the like to improve the accuracy and reliability of speech recognition. This process is very important for applications such as vehicle-mounted voice assistants, because it helps ensure that the user's voice instructions are correctly understood, thereby improving driving experience and safety.
After the alternative syllables are obtained, they must also be calibrated against the context scene information corresponding to the voice, and the target syllable that best matches the current scene is determined to replace the originally unrecognizable part. Scene information refers to the context and environmental information during driving that relates to the current voice command or voice interaction; it provides key clues for the voice assistant or speech recognition system to better understand the user's intent.
The scene information may include the current state of the vehicle (e.g., whether it is navigating or playing music), the user's request (e.g., finding the nearest gas station or changing the radio channel), the vehicle's location, speed, and so on. The scene information provides key clues about the user's likely intent and helps select the target syllable more accurately.
When selecting a target syllable based on the scene information, the system typically determines whether each alternative syllable matches the current scene. For example, if the user is in navigation mode, the system tends to select alternative syllables related to navigation. The system also considers the user's requests and intentions: if the user has made an explicit request, the system may prioritize alternative syllables related to that request. The system may further consider contextual information, including the semantics of the preceding and following text, to determine the most appropriate target syllable.
In general, scene information is a collection of context and environmental information gathered to better understand and serve the user. It helps the speech recognition system interpret the user's voice instructions more accurately, provide personalized suggestions, and ensure that the proper operations are performed in a particular environment. This improves the convenience, efficiency, and user satisfaction of voice interaction.
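The following sketch illustrates one way scene information could steer the choice among alternative syllables; the scene keyword sets and the fallback rule are assumptions, not the patent's prescribed method.

```python
# Hedged sketch: scene-guided target syllable selection. The scene
# dictionary and scoring rule are assumptions for illustration.
SCENE_HINTS = {
    "navigation": {"dao", "lu", "kai"},
    "media":      {"ge", "yin", "bo"},
}

def select_target(alternatives, scene):
    hints = SCENE_HINTS.get(scene, set())
    # Prefer alternatives matching the current scene; otherwise fall back
    # to the first (e.g. highest-ranked) alternative.
    return next((a for a in alternatives if a in hints),
                alternatives[0] if alternatives else None)

print(select_target(["kai", "gai"], "navigation"))  # -> 'kai'
```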
Once the target syllable is determined, the next task is to parse the resulting utterance into an executable voice instruction. This includes determining the type, parameters, objects, and manner of execution of the operation. For example, if the target syllables complete the phrase "open the window", the system needs to resolve it into an operation that opens the vehicle's window.
The parsed voice instruction is transmitted to the vehicle's control system or related applications to perform the corresponding operation. This may involve integration with vehicle control, navigation, entertainment systems, and the like. For example, executing the voice instruction may include navigating to a particular location, changing a music playlist, adjusting the in-vehicle temperature, and so on.
In general, determining the voice instruction corresponding to the voice to be recognized according to the target syllable requires the system not only to understand the user's intent but also to convert it into specific actions or tasks in order to provide an efficient, convenient, and satisfying user experience. This requires extensive semantic analysis, operation parsing, and system integration to ensure accurate execution of the voice instruction.
According to the technical solution provided by the embodiments of the present application, a voice to be recognized is collected in a noise environment, and recognizable syllables and unrecognizable syllables in the voice to be recognized are determined based on a voice recognition technique; fuzzy sound information in the unrecognizable syllables is determined; at least one alternative syllable corresponding to the unrecognizable syllables is determined, based on a semantic analysis technique, by using the recognizable syllables and the fuzzy sound information; a target syllable corresponding to the unrecognizable syllables is determined from the alternative syllables according to scene information corresponding to the voice to be recognized; and a voice instruction corresponding to the voice to be recognized is determined according to the target syllable. By integrating voice recognition and semantic analysis techniques, the embodiments of the present application provide an efficient and accurate solution to the problem of voice instruction recognition in noisy environments, thereby improving the performance and user experience of vehicle-mounted voice assistant systems.
In some embodiments, determining the fuzzy sound information in the unrecognizable syllables includes: when an initial feature, a final feature, or a tone feature is included in an unrecognizable syllable, determining the initial feature, the final feature, or the tone feature as the fuzzy sound information.
In particular, in noisy environments, speech recognition systems often cannot recognize an entire syllable with full accuracy; some parts may become blurred or unclear. By determining the fuzzy sound information, the system can better understand partially recognized syllables and improve overall recognition accuracy.
If the fuzzy sound information is not determined, the system may produce inaccurate recognition results, forcing the user to repeat or correct the voice command. This increases the difficulty of interaction and reduces the usability and user experience of the voice assistant.
Undetermined fuzzy sound information may also cause the system to misunderstand the user's instructions and perform erroneous operations or give erroneous answers. By determining the fuzzy sound information, the system can better avoid such misunderstandings and misoperations. Especially in special environments such as driving, the reliability of the voice assistant is critical to the user's safety and convenience; determining the fuzzy sound information lets the system execute the user's instructions more reliably and reduces potential safety risks.
In summary, the fuzzy sound information in unrecognizable syllables is determined in order to improve the performance and reliability of a voice assistant or speech recognition system in noisy environments. This helps reduce interaction difficulty, misunderstandings, and erroneous operations, improves user satisfaction, and ensures reliability in special environments such as driving.
Further, the primary considerations in determining fuzzy sound information in unrecognizable syllables are initial, final, and tone features. These features are important in Chinese because they affect the distinction of word senses: initials, finals, and tones constitute the core recognition elements of syllables and vocabulary. Even if a syllable cannot be completely recognized, retaining these key features provides valuable clues for further judgment of the vocabulary. The fuzzy sound information thus gives the speech recognition system an important reference for correctly resolving the speech intent.
The initial is the beginning of a syllable, usually a consonant, and it determines the onset of the syllable's pronunciation. An initial feature is the information about the initial part that can still be captured in the speech signal when the final part is unrecognizable or lost. For example, the user says "kaimen" ("open the door"). In a noisy environment, the speech recognition system may recognize "k" and the syllable "men", but the final "ai" may be hard to recognize due to noise. In this case, "k" is treated as an initial feature and confirmed as fuzzy sound information.
Final features are the characteristics of the final (vowel) part of a syllable. In a speech signal, a syllable is usually composed of an initial and a final, with the initial at the beginning and the final at the core. Final features describe the articulation of the final, such as mouth opening, tongue position, and lip shape. These characteristics determine the quality and timbre of the final and help distinguish different finals.
Final features are critical to speech recognition and understanding because they contain key information that helps determine vocabulary, grammar, and semantics. In a noisy environment, the initial part may be disturbed so that the system cannot determine the initial; accurately capturing and identifying the final features, and using them as fuzzy sound information, is then very important for improving the recognition accuracy of the voice command.
For example, the user says "kaimen" ("open the door"). In a noisy environment, the speech recognition system may recognize "ai" and the syllable "men", but the initial "k" may be hard to recognize due to noise. In this case, "ai" is identified as a final feature and used as fuzzy sound information.
Tone features refer to the recognizable elements of a syllable's tone contour. Mandarin tones are mainly divided into the high-level (first) tone, rising (second) tone, falling-rising (third) tone, and falling (fourth) tone. The tone contour of a syllable in speech matches a particular type of tone feature; if the contour exhibits a high, level shape, the syllable is likely a first-tone syllable.
If the speech recognition system correctly captures the features of a syllable's tone contour, they provide a powerful clue for judging lexical meaning. Like a fingerprint or facial feature, tone features help reveal lexical information. Even if a syllable cannot be completely recognized in a noisy environment, capturing its tone features provides strong support for vocabulary analysis.
Therefore, directly extracting elements such as initial features, final features, and tone features as fuzzy sound information effectively captures the core characteristics of unrecognizable syllables. This provides a basis for predicting and correcting recognition errors in subsequent words and is key to improving speech recognition quality in noisy environments.
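By way of example, the sketch below splits a partially heard pinyin fragment into initial, final, and tone features of the kind described above. The abbreviated initial list and the digit-suffix tone convention (e.g. "ai1") are assumptions for illustration.

```python
# Sketch of extracting initial/final/tone features from a pinyin fragment.
# The initial list is abbreviated and the tone is assumed to arrive as a
# trailing digit; both are illustrative conventions.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n",
            "l", "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

def fuzzy_features(fragment):
    tone = fragment[-1] if fragment and fragment[-1].isdigit() else None
    core = fragment[:-1] if tone else fragment
    initial = next((i for i in INITIALS if core.startswith(i)), None)
    final = core[len(initial):] if initial else core
    return {"initial": initial, "final": final or None, "tone": tone}

print(fuzzy_features("k"))    # -> {'initial': 'k', 'final': None, 'tone': None}
print(fuzzy_features("ai1"))  # -> {'initial': None, 'final': 'ai', 'tone': '1'}
```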
In some embodiments, the method further includes: determining associated features of the initial features, final features, or tone features in the unrecognizable syllables; and determining the fuzzy sound information in the unrecognizable syllables includes: determining the associated features as fuzzy sound information.
In particular, in speech recognition, and especially in noisy environments, determining the fuzzy sound information in unrecognizable syllables may involve capturing associated features. Associated features are other audio characteristics related to an initial, final, or tone feature. For example, the initials "b" and "p" sound very similar; the main difference is aspiration ("p" is aspirated, "b" is not). In a noisy environment, this small difference may cause the speech recognition system to confuse them.
Thus, the associated features may be further blurred to account for these small acoustic differences. For example, if the system determines that the initial feature is "b", it may, considering noise interference, also take "p" as part of the fuzzy sound information, because the two initials are very close in actual pronunciation. This improves the system's tolerance for fuzzy information and makes it better suited to voice commands in noisy environments.
Blurring easily confused, similarly pronounced initials together as a set increases the probability of subsequently recognizing the vocabulary correctly and improves the fuzzy sound processing effect, as sketched below.
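A minimal sketch of this collective blurring, using the b/p example above plus two analogous pairs assumed here:

```python
# Sketch of "associated feature" expansion: easily confused initials are
# blurred into one candidate set. The confusion pairs beyond b/p are
# assumptions for illustration.
CONFUSABLE = {
    "b": {"b", "p"}, "p": {"b", "p"},
    "d": {"d", "t"}, "t": {"d", "t"},
    "g": {"g", "k"}, "k": {"g", "k"},
}

def expand_initial(initial):
    """Return the initial plus its acoustically similar neighbours."""
    return CONFUSABLE.get(initial, {initial})

print(expand_initial("b"))  # -> {'b', 'p'}
```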
Further, when analyzing associated features, the system needs to determine which features are ambiguous or uncertain. These ambiguous features may include characteristics associated with initials, finals, or tones that are affected by noise, speech variation, or unclear pronunciation. The system flags these ambiguous features as potential fuzzy sound information.
After the ambiguous features are determined, the system integrates them into the recognition of the voice command. This means the system takes the uncertainty of these features into account and includes them as possible candidates in the recognition process.
Determining the associated features as fuzzy sound information helps improve the robustness of the system. Even if certain features are difficult to recognize accurately in a noisy environment, the system can still consider their potential value to better understand the user's voice instructions. This helps cope with complex speech input situations.
In summary, determining associated features as fuzzy sound information is a key step for speech recognition systems to improve performance and robustness in noisy environments. By integrating acoustic and linguistic information, the system can better understand and process voice input, providing better user experience and recognition accuracy.
In some embodiments, the recognizable syllables and the unrecognizable syllables are consecutive syllables, and determining, based on the semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables by using the recognizable syllables and the fuzzy sound information includes: determining a predicted vocabulary according to the recognizable syllables and the fuzzy sound information based on the semantic analysis technique, wherein a first syllable of the predicted vocabulary matches the recognizable syllable and a second syllable of the predicted vocabulary contains the fuzzy sound information; and determining the second syllable of the predicted vocabulary as the alternative syllable.
Specifically, after the recognizable syllables and the fuzzy sound information in the unrecognizable syllables are obtained, if the two parts are consecutive syllables, the alternative syllables for the unrecognizable syllables can be determined by predicting vocabulary. In the case of consecutive syllables, the system first needs to generate a set of predicted words from the recognizable syllables and the fuzzy sound information.
A predicted vocabulary is a set of words, estimated through semantic analysis, in which the unrecognizable syllables are likely to be located, given the recognized speech information. Generating it relies on the semantic information provided by the recognition result, assisted by the features of the unrecognizable part, and uses semantic analysis to infer semantically reasonable words that contain the unrecognizable syllables. Compared with random matching, this approach greatly narrows the range of possible values for the unrecognizable syllables.
For example, suppose the syllables "bei jing da" have been recognized, but the following syllable is unclear. The word can then be predicted to be "Beijing Daxue" (Peking University) based on the recognized syllables and some fuzzy sound information in the unrecognizable syllable.
Further, the system compares the first syllable of each predicted word with the recognizable syllable to check for a match. If the first syllable matches, the system then considers the fuzzy sound information, which is the part of the unrecognizable syllable taken into account when generating alternative syllables. The system compares the second syllable of the predicted word with the fuzzy sound information to check whether they are consistent.
For example, the first syllable of the recognition result is "jiang", but the second syllable is unrecognizable, and only the fuzzy sound information "s" is extracted. The word can then be predicted to be "Jiangsu": its first syllable "jiang" is consistent with the recognizable syllable, and its second syllable "su" contains the fuzzy sound information "s". Thus "su" can be derived as an alternative syllable for the unrecognizable syllable.
If a word is found in the predicted vocabulary whose first syllable matches the recognizable syllable and whose second syllable matches the fuzzy sound information, that word's second syllable is determined as an alternative syllable. Alternative syllables are possible candidates that the system can consider as replacements for the unrecognizable syllables.
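The "jiang" plus fuzzy "s" example can be expressed as a small matching routine, sketched below; the toy vocabulary and pinyin spellings are illustrative assumptions.

```python
# Sketch of predicted-vocabulary matching for the "jiang" + fuzzy "s" case.
# The two-syllable word list is an assumption for illustration.
VOCAB = [("jiang", "su"), ("jiang", "xi"), ("shan", "xi")]

def match_predictions(first_syllable, fuzzy_info):
    """Keep second syllables of words whose first syllable equals the
    recognizable syllable and whose second syllable contains the fuzzy
    sound information."""
    return [second for first, second in VOCAB
            if first == first_syllable and fuzzy_info in second]

print(match_predictions("jiang", "s"))  # -> ['su'] (Jiangsu)
```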
After the alternative syllables are determined, the system uses them in semantic analysis to understand the meaning of the voice command. Semantic analysis considers combinations of the alternative syllables with the other recognizable syllables to determine the final voice instruction. This may involve determining the keywords or phrases of the instruction and combining them with context information to ensure that the correct instruction is obtained and executed.
In summary, determining alternative syllables for unrecognizable syllables based on semantic analysis techniques requires combining the acoustic properties of the speech, grammar rules, and context information. This process helps the system better understand voice commands in noisy environments and improves recognition accuracy and robustness.
In some embodiments, the method further includes: determining, from the speech to be recognized, preceding semantic information that comes before the recognizable and unrecognizable syllables and following semantic information that comes after them; and determining the scene information according to the preceding semantic information and the following semantic information.
Specifically, after the recognizable syllables and unrecognizable syllables of the voice to be recognized are obtained, the semantic information before and after these syllables can be further analyzed to determine the recognition scene. Scene information here refers to information about the current driving scene or context in the speech recognition system of the on-board voice assistant.
In summary, determining the scene information helps the system better understand and respond to the driver's voice instructions and provide personalized, accurate, and convenient services while taking the driving environment and user needs into account. This improves the performance of the voice assistant, enhances the user experience, and provides a higher level of safety during driving.
The semantic content before and after the unrecognizable syllables is determined from the voice instruction: the semantic information of the speech before the unrecognizable syllables is extracted as the preceding semantic information; likewise, the semantic information of the speech after the unrecognizable syllables is analyzed as the following semantic information.
The preceding semantic information is the semantic information about the subject, task, or command contained in the voice interaction or dialog before the point of interest in the user's current voice instruction. It may include previous user requests, system responses, and important content of any contextual dialog. This contextual information helps determine the context of the current voice instruction and the user's intent. For example, if the user previously asked about the weather forecast, the location information contained in the current voice instruction may relate to that earlier query, which constitutes the preceding semantic information.
The following semantic information is the semantic information that appears after the point of interest in the user's current voice instruction. It may include subsequent questions, requests, or operations raised by the user, and it plays the same predictive role for the current instruction. For example, if the following semantic information includes "gas station" or "park", the undetermined part of the voice command may be a query for the location of the nearest service area.
Based on the preceding and following semantic information, the system can determine the current scene information. Scene information is a high-level description of the current dialog or task that includes the user's intent, the task to be performed, and possibly related context. It helps the system better understand the user's needs and match voice instructions to specific tasks or services.
By acquiring context semantics in this way, speech recognition can better simulate how a person understands language: recognition is performed with context available, so the result is more accurate. Determining the scene using the preceding and following semantics is therefore an important technical means of improving the speech recognition effect.
By using the preceding semantic information, the following semantic information, and the scene information, the voice assistant can understand the user's voice instructions more fully, provide more accurate responses, and maintain consistency across different interactions. This improves the user experience and makes the voice assistant more intelligent and useful.
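One simple way to realize this, sketched under strong assumptions (hand-picked keyword sets, whitespace tokenization, English keywords standing in for the extracted semantics), is to score scenes by keyword overlap between the preceding and following text:

```python
# Minimal sketch of inferring scene information from preceding and
# following semantics via keyword overlap; keyword sets are assumptions.
SCENES = {
    "navigation": {"route", "station", "park", "navigate"},
    "weather":    {"forecast", "temperature", "rain"},
}

def infer_scene(pre_text, post_text):
    words = set((pre_text + " " + post_text).lower().split())
    scores = {scene: len(words & kws) for scene, kws in SCENES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(infer_scene("find the nearest", "station please"))  # -> 'navigation'
```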
In some embodiments, the method further includes: determining historical semantic information from historical voice information preceding the voice to be recognized; and determining the scene information according to the historical semantic information.
Specifically, in addition to analyzing the preceding and following semantics of the speech to be recognized, semantics may also be extracted from earlier historical voice information to determine the scene of the current speech. Historical voice information refers to the series of voice inputs prior to the current speech to be recognized and may include prior voice instructions, interactive dialogs, or other voice signals. Such voice information may be recorded and stored for later analysis and reference.
The historical voice information includes previous voice instructions, interactive dialogs, and voice signals that record earlier communication between the driver and the voice assistant. This information is time-ordered and can be used to infer the driver's current needs and context. Its importance lies in providing context that helps the voice assistant better understand the current voice instruction.
Historical voice information is typically stored in a database or cache of the system as a series of voice signals or text, such as previous voice instructions, query requests, and navigation instructions. The system records and updates the historical voice information periodically or continuously so that previous interactions can be traced.
Historical voice information not only provides context for the user's current voice instructions but can also reveal the user's long-term intent and preferences. By analyzing this long-term historical data, the system can better understand user needs, such as the places they often query or the types of restaurants they like. This helps personalize the service and improves user satisfaction.
By analyzing the historical voice information, the system can determine the current scene and the user's intent. For example, if the historical voice information shows that the user has been querying nearby restaurants, the current scene is likely related to restaurants the user may like. Such scene information helps interpret the current speech to be recognized more accurately.
By acquiring historical semantics in this way, the user's linguistic context can be grasped more fully, and the scene information can be analyzed from a longer semantic chain rather than being limited to a single sentence. Using the semantic content of historical voice therefore makes scene judgment more comprehensive and accurate, which is very helpful for resolving ambiguity in recognized speech and improving speech recognition quality.
In summary, determining historical semantic information is one of the key steps for an on-board voice assistant to provide high-quality service in a complex driving environment. By fully using the historical voice information, the system can better determine the scene information for the voice to be recognized, better understand and meet the driver's needs, and improve the efficiency and convenience of interaction.
In some embodiments, determining the voice instruction corresponding to the voice to be recognized according to the target syllable includes: replacing the unrecognizable syllables in the voice to be recognized with the target syllable to determine a target voice; determining target semantics of the target voice; and determining the voice instruction by using the target semantics.
Specifically, after the final target syllable corresponding to the unrecognizable syllable is obtained, the final voice instruction must still be determined based on it. Using the determined target syllable, the system locates the position of the unrecognizable syllable in the voice to be recognized and performs a replacement, substituting the target syllable for the unrecognizable syllable. This replacement relies on the previously determined alternative syllables and the scene information to ensure its accuracy.
Once the replacement is complete, the system obtains a target speech segment containing the target syllable. This target speech segment is generated from the user's original voice instruction by replacing its unrecognizable syllables with the selected alternatives. It is a corrected and optimized speech segment that represents the user's intent more accurately.
Although the target speech already contains the target syllable, further semantic analysis is required to understand the user's real intent. This involves matching the target speech against the system's semantic model to determine the user's needs, instructions, or questions.
For example, the original voice command is "please help me navigate to Wangfujing", where part of "Wangfujing" is an unrecognizable syllable, and the processing determines the missing syllable as the target syllable. The unrecognizable syllable is then replaced, yielding the speech "please help me navigate to Wangfujing". Semantic analysis then judges the meaning of the replaced speech, i.e., the target speech; in this example the target speech expresses a navigation query. Finally, the specific type and content of the voice instruction are determined from the semantic analysis result: here, an instruction of the type "query navigation route" with the parameter "Wangfujing".
By replacing syllables and judging the semantics, the voice instruction can be fully determined once the problem of unrecognizable syllables is solved, completing the conversion from speech to instruction, which is the ultimate purpose of speech recognition.
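As a closing illustration, the sketch below splices a target syllable into the utterance and maps the repaired text to a structured command; the intent cue "daohang" (navigate) and the command schema are assumptions introduced here.

```python
# Sketch of the final step: splice the target syllable into the utterance,
# then map the repaired text to a structured command. The intent keyword
# and command schema are illustrative assumptions.
def build_command(syllables, unknown_index, target_syllable):
    syllables = list(syllables)
    syllables[unknown_index] = target_syllable     # replace the gap
    text = "".join(syllables)
    if text.startswith("daohang"):                 # "navigate" cue
        return {"type": "navigation", "destination": text[len("daohang"):]}
    return {"type": "unknown", "text": text}

cmd = build_command(["daohang", "?"], 1, "wangfujing")
print(cmd)  # -> {'type': 'navigation', 'destination': 'wangfujing'}
```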
Any combination of the above optional solutions may be adopted to form optional embodiments of the present application, and details are not described herein again.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 2 is a schematic diagram of a voice command recognition device in a noise environment according to an embodiment of the present application. As shown in fig. 2, the voice command recognition apparatus in the noise environment includes:
a syllable determination module 201 configured to collect a voice to be recognized in a noisy environment and determine recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique;
a fuzzy sound information determination module 202 configured to determine fuzzy sound information in the unrecognizable syllables;
an alternative syllable determination module 203 configured to determine, based on a semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables by using the recognizable syllables and the fuzzy sound information;
a target syllable determination module 204 configured to determine, according to scene information corresponding to the voice to be recognized, a target syllable corresponding to the unrecognizable syllables from the alternative syllables;
and a voice instruction determination module 205 configured to determine a voice instruction corresponding to the voice to be recognized according to the target syllable.
In some embodiments, the fuzzy sound information determination module 202 in fig. 2 determines an initial feature, a final feature, or a tone feature as the fuzzy sound information when the initial feature, the final feature, or the tone feature is included in an unrecognizable syllable.
In some embodiments, the fuzzy sound information determination module 202 in fig. 2 determines the associated features as the fuzzy sound information.
In some embodiments, the alternative syllable determination module 203 in fig. 2 determines the predicted vocabulary from the recognizable syllables and the fuzzy sound information based on a semantic analysis technique, wherein a first syllable of the predicted vocabulary matches the recognizable syllable and a second syllable of the predicted vocabulary contains the fuzzy sound information, and determines the second syllable of the predicted vocabulary as the alternative syllable.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and should not limit the implementation of the embodiments of the present application in any way.
Fig. 3 is a schematic diagram of an electronic device 3 provided in an embodiment of the present application. As shown in fig. 3, the electronic apparatus 3 of this embodiment includes: a processor 301, a memory 302 and a computer program 303 stored in the memory 302 and executable on the processor 301. The steps of the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Alternatively, the processor 301, when executing the computer program 303, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and is not limiting of the electronic device 3 and may include more or fewer components than shown, or different components.
The processor 301 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 3. The memory 302 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 302 is used to store the computer program and other programs and data required by the electronic device 3.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated. In practical applications, the functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods in the above embodiments by means of a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of the respective method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for recognizing a voice instruction in a noise environment, comprising:
collecting a voice to be recognized in the noise environment, and determining recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique;
determining fuzzy sound information in the unrecognizable syllables;
determining at least one alternative syllable corresponding to the unrecognizable syllables by utilizing the recognizable syllables and the fuzzy sound information based on a semantic analysis technique;
determining, from the alternative syllables, target syllables corresponding to the unrecognizable syllables according to scene information corresponding to the voice to be recognized;
and determining a voice instruction corresponding to the voice to be recognized according to the target syllables.
2. The method of claim 1, wherein the determining fuzzy sound information in the unrecognizable syllables comprises:
when the unrecognizable syllable includes an initial feature, a final feature, or a tone feature, determining the initial feature, the final feature, or the tone feature as the fuzzy sound information.
3. The method of claim 2, further comprising: determining an associated feature of the initial feature, the final feature, or the tone feature in the unrecognizable syllable; wherein the determining fuzzy sound information in the unrecognizable syllables further comprises:
determining the associated feature as the fuzzy sound information.
4. The method of claim 1, wherein the recognizable syllables and the unrecognizable syllables are consecutive syllables, and wherein the determining at least one alternative syllable corresponding to the unrecognizable syllables by utilizing the recognizable syllables and the fuzzy sound information based on a semantic analysis technique comprises:
determining a predicted vocabulary according to the recognizable syllables and the fuzzy sound information based on the semantic analysis technique, wherein a first syllable of the predicted vocabulary matches the recognizable syllables and a second syllable of the predicted vocabulary comprises the fuzzy sound information;
and determining the second syllable of the predicted vocabulary as the alternative syllable.
5. The method according to any one of claims 1 to 4, further comprising:
determining, from the voice to be recognized, preceding semantic information of the recognizable syllables and the unrecognizable syllables, and following semantic information of the recognizable syllables and the unrecognizable syllables;
and determining the scene information according to the preceding semantic information and the following semantic information.
6. The method according to any one of claims 1 to 4, further comprising:
determining historical semantic information from historical voice information preceding the voice to be recognized;
and determining the scene information according to the historical semantic information.
7. The method of any one of claims 1 to 4, wherein the determining a voice instruction corresponding to the voice to be recognized according to the target syllables comprises:
replacing the unrecognizable syllables in the voice to be recognized with the target syllables to determine a target voice;
determining target semantics of the target voice; and determining the voice instruction by utilizing the target semantics.
8. A voice instruction recognition device in a noise environment, comprising:
a syllable determination module configured to collect a voice to be recognized in a noise environment and determine recognizable syllables and unrecognizable syllables in the voice to be recognized based on a voice recognition technique;
a fuzzy sound information determination module configured to determine fuzzy sound information in the unrecognizable syllables;
an alternative syllable determination module configured to determine, based on a semantic analysis technique, at least one alternative syllable corresponding to the unrecognizable syllables using the recognizable syllables and the fuzzy sound information;
a target syllable determination module configured to determine, from the alternative syllables, target syllables corresponding to the unrecognizable syllables according to scene information corresponding to the voice to be recognized;
and a voice instruction determination module configured to determine a voice instruction corresponding to the voice to be recognized according to the target syllables.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
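To make the claimed flow easier to follow, the following non-limiting Python sketch walks through the remaining steps: selecting a target syllable according to scene information (claims 1, 5, and 6) and splicing it over the unrecognizable gap to form the instruction (claim 7). The scene keyword table and the scoring rule are assumptions made for this example only, not part of the claims.

# Hypothetical scene-affinity weights for candidate syllables.
SCENE_KEYWORDS = {
    "driving": {"hang": 2.0, "zhuan": 1.0},  # navigation-flavoured syllables
    "media":   {"fang": 2.0},                # playback-flavoured syllables
}

def select_target_syllable(alternatives, scene):
    """Rank the alternative syllables by their affinity to the scene
    information and return the best-scoring one."""
    weights = SCENE_KEYWORDS.get(scene, {})
    return max(alternatives, key=lambda s: weights.get(s, 0.0))

def build_instruction(recognized_syllables, target):
    """Replace the unrecognizable gap (None) with the target syllable to
    obtain the target voice text, from which the instruction is derived."""
    return "".join(s if s is not None else target for s in recognized_syllables)

# "dao ?" was heard in a driving scene; candidates came from the earlier step.
target = select_target_syllable(["hang", "zhuan"], "driving")
print(build_instruction(["dao", None], target))  # -> 'daohang'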


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination