CN116798421A - Multi-voice-zone voice recognition method, device, vehicle and storage medium - Google Patents

Multi-voice-zone voice recognition method, device, vehicle and storage medium

Info

Publication number
CN116798421A
CN116798421A
Authority
CN
China
Prior art keywords
voice
zone
control instruction
real
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210258683.5A
Other languages
Chinese (zh)
Inventor
池军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rockwell Technology Co Ltd
Original Assignee
Beijing Rockwell Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rockwell Technology Co Ltd filed Critical Beijing Rockwell Technology Co Ltd
Priority to CN202210258683.5A priority Critical patent/CN116798421A/en
Publication of CN116798421A publication Critical patent/CN116798421A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 — Execution procedure of a spoken command
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 — Voice signal separating

Abstract

The present disclosure relates to a multi-voice-zone voice recognition method, apparatus, vehicle, and storage medium. According to the embodiments of the disclosure, after real-time interactive voice uttered by a user in the cabin of a target vehicle is acquired, a first source voice zone corresponding to the real-time interactive voice can be determined. If the first source voice zone is a target voice zone, voice recognition is performed on the real-time interactive voice to obtain a real-time control instruction corresponding to it.

Description

Multi-voice-zone voice recognition method, device, vehicle and storage medium
Technical Field
The disclosure relates to the field of information technology, and in particular relates to a multi-voice-zone voice recognition method, a multi-voice-zone voice recognition device, a vehicle and a storage medium.
Background
In-vehicle voice assistants are becoming increasingly popular: through voice interaction, a user can use a vehicle-mounted voice assistant for navigation, audio playback, vehicle control, and other functions.
To meet user needs, vehicle-mounted voice assistants can already perform directional voice interaction with a user in the vehicle cabin, so as to avoid interference from sounds at other positions in the cabin. However, during directional voice interaction between the vehicle-mounted voice assistant and the user, if the user moves to another position and speaks, the assistant cannot continue the voice interaction, which degrades the user experience.
Disclosure of Invention
In order to solve the technical problems, the present disclosure provides a multi-voice zone voice recognition method, a multi-voice zone voice recognition device, a vehicle and a storage medium.
A first aspect of an embodiment of the present disclosure provides a multi-voice zone speech recognition method, including:
acquiring real-time interactive voice sent by a user in a cabin of a target vehicle;
determining a first source voice zone corresponding to the real-time interactive voice;
And under the condition that the first source voice zone is a target voice zone, performing voice recognition on the real-time interactive voice to obtain a real-time control instruction corresponding to the real-time interactive voice, wherein the target voice zone comprises a second source voice zone corresponding to the wake-up voice sent by the user and adjacent voice zones of the second source voice zone.
A second aspect of an embodiment of the present disclosure provides a multi-voice-zone voice recognition apparatus, the apparatus including:
the first acquisition module is used for acquiring real-time interactive voice sent by a user in a cabin of the target vehicle;
the first determining module is used for determining a first source voice zone corresponding to the real-time interactive voice;
the recognition module is used for carrying out voice recognition on the real-time interactive voice under the condition that the first source voice zone is the target voice zone to obtain a real-time control instruction corresponding to the real-time interactive voice, wherein the target voice zone comprises a second source voice zone corresponding to the wake-up voice sent by the user and adjacent voice zones of the second source voice zone.
A third aspect of the disclosed embodiments provides a vehicle comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, implements the multi-zone speech recognition method of the first aspect.
A fourth aspect of the disclosed embodiments provides a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the multi-voice zone speech recognition method of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
in the embodiments of the disclosure, after the real-time interactive voice uttered by the user in the cabin of the target vehicle is acquired, the first source voice zone corresponding to the real-time interactive voice can be determined. If the first source voice zone is the target voice zone, voice recognition is performed on the real-time interactive voice to obtain the corresponding real-time control instruction. Because the target voice zone includes both the second source voice zone corresponding to the user's wake-up voice and the adjacent voice zones of the second source voice zone, the user can carry out voice interaction both in the second source voice zone and in its adjacent voice zones. Thus, during directional voice interaction between the vehicle-mounted voice assistant and the user, even if the user moves from the voice zone in which the assistant was woken up to an adjacent voice zone, voice interaction can proceed normally, improving the user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a vehicle cabin of a multi-zone speech recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a multi-voice zone speech recognition method provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of another multi-zone speech recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a flowchart of a target audio zone verification method provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a multi-voice zone speech recognition device according to an embodiment of the present disclosure;
fig. 6 is a schematic structural view of a vehicle according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between them. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises it.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The multi-voice-zone voice recognition method provided by the embodiments of the present disclosure can be applied to a vehicle cabin. So that all occupants in the cabin can use voice services, one or more microphones may be arranged in the vehicle cabin. The sound-receiving range of one microphone is called a voice zone, and one voice zone may correspond to one or more seat areas. The microphones divide the in-vehicle space into a plurality of independent voice zones according to seat position and collect voice signals independently for each voice zone, with enough isolation between zones. The collected voice signals can be transmitted to a multi-voice-zone voice recognition device, which processes the voice signals of each voice zone to realize voice recognition and voice interaction; the recognition result can be transmitted to the vehicle screen, which displays it. The microphone assembly mainly comprises a plurality of digital microphones, a digital signal processing (Digital Signal Processing, DSP) chip and an automotive audio bus (Automotive Audio Bus, A2B) chip.
For example, fig. 1 is a schematic diagram of a vehicle cabin for the multi-voice-zone speech recognition method according to an embodiment of the present disclosure, in which the cabin space 100 is divided into six voice zones from head to tail: 101 is the main driving voice zone, covering the main driving seat area; 102 is the front passenger voice zone, covering the front passenger seat area; 103 is the second-row left voice zone, covering the second-row left seat area; 104 is the second-row right voice zone, covering the second-row right seat area; 105 is the third-row left voice zone, covering the third-row left seat area; and 106 is the third-row right voice zone, covering the third-row right seat area. Each voice zone is provided with a microphone matching its sound-receiving range, so the six voice zones can pick up sound independently and automatically shield sound sources from the other zones. For example, when a person is making a phone call in one voice zone, voice recognition still works normally in the other voice zones without interference. 107 is the car-machine screen of the center console, on which the voice recognition process can be visualized. Fig. 1 is merely an exemplary illustration of a vehicle cabin for the multi-voice-zone speech recognition method, and is not the only possible layout.
In order to meet the requirements of users, the vehicle-mounted voice assistant can realize directional voice interaction of the users in the vehicle cabin so as to avoid interference caused by sounds at other positions in the vehicle cabin. However, in the process of directional voice interaction between the vehicle-mounted voice assistant and the user, if the user moves to other positions to make a sound, the vehicle-mounted voice assistant cannot continue to perform voice interaction with the user, so that the user experience is reduced.
Aiming at the defect in the related art that voice interaction cannot continue after the user changes position, the embodiments of the present disclosure provide a multi-voice-zone voice recognition method, apparatus, vehicle and storage medium, with which, during directional voice interaction, the user can still interact with the vehicle-mounted voice assistant normally even after moving from the voice zone in which the assistant was woken up to an adjacent voice zone, improving the user experience.
In order to better understand the inventive concepts of the embodiments of the present disclosure, the technical solutions of the embodiments of the present disclosure are described below in conjunction with exemplary embodiments.
Fig. 2 is a flowchart of a multi-zone speech recognition method provided by an embodiment of the present disclosure, which may be performed by a multi-zone speech recognition device disposed in a target vehicle. As shown in fig. 2, the multi-voice zone speech recognition method provided in this embodiment includes the following steps:
step 201, acquiring real-time interactive voice emitted by a user in a cabin of a target vehicle.
The real-time interactive voice in the embodiment of the present disclosure may be understood as the interactive voice sent by the user at the current moment.
In the embodiment of the disclosure, when a user in a cabin of a target vehicle wants to perform voice interaction with a vehicle-mounted voice assistant in the target vehicle at a current moment, the user can send out voice, and the multi-voice-zone voice recognition device can acquire the voice sent out by the user.
Optionally, the microphone of the target vehicle collects real-time interactive voice sent by the user in the cabin, and sends the real-time interactive voice to the multi-voice-zone voice recognition device of the target vehicle, so that the multi-voice-zone voice recognition device can acquire the voice sent by the user.
Step 202, determining a first source voice zone corresponding to the real-time interactive voice.
In the embodiment of the disclosure, after receiving the real-time interactive voice sent by the user, the multi-voice-zone voice recognition device may determine a voice zone corresponding to the real-time interactive voice, and determine the voice zone corresponding to the real-time interactive voice as the first source voice zone.
The first source voice zone in the embodiments of the present disclosure may be understood as the voice zone in which the user who uttered the real-time interactive voice is located. Since each voice zone corresponds to the sound-receiving range of one microphone, i.e. voice zones correspond one-to-one with microphones, the first source voice zone can be determined from the voice zone of the microphone that received the user's voice.
Specifically, the multi-voice-zone voice recognition device may first determine the microphone that collected the user's voice and take it as the target microphone, then determine the voice zone to which the target microphone belongs according to its sound-receiving range, and determine that voice zone as the first source voice zone.
For example, in fig. 1, when the real-time interactive voice is uttered from the main driving seat area of the vehicle cabin, the multi-voice-zone voice recognition device determines the main driving voice zone, in which the main driving seat area is located, as the voice zone of the real-time interactive voice, and determines the main driving voice zone as the first source voice zone. This is merely an exemplary illustration of determining the first source voice zone, and not the only one.
Each voice zone may have a unique voice zone identifier, and the multi-voice-zone voice recognition device may prestore the voice zone identifier corresponding to the sound-receiving range of each microphone. After the device determines the target microphone, it can obtain the prestored voice zone identifier corresponding to the target microphone and take the voice zone corresponding to that identifier as the first source voice zone.
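As a minimal sketch of this lookup, the prestored microphone-to-zone mapping might look as follows. All microphone and zone identifiers here are hypothetical illustrations, not names from the patent:

```python
# Hypothetical mapping from microphone ID to voice zone identifier,
# mirroring the six-zone cabin of Fig. 1. All names are illustrative.
MIC_TO_ZONE = {
    "mic_0": "main_driving",     # 101: main driving voice zone
    "mic_1": "front_passenger",  # 102: front passenger voice zone
    "mic_2": "row2_left",        # 103: second-row left voice zone
    "mic_3": "row2_right",       # 104: second-row right voice zone
    "mic_4": "row3_left",        # 105: third-row left voice zone
    "mic_5": "row3_right",       # 106: third-row right voice zone
}

def first_source_zone(target_mic: str) -> str:
    """Return the voice zone identifier of the microphone that picked up the voice."""
    return MIC_TO_ZONE[target_mic]
```

Because the mapping is one-to-one, determining the target microphone immediately determines the first source voice zone.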
In step 203, under the condition that the first source voice zone is the target voice zone, performing voice recognition on the real-time interactive voice to obtain a real-time control instruction corresponding to the real-time interactive voice, where the target voice zone includes a second source voice zone corresponding to the wake-up voice sent by the user and an adjacent voice zone of the second source voice zone.
The wake-up voice in the embodiments of the present disclosure may be understood as a wake-up phrase for the voice recognition function module in the multi-voice-zone voice recognition device, where each target vehicle corresponds to a specific wake-up phrase. After receiving the user's utterance, the multi-voice-zone voice recognition device determines whether it is the specific wake-up phrase of the target vehicle; if so, the voice recognition function module is started, and if not, it is not started. After the voice recognition function module is started, the user can perform voice interaction with the target vehicle. If the target vehicle does not receive the user's wake-up voice, the voice recognition function module cannot be started, and voice interaction with the user cannot proceed.
The second source voice zone in the embodiments of the present disclosure may be understood as the voice zone corresponding to the wake-up voice uttered by the user, and the adjacent voice zones of the second source voice zone may include the voice zones located in the four directions of front, rear, left and right of the second source voice zone. For example, in fig. 1, if the user utters the wake-up voice to the target vehicle in the second-row left seat area, the corresponding second-row left voice zone is the second source voice zone, and its adjacent voice zones include the main driving voice zone, the second-row right voice zone and the third-row left voice zone. This is merely an exemplary illustration of the second source voice zone and its adjacent voice zones, and not the only one.
In the embodiment of the disclosure, when the first source voice zone is the target voice zone, the multi-voice zone voice recognition device may perform voice recognition on the real-time interactive voice to obtain a real-time control instruction corresponding to the real-time interactive voice.
Specifically, the multi-voice-zone voice recognition device can perform voice recognition on the real-time interactive voice to obtain a voice text corresponding to the real-time interactive voice, and then perform semantic recognition on the voice text to obtain a real-time control instruction corresponding to the real-time interactive voice.
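A toy stand-in for the semantic-recognition step can illustrate how a voice text yields an operation mode and an operation object. A real system would use an NLU model; the verb list and splitting rule below are assumptions made purely for illustration:

```python
# Toy semantic parse: split recognized text into (operation mode, operation object).
# The known-operation list is a made-up stand-in for a real NLU component.
KNOWN_OPERATIONS = ("play", "pause", "open", "close", "navigate to")

def parse_control_instruction(text: str):
    """Return (operation, obj); either field is None when missing."""
    text = text.strip().lower()
    for op in KNOWN_OPERATIONS:
        if text == op or text.startswith(op + " "):
            obj = text[len(op):].strip() or None
            return op, obj
    # No recognized operation mode: only an operation object (if any) remains.
    return None, (text or None)
```

For instance, "play hot-list music" parses to the operation mode "play" and the operation object "hot-list music", while "play" alone yields no operation object.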
In some embodiments of the present disclosure, after determining a first source voice zone corresponding to a real-time interactive voice, the multi-voice zone voice recognition device of the target vehicle may further determine whether the first source voice zone belongs to a second source voice zone or an adjacent voice zone of the second source voice zone, and if the first source voice zone is the second source voice zone or the adjacent voice zone of the second source voice zone, perform voice recognition on the real-time interactive voice to obtain a real-time control instruction corresponding to the real-time interactive voice, so as to perform voice interaction with a user; if the first source voice zone does not belong to the second source voice zone or the adjacent voice zone of the second source voice zone, the real-time interactive voice is not subjected to voice recognition.
Because each voice zone may have a unique voice zone identifier, determining whether the first source voice zone belongs to the second source voice zone or its adjacent voice zones can be done by comparing identifiers: if the identifier of the first source voice zone is the same as that of the second source voice zone, or the same as that of an adjacent voice zone of the second source voice zone, the first source voice zone is determined to be the target voice zone; otherwise it is determined not to be the target voice zone.
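Under the assumption of the six-zone layout of Fig. 1, with hypothetical zone identifiers, the target-zone check described above can be sketched as:

```python
# Front/rear/left/right adjacency for the six zones of Fig. 1.
# Zone identifiers are illustrative; the patent only requires that
# each zone's neighbours be known in advance.
ADJACENT = {
    "main_driving":    {"front_passenger", "row2_left"},
    "front_passenger": {"main_driving", "row2_right"},
    "row2_left":       {"main_driving", "row2_right", "row3_left"},
    "row2_right":      {"front_passenger", "row2_left", "row3_right"},
    "row3_left":       {"row2_left", "row3_right"},
    "row3_right":      {"row2_right", "row3_left"},
}

def is_target_zone(first_source: str, second_source: str) -> bool:
    """True when the real-time voice comes from the wake-up zone or one of its neighbours."""
    return first_source == second_source or first_source in ADJACENT[second_source]
```

With the second-row left zone as the second source voice zone, speech from the main driving zone, the second-row right zone, or the third-row left zone still passes the check, so interaction continues after the user moves one zone over.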
Therefore, in the embodiments of the present disclosure, after the real-time interactive voice uttered by the user in the cabin of the target vehicle is acquired, the first source voice zone corresponding to the real-time interactive voice can be determined. If the first source voice zone is the target voice zone, voice recognition is performed on the real-time interactive voice to obtain the corresponding real-time control instruction. Because the target voice zone includes both the second source voice zone corresponding to the user's wake-up voice and the adjacent voice zones of the second source voice zone, the user can carry out voice interaction in either. Thus, during directional voice interaction between the vehicle-mounted voice assistant and the user, even if the user moves from the voice zone in which the assistant was woken up to an adjacent voice zone, voice interaction can proceed normally, improving the user experience.
Fig. 3 is a flowchart of another multi-zone speech recognition method provided by an embodiment of the present disclosure, which may be performed by a multi-zone speech recognition device disposed within a target vehicle. As shown in fig. 3, the multi-voice zone speech recognition method provided in this embodiment includes the following steps:
Step 301, acquiring real-time interactive voice emitted by a user in a cabin of a target vehicle.
Step 302, determining a first source voice zone corresponding to the real-time interactive voice.
And 303, performing voice recognition on the real-time interactive voice under the condition that the first source voice zone is the target voice zone, and obtaining a real-time control instruction corresponding to the real-time interactive voice.
Steps 301 to 303 in the embodiments of the present disclosure may refer to steps 201 to 203 above, and will not be described again here.
Step 304, determining whether the real-time control command is a complete control command.
The complete control instruction in the embodiments of the present disclosure may be understood as an instruction based on which the vehicle is capable of performing at least one specific action; it includes at least the operation mode of the instruction and the operation object of the instruction, i.e. it tells the multi-voice-zone voice recognition device how to perform the corresponding operation on the operation object.
An incomplete control command may be understood as a command that the vehicle cannot perform a specific action, the incomplete control command lacking at least one of the manner of operation of the command and the object of operation of the command.
For example, "play hot-list music" is a complete control instruction: the operation mode is "play" and the operation object is "hot-list music", so the multi-voice-zone voice recognition device can execute the operation of playing hot-list music based on the instruction.
By contrast, "play" alone is an incomplete control instruction that lacks an operation object, and the multi-voice-zone voice recognition device cannot determine what to act on based on it.
This is merely an exemplary illustration of complete and incomplete control instructions, and not the only one.
In the embodiment of the disclosure, after obtaining the real-time control instruction corresponding to the real-time interactive voice, the multi-voice zone voice recognition device can determine whether the real-time control instruction is a complete control instruction.
Specifically, the multi-voice-zone voice recognition device can determine whether the real-time control instruction is complete by checking whether it contains both an operation mode and an operation object. If both the operation mode and the operation object are present, the real-time control instruction is a complete control instruction; if both are missing, or only the operation mode is present, or only the operation object is present, the real-time control instruction is determined to be an incomplete control instruction.
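A minimal sketch of this completeness check, assuming the instruction has already been parsed into an operation mode and an operation object (the type and field names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlInstruction:
    operation: Optional[str]  # operation mode, e.g. "play"
    target: Optional[str]     # operation object, e.g. "hot-list music"

def is_complete(instr: ControlInstruction) -> bool:
    """Complete only when both the operation mode and the operation object are present."""
    return instr.operation is not None and instr.target is not None
```

An instruction missing either field fails the check and is held for merging with later speech rather than executed.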
Step 305, if the real-time control command is a complete control command, executing the real-time control command.
In the embodiment of the disclosure, if the real-time control instruction is a complete control instruction, the multi-voice-zone speech recognition device of the target vehicle executes the real-time control instruction.
According to the embodiments of the present disclosure, the real-time interactive voice uttered by the user in the cabin of the target vehicle is acquired; the first source voice zone corresponding to the real-time interactive voice is determined; when the first source voice zone is the target voice zone, voice recognition is performed on the real-time interactive voice to obtain the corresponding real-time control instruction; whether the real-time control instruction is a complete control instruction is judged; and if so, the real-time control instruction is executed. In this way, during directional voice interaction between the vehicle-mounted voice assistant and the user, even if the user moves from the voice zone in which the assistant was woken up to an adjacent voice zone, voice interaction can proceed normally, the control instruction of the voice interaction is guaranteed to be a complete, uninterrupted instruction, and the user experience is improved.
In some embodiments of the present disclosure, after determining whether the real-time control instruction is a complete control instruction, if it is not, the multi-voice-zone voice recognition device may query whether a historical incomplete control instruction exists among the stored instructions. If one is found, the real-time control instruction and the historical incomplete control instruction are merged to obtain a merged control instruction; if none exists, the real-time control instruction is stored as a historical incomplete control instruction.
Specifically, the historical incomplete control command may be generated based on at least one historical interactive voice from the target voice area, and the at least one historical interactive voice may be understood as an interactive voice which is continuous with the real-time interactive voice and is sent by the user.
It should be noted that the historical incomplete control command is generated based on the interactive voice of the target voice zone.
In some embodiments, based on the start time point of the real-time control instruction, the multi-voice-zone voice recognition device may look among its stored instructions for one whose end time point differs from that start time point by no more than a preset time threshold. If such an instruction exists, it is determined to be the historical incomplete instruction of the real-time control instruction. The preset time threshold can be set as required, and is not limited here.
Specifically, the start and end time points of an instruction can be determined through voice activity detection (Voice Activity Detection, VAD), a speech processing technique. A VAD application can be integrated into the multi-voice-zone speech recognition device to accurately locate the start and end time points of an instruction and report them to the device.
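The timing role VAD plays here can be sketched with a minimal energy-based detector (a simplified stand-in for a production VAD; the frame length, energy threshold, and signal layout are illustrative assumptions, not values from the disclosure):

```python
def detect_speech_bounds(samples, frame_len=160, energy_threshold=0.01):
    """Return (start_frame, end_frame) of the voiced region of a signal,
    or None if no frame exceeds the energy threshold."""
    energies = [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    voiced = [i for i, e in enumerate(energies) if e >= energy_threshold]
    if not voiced:
        return None
    return voiced[0], voiced[-1]

# Silence, then a burst of "speech", then silence (160 samples per frame).
signal = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 320
print(detect_speech_bounds(signal))  # (2, 3): frames 2..3 are voiced
```

Multiplying the returned frame indices by the frame duration yields the start and end time points that the device would store alongside each instruction.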
Specifically, whether an incomplete control instruction is a historical incomplete control instruction generated from interactive voice uttered by the user continuously with the real-time interactive voice can be determined by comparing the time difference between the start time point of the real-time control instruction and the end time point of the incomplete control instruction against a preset time threshold: if the difference is less than or equal to the threshold, the incomplete control instruction is determined to be the historical incomplete control instruction corresponding to the real-time control instruction. The preset time threshold may be set by the user as needed and is not limited herein.
For example, in fig. 1, "play hot list music" is a complete control instruction. A user speaks a first interactive voice "play" in the main driving seat area, whose corresponding voice zone is the main driving voice zone; then, after the main driving seat is laid down into the second-row left seat area, the user continues within the preset time threshold with a second interactive voice "hot list music", whose corresponding voice zone is the second-row left voice zone. The main driving voice zone is thus the first source voice zone.
After receiving the real-time control instruction "play", the multi-voice-zone speech recognition device judges that "play" is not a complete control instruction and queries the stored instructions for a historical incomplete control instruction. If the time difference between the start time point of "play" and the end time point of every stored incomplete control instruction is greater than the preset time threshold, the device determines that no historical incomplete control instruction of "play" exists among the stored instructions, and stores the real-time control instruction "play" as a historical incomplete control instruction.
When the multi-voice-zone speech recognition device receives the real-time control instruction "hot list music", it judges that "hot list music" is not a complete control instruction and queries the stored instructions for a historical incomplete control instruction. Because the time difference between the end time point of the stored incomplete control instruction "play" and the start time point of "hot list music" is less than or equal to the preset time threshold, the device merges the real-time control instruction "hot list music" with the historical incomplete control instruction "play" in time order, obtaining the merged control instruction "play hot list music". This is merely an exemplary illustration of a historical incomplete control instruction corresponding to a real-time control instruction, not a unique one.
In some embodiments, after obtaining the merge control command, the multi-voice zone speech recognition device may determine whether the merge control command is a complete control command, and if the merge control command is a complete control command, execute the merge control command. If the merge control command is not a complete control command, the merge control command may be stored as a new historical incomplete control command.
In the embodiment of the disclosure, after judging whether the real-time control instruction is a complete control instruction, if it is not, the multi-voice-zone speech recognition device may query the stored instructions for a historical incomplete control instruction. If one exists, the device merges the real-time control instruction with it to obtain a merged control instruction, then judges whether the merged control instruction is complete: if so, it executes the merged control instruction; if not, it stores the merged control instruction as a new historical incomplete control instruction. If no historical incomplete control instruction exists among the stored instructions, the real-time control instruction is stored as a historical incomplete control instruction. Through this multistage judgment of the real-time control instruction, the voice-interaction control instructions are further guaranteed to be complete, uninterrupted instructions, improving user experience.
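The multistage judgment above (complete check, history query, time-ordered merge, re-check, store) can be sketched as follows; the toy instruction set, function names, and the 2-second merge window are illustrative assumptions:

```python
MERGE_WINDOW_S = 2.0   # preset time threshold (illustrative value)
# Toy "grammar": the set of instructions the device treats as complete.
COMPLETE = {"play hot list music", "open the window", "turn on the ac"}

def is_complete(text):
    return text in COMPLETE

def handle_instruction(text, start_t, end_t, pending):
    """Multi-stage judgment: execute a complete instruction; otherwise try
    to merge with the pending (historical incomplete) instruction, and
    store whatever is still incomplete. Returns (action, text, pending)."""
    if is_complete(text):
        return "execute", text, pending
    if pending and start_t - pending["end"] <= MERGE_WINDOW_S:
        merged = pending["text"] + " " + text   # merge in time order
        if is_complete(merged):
            return "execute", merged, None
        return "store", merged, {"text": merged, "end": end_t}
    return "store", text, {"text": text, "end": end_t}

# "play" uttered at 0.0-0.5 s, then "hot list music" at 1.2-2.0 s.
action, text, pending = handle_instruction("play", 0.0, 0.5, None)
action, text, pending = handle_instruction("hot list music", 1.2, 2.0, pending)
print(action, "->", text)   # execute -> play hot list music
```

Had the second utterance arrived outside the merge window, it would have been stored on its own, matching the fallback branch described in the text.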
Fig. 4 is a flowchart of a target voice zone verification method according to an embodiment of the present disclosure. The method may be performed by a multi-voice-zone speech recognition device disposed in a target vehicle, and may be performed before verifying whether the first source voice zone is the target voice zone. As shown in fig. 4, the target voice zone verification method of this embodiment includes the following steps:
step 401, obtaining wake-up voice sent by a user.
In the embodiment of the disclosure, a microphone of a target vehicle collects wake-up voice sent by a user in a cabin and sends the wake-up voice to a multi-voice-zone voice recognition device of the target vehicle.
Step 402, determining a second source voice zone corresponding to the wake-up voice.
In an embodiment of the disclosure, after receiving a wake-up voice sent by a user, a multi-voice-zone voice recognition device of a target vehicle determines a voice zone corresponding to the wake-up voice, and determines the voice zone corresponding to the wake-up voice as a second source voice zone.
Step 403, determining the target voice zone based on the second source voice zone and the adjacent voice zones of the second source voice zone.
In an embodiment of the disclosure, the multi-voice-zone speech recognition device of the target vehicle determines the second source voice zone and the adjacent voice zones of the second source voice zone as the target voice zone.
According to the embodiment of the disclosure, after the wake-up voice sent by the user is obtained, the second source voice zone corresponding to the wake-up voice is determined, the target voice zone is determined based on the second source voice zone and the adjacent voice zones of the second source voice zone, and the target voice zone can be accurately determined, so that the multi-voice zone voice recognition is more accurate.
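As a sketch of step 403, the target voice zone can be derived from a table of adjacency relations. The zone names and which zones count as adjacent are assumptions here, since the disclosure does not enumerate them:

```python
# Hypothetical adjacency table for a four-zone, two-row cabin; the
# disclosure does not define which zones are adjacent, so this layout
# is assumed for illustration.
ADJACENT = {
    "main_driving": ["auxiliary_driving", "second_row_left"],
    "auxiliary_driving": ["main_driving", "second_row_right"],
    "second_row_left": ["main_driving", "second_row_right"],
    "second_row_right": ["auxiliary_driving", "second_row_left"],
}

def determine_target_zones(second_source_zone):
    """Target voice zone = the zone of the wake-up voice plus its
    adjacent zones."""
    return {second_source_zone, *ADJACENT.get(second_source_zone, [])}

print(sorted(determine_target_zones("main_driving")))
# ['auxiliary_driving', 'main_driving', 'second_row_left']
```

With this, a user who wakes the assistant from the main driving seat can keep interacting after leaning into the second-row left zone, as in the example of fig. 1.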
In other embodiments of the present disclosure, determining the target voice zone based on the second source voice zone and the adjacent voice zones of the second source voice zone may include the following steps 4031-4033:
step 4031, receiving the voice zone correction data.
In the embodiment of the present disclosure, an adjacent voice zone of the second source voice zone may be understood as a voice zone obtained after voice zone correction is performed on the second source voice zone, and voice zone correction may be understood as correcting the position of the voice zone.
The target vehicle may arrange at least one car screen in front of the voice zone of each row of seats, or arrange a car screen behind the front-row seat back of each voice zone. The car screen can display the speech recognition instruction and result of the voice zone the user belongs to, facilitating voice interaction between the user and the vehicle. When a seat is laid down, or a person turns sideways to talk so that the head and the torso fall in different zones, the second source voice zone determined from the zone of the wake-up voice may not match the actual situation, and the instruction displayed on the car screen of that zone is not in front of the speaker. For example, a user sits in the main driving seat and lays it back toward the second-row left area; when the user then performs voice interaction, the microphone of the second-row left voice zone collects the user's voice and displays it on the car screen of that zone or on the car screen behind the main driving seat back, so the user in the main driving seat can hardly see the displayed voice instruction. Therefore, the second source voice zone needs to be corrected so that the instructions displayed on its car screen are always in front of the speaker, meeting actual use requirements; the adjacent voice zones of the second source voice zone are determined accordingly.
The voice zone correction data in the embodiments of the present disclosure may include at least one of image data, seat pressure data, and seat pose data within the cabin.
The image data may be understood as image data sent to the multi-voice-zone speech recognition device by a camera in the target vehicle. Specifically, the target vehicle may arrange at least one camera in front of the voice zone of each row of seats, or behind the front-row seat back of each voice zone, or at the very front of the cabin, so that the cameras can capture an image of any voice zone in the cabin of the target vehicle; after converting the captured image into image data, the camera sends the image data to the multi-voice-zone speech recognition device.
The seat pressure data may be understood as the pressure on the seat surface, which may be obtained by a pressure sensor inside the seat; the seat pose data may include the seat back angle in the longitudinal direction of the vehicle, from head to tail.
Step 4032, determining a correction voice zone among the adjacent voice zones of the second source voice zone based on the voice zone correction data.
In the embodiment of the disclosure, the positions of the voice zones may be divided according to the positions of the seats. For example, in a three-row vehicle, the rows of voice zones include row one, row two, and row three in the direction from head to tail. Voice zones in the same row share the same preset voice zone correction coefficient: the first preset correction coefficient of row one is greater than the second preset correction coefficient of row two, and the second preset correction coefficient of row two is greater than the third preset correction coefficient of row three. The preset voice zone correction coefficient may be understood as the weight of the voice zone correction, and the weight for each row of seats may be set according to the practical application scenarios. Suppose each row has two seats: one seat of row one is at a first position and the other at a second position; one seat of row two is at a third position and the other at a fourth position; one seat of row three is at a fifth position and the other at a sixth position.
In some embodiments, the second source voice zone may be corrected based on the seat pressure data.
Optionally, when the pressure of the seat in the second source voice zone is greater than the preset pressure threshold and the pressure of the seat at every other position is less than the preset pressure threshold, no correction is performed on the second source voice zone.
Optionally, when the position of the second source voice zone is the first position in row one and the pressure of the seat at the first position is less than the preset pressure threshold: if the pressure of the seat at the second position in row one is greater than or equal to the preset pressure threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the second position is determined to be the correction voice zone. For example, when the second source voice zone is the main driving zone and the main driving seat pressure is less than the preset pressure threshold, if the auxiliary driving seat pressure is greater than or equal to the threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the auxiliary driving position is determined to be the correction voice zone.
optionally, if the pressures of the seats at any positions of the two rows and the three rows are greater than or equal to a preset pressure threshold, the pressures of the rest seats are less than the preset pressure threshold, and the second source sound zone is not corrected.
Optionally, when the position of the second source voice zone is the third position in row two and the pressure of the seat at the third position is less than the preset pressure threshold: if the pressure of the seat at the first position in row one is greater than or equal to the preset pressure threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the first position is determined to be the correction voice zone. For example, the second source voice zone is the second-row left area and its seat pressure is less than the preset pressure threshold; if the main driving seat pressure is greater than or equal to the threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the main driving position is determined to be the correction voice zone. If instead the pressure of the seat at the fourth position in row two is greater than or equal to the preset pressure threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the fourth position is determined to be the correction voice zone.
Optionally, if the pressures of the seats at the fourth position and at any position in row three are greater than or equal to the preset pressure threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the third position is determined to be the correction voice zone.
Optionally, when the position of the second source voice zone is the fifth position in row three and the pressure of the seat at the fifth position is less than the preset pressure threshold: if the pressure of the seat at the fourth position in row two is greater than or equal to the preset pressure threshold and the surface pressures of the remaining seats are less than the threshold, the voice zone at the fourth position is determined to be the correction voice zone.
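The per-position pressure rules above share one pattern: when the wake-up zone's own seat reads empty and exactly one other seat reads occupied, that seat's zone becomes the correction voice zone. A condensed sketch of that pattern (the threshold value and zone names are illustrative assumptions, and the full per-row coefficient logic is omitted):

```python
PRESSURE_THRESHOLD = 50.0   # illustrative units; "occupied" means >= threshold

def correct_by_pressure(second_source, pressures):
    """pressures: dict zone -> seat pressure. Returns the correction
    voice zone, or None when no correction applies under the condensed
    rule: correct only if the wake-up zone's seat is empty and exactly
    one other seat is occupied."""
    if pressures.get(second_source, 0.0) >= PRESSURE_THRESHOLD:
        return None   # speaker's own seat is occupied: no correction
    occupied = [zone for zone, p in pressures.items()
                if zone != second_source and p >= PRESSURE_THRESHOLD]
    return occupied[0] if len(occupied) == 1 else None

# Wake-up voice landed in the second-row left zone, but only the main
# driving seat is occupied (e.g. its back is laid down into that zone).
pressures = {"main_driving": 80.0, "auxiliary_driving": 0.0,
             "second_row_left": 0.0, "second_row_right": 0.0}
print(correct_by_pressure("second_row_left", pressures))  # main_driving
```

When more than one other seat is occupied, this condensed rule refuses to correct, mirroring the "remaining seats" conditions in the optional cases above.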
In still other embodiments, the second source voice zone may be corrected based on the seat pressure data and the seat pose data.
Optionally, when the position of the second source voice zone is the third position in row two and the pressure of the seat at the third position is less than the preset pressure threshold: if the pressure of the front-row seat corresponding to the third position and the pressure of the seat at the fourth position in row two are both greater than or equal to the preset pressure threshold, the pressures of the remaining seats are less than the threshold, and the backrest angle of the front-row seat is greater than or equal to the preset angle threshold, then the voice zone of the front-row seat is determined to be the correction voice zone. For example, when the second source voice zone is the second-row left area and its seat pressure is less than the preset pressure threshold, if the main driving seat pressure and the second-row right seat pressure are both greater than or equal to the threshold and the pressures of the remaining seats are less than the threshold, the voice zone at the main driving position is determined to be the correction voice zone. The scenario here is that the main driving seat is laid down toward the second-row left, the user speaks in the second-row left area, and the second-row right area is occupied; the second source voice zone is the second-row left, and after the voice zone correction, the voice zone at the main driving position is determined to be the final second source voice zone.
Optionally, when the position of the second source voice zone is the fifth position in row three and the pressure of the seat at the fifth position is less than the preset pressure threshold: if the pressures of the seats at the front-row position and the rear-row position corresponding to the fifth position are both greater than or equal to the preset pressure threshold, the pressures of the remaining seats are less than the threshold, and the backrest angle of the seat at the front-row position is greater than or equal to the preset angle threshold, then the voice zone at the front-row position is determined to be the correction voice zone.
The preset pressure threshold and the preset angle threshold in the embodiments of the present disclosure may be set as needed and are not particularly limited herein. When the pressure of a seat is greater than or equal to the preset pressure threshold, the seat may be considered occupied; when it is less than the threshold, the seat may be considered unoccupied. The seat back may be considered inclined toward the rear-row area when the backrest angle is greater than or equal to the preset angle threshold, and inclined toward its own area when the backrest angle is less than the threshold.
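A sketch of the pressure-plus-pose variant: the correction toward a front seat is applied only when that seat is both occupied and reclined past the angle threshold into the wake-up zone. The front-seat mapping, threshold values, and zone names are illustrative assumptions:

```python
PRESSURE_THRESHOLD = 50.0   # illustrative units
ANGLE_THRESHOLD = 40.0      # degrees; illustrative "laid back" limit
# Assumed mapping from a zone to the seat directly in front of it.
FRONT_OF = {"second_row_left": "main_driving",
            "second_row_right": "auxiliary_driving"}

def correct_by_pressure_and_pose(second_source, pressures, backrest_angles):
    """Correct toward the front seat only when the wake-up zone's seat is
    empty, the front seat is occupied, and the front seat's backrest is
    reclined into the wake-up zone (angle >= threshold)."""
    front = FRONT_OF.get(second_source)
    if front is None or pressures.get(second_source, 0.0) >= PRESSURE_THRESHOLD:
        return None
    if (pressures.get(front, 0.0) >= PRESSURE_THRESHOLD
            and backrest_angles.get(front, 0.0) >= ANGLE_THRESHOLD):
        return front
    return None

pressures = {"main_driving": 80.0, "second_row_left": 0.0}
print(correct_by_pressure_and_pose("second_row_left", pressures,
                                   {"main_driving": 55.0}))  # main_driving
print(correct_by_pressure_and_pose("second_row_left", pressures,
                                   {"main_driving": 10.0}))  # None
```

The second call shows why the pose data matters: an occupied front seat whose backrest is upright does not explain a voice landing in the rear zone, so no correction is made.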
In other embodiments, the second source voice zone may be corrected based on the image data:
Optionally, a person image is detected within the range of the seat corresponding to the second source voice zone in the image. If a human head is detected within the seat range, the person's sitting posture is judged: if the head is in the voice zone corresponding to the seat area but the body (the parts other than the head) is in another voice zone, the voice zone where the body is located is taken as the corrected second source voice zone; if both the head and the body are in the voice zone corresponding to the seat area, no correction is needed. If two heads are detected within the seat range in the image, no correction is needed.
In still other embodiments, the second source voice zone may be corrected based on the seat data and the image data:
alternatively, the multi-zone speech recognition device may acquire the seat data and the image data at the same time, or may acquire the image data after correcting the second source zone based on the seat data. The multi-voice-zone voice recognition device can firstly correct the second source voice zone based on the seat data to obtain a first confirmed second source voice zone, then judges whether the first confirmed second source voice zone is a person or not through analyzing the image data, if the first confirmed second source voice zone is a person, the correction is not performed, and the first confirmed second source voice zone is determined to be a final second source voice zone; if the second source voice zone confirmed for the first time is not found, the second source voice zone before the first time is confirmed as a final second source voice zone.
Step 4033, using the second source voice zone and the correction voice zone as the target voice zone.
In the embodiment of the disclosure, after performing voice zone correction on the second source voice zone to obtain its correction voice zone, the multi-voice-zone speech recognition device of the target vehicle may use the second source voice zone and the correction voice zone as the target voice zone.
According to the embodiment of the disclosure, voice zone correction data is received; a correction voice zone is determined among the adjacent voice zones of the second source voice zone based on the voice zone correction data; and the second source voice zone and the correction voice zone are used as the target voice zone. The target voice zone can thus be accurately determined, making the multi-voice-zone speech recognition more accurate.
Fig. 5 is a schematic structural diagram of a multi-voice-zone speech recognition device according to an embodiment of the present disclosure, which may be understood as part of the functional modules of the target vehicle. As shown in fig. 5, the multi-voice-zone speech recognition apparatus 500 includes:
a first obtaining module 501, configured to obtain real-time interactive voice sent by a user located in a cabin of a target vehicle;
a first determining module 502, configured to determine a first source voice zone corresponding to real-time interactive voice;
the recognition module 503 is configured to perform voice recognition on the real-time interactive voice under the condition that the first source voice zone is a target voice zone, so as to obtain a real-time control instruction corresponding to the real-time interactive voice, where the target voice zone includes a second source voice zone corresponding to the wake-up voice sent by the user and an adjacent voice zone of the second source voice zone.
According to the embodiment of the disclosure, after the first acquisition module acquires the real-time interactive voice uttered by a user in the cabin of the target vehicle, the first determination module can determine the first source voice zone corresponding to the real-time interactive voice. If the first source voice zone is the target voice zone, the recognition module performs speech recognition on the real-time interactive voice to obtain the corresponding real-time control instruction. Because the target voice zone includes the second source voice zone corresponding to the user's wake-up voice and the adjacent voice zones of the second source voice zone, the user can perform voice interaction both in the second source voice zone and in its adjacent voice zones. Thus, during directional voice interaction between the in-vehicle voice assistant and the user, even if the user moves from the voice zone where the assistant was woken up to an adjacent voice zone, voice interaction with the assistant can proceed normally, improving user experience.
Optionally, the multi-voice-zone speech recognition apparatus 500 further includes:
the first judging module is used for judging whether the real-time control instruction is a complete control instruction or not;
the query module is used for querying whether a historical incomplete control instruction exists or not if the real-time control instruction is not the complete control instruction, the historical incomplete control instruction is generated based on at least one historical interactive voice from a target voice area, and the at least one historical interactive voice is interactive voice which is sent by a user and is continuous with the real-time interactive voice;
and the merging module is used for merging the real-time control instruction and the historical incomplete control instruction if the historical incomplete control instruction is inquired to obtain a merged control instruction.
Optionally, the multi-voice-zone speech recognition apparatus 500 further includes:
the second judging module is used for judging whether the combined control instruction is a complete control instruction or not;
the first execution module is used for executing the merging control instruction if the merging control instruction is a complete control instruction;
and the first storage module is used for storing the merging control instruction as a new historical incomplete control instruction if the merging control instruction is not the complete control instruction.
Optionally, the multi-voice-zone speech recognition apparatus 500 further includes:
And the second storage module is used for storing the real-time control instruction as the historical incomplete control instruction if no historical incomplete control instruction exists.
Optionally, the multi-voice-zone speech recognition apparatus 500 further includes:
and the second execution module is used for executing the real-time control instruction if the real-time control instruction is a complete control instruction.
Optionally, the multi-voice-zone speech recognition apparatus 500 further includes:
the second acquisition module is used for acquiring wake-up voice sent by a user;
the second determining module is used for determining a second source voice zone corresponding to the wake-up voice;
and a third determining module for determining the target sound zone based on the second source sound zone and the adjacent sound zones of the second source sound zone.
Optionally, the third determining module includes:
a receiving submodule, used for receiving the voice zone correction data;
a first determining submodule, used for determining a correction voice zone among the adjacent voice zones of the second source voice zone based on the voice zone correction data;
a second determining submodule, used for taking the second source voice zone and the correction voice zone as the target voice zone;
wherein the voice zone correction data includes at least one of image data, seat pressure data, and seat pose data in the cabin.
The multi-voice zone voice recognition device provided in this embodiment can execute the method described in any of the foregoing embodiments, and the execution manner and the beneficial effects thereof are similar, and are not repeated here.
Fig. 6 shows a schematic structural diagram of a vehicle provided by an embodiment of the present disclosure.
As shown in fig. 6, the vehicle may include a processor 601 and a memory 602 storing computer program instructions.
In particular, the processor 601 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.
Memory 602 may include a mass storage for information or instructions. By way of example, and not limitation, memory 602 may include a Hard Disk Drive (HDD), floppy Disk Drive, flash memory, optical Disk, magneto-optical Disk, magnetic tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of these. The memory 602 may include removable or non-removable (or fixed) media, where appropriate. The memory 602 may be internal or external to the integrated gateway device, where appropriate. In a particular embodiment, the memory 602 is a non-volatile solid state memory. In a particular embodiment, the Memory 602 includes Read-Only Memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (Electrical Programmable ROM, EPROM), electrically erasable PROM (Electrically Erasable Programmable ROM, EEPROM), electrically rewritable ROM (Electrically Alterable ROM, EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The processor 601 reads and executes the computer program instructions stored in the memory 602 to perform the steps of the multi-voice-zone speech recognition method provided by the embodiments of the present disclosure.
In one example, the vehicle may also include a transceiver 603 and a bus 604. As shown in fig. 6, the processor 601, the memory 602, and the transceiver 603 are connected to each other through the bus 604 and perform communication with each other.
Bus 604 includes hardware, software, or both. By way of example, and not limitation, the buses may include an accelerated graphics port (Accelerated Graphics Port, AGP) or other graphics BUS, an enhanced industry standard architecture (Extended Industry Standard Architecture, EISA) BUS, a Front Side BUS (FSB), a HyperTransport (HT) interconnect, an industry standard architecture (Industrial Standard Architecture, ISA) BUS, an InfiniBand interconnect, a Low Pin Count (LPC) BUS, a memory BUS, a micro channel architecture (Micro Channel Architecture, MCa) BUS, a peripheral control interconnect (Peripheral Component Interconnect, PCI) BUS, a PCI-Express (PCI-X) BUS, a serial advanced technology attachment (Serial Advanced Technology Attachment, SATA) BUS, a video electronics standards association local (Video Electronics Standards Association Local Bus, VLB) BUS, or other suitable BUS, or a combination of two or more of these. Bus 604 may include one or more buses, where appropriate. Although embodiments of the application have been described and illustrated with respect to a particular bus, the application contemplates any suitable bus or interconnect.
Further, the vehicle may also include a human-machine interaction device, such as a car screen, which can communicate with the processor through the bus. The human-machine interaction device can display the voice control instruction and the corresponding execution result to the user, making the voice interaction visible and improving the user's voice interaction experience.
The embodiment of the disclosure further provides a computer readable storage medium, where the storage medium may store a computer program, where when the computer program is executed by a processor, the processor implements the multi-voice-zone speech recognition method provided by the embodiment of the disclosure, and the execution manner and the beneficial effects of the method are similar, and are not repeated herein.
The storage medium may, for example, comprise the memory 602 storing computer program instructions that are executable by the processor 601 of the multi-voice-zone speech recognition device to perform the multi-voice-zone speech recognition method provided by embodiments of the present disclosure.
The computer readable storage media described above can employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer program described above may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
The foregoing describes specific embodiments of the disclosure to enable those skilled in the art to understand or practice it. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the disclosure is not limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A multi-voice-zone speech recognition method, comprising:
acquiring real-time interactive voice sent by a user in a cabin of a target vehicle;
determining a first source voice zone corresponding to the real-time interactive voice;
and under the condition that the first source voice zone is a target voice zone, carrying out voice recognition on the real-time interactive voice to obtain a real-time control instruction corresponding to the real-time interactive voice, wherein the target voice zone comprises a second source voice zone corresponding to wake-up voice sent by the user and adjacent voice zones of the second source voice zone.
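As an illustrative, non-authoritative sketch of the flow in claim 1 (the zone layout, `locate_zone`, and `recognize` callables are invented stand-ins, not part of the patent text), the zone filter can be expressed as: recognize real-time interactive voice only when its source voice zone is the wake-up voice zone or one of that zone's neighbours.

```python
def adjacent_zones(zone, layout):
    """Return the set of voice zones adjacent to `zone` in a cabin layout map."""
    return set(layout.get(zone, ()))

def handle_interactive_voice(audio, locate_zone, recognize, wake_zone, layout):
    """Recognize `audio` only when its source zone is the target voice zone
    (the wake-up zone plus its adjacent zones); otherwise discard it."""
    target_zones = {wake_zone} | adjacent_zones(wake_zone, layout)
    first_source_zone = locate_zone(audio)      # e.g. from DOA estimation
    if first_source_zone in target_zones:
        return recognize(audio)                 # real-time control instruction
    return None                                 # voice from a non-target zone

# Example: a four-seat cabin where the driver (zone 0) uttered the wake-up voice.
LAYOUT = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}
cmd = handle_interactive_voice(
    audio=b"...",
    locate_zone=lambda a: 1,            # pretend localization points at zone 1
    recognize=lambda a: "open window",  # pretend ASR result
    wake_zone=0,
    layout=LAYOUT,
)
# zone 1 is adjacent to zone 0, so the command is recognized
```

Voice from zone 3 in the same example would be dropped, since zone 3 is neither the wake-up zone nor adjacent to it.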
2. The method according to claim 1, wherein after the performing voice recognition on the real-time interactive voice to obtain the real-time control instruction corresponding to the real-time interactive voice, the method further comprises:
judging whether the real-time control instruction is a complete control instruction;
if the real-time control instruction is not a complete control instruction, querying whether a historical incomplete control instruction exists, wherein the historical incomplete control instruction is generated based on at least one historical interactive voice from the target voice zone, and the at least one historical interactive voice is an interactive voice which is sent by the user and is continuous with the real-time interactive voice;
if the historical incomplete control instruction is found to exist, merging the real-time control instruction with the historical incomplete control instruction to obtain a merged control instruction.
3. The method of claim 2, wherein after the merging the real-time control instruction with the historical incomplete control instruction to obtain the merged control instruction, the method further comprises:
judging whether the merged control instruction is a complete control instruction;
if the merged control instruction is a complete control instruction, executing the merged control instruction;
and if the merged control instruction is not a complete control instruction, storing the merged control instruction as a new historical incomplete control instruction.
4. The method of claim 2, wherein after the querying whether a historical incomplete control instruction exists, the method further comprises:
if no historical incomplete control instruction exists, storing the real-time control instruction as the historical incomplete control instruction.
5. The method of claim 2, wherein after the judging whether the real-time control instruction is a complete control instruction, the method further comprises:
if the real-time control instruction is a complete control instruction, executing the real-time control instruction.
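The dialogue logic of claims 2 through 5 can be read as a small state machine over consecutive utterances. The sketch below is only an illustration under invented assumptions: instructions are plain strings, merging is string concatenation, and `is_complete` is a toy completeness test supplied by the caller; none of these specifics come from the patent.

```python
class InstructionMerger:
    """Accumulates incomplete control instructions from continuous utterances
    and executes an instruction only once it is complete (claims 2-5)."""

    def __init__(self, is_complete):
        self.is_complete = is_complete
        self.pending = None   # the historical incomplete control instruction

    def on_instruction(self, instr):
        """Return an instruction ready to execute, or None if still incomplete."""
        if self.is_complete(instr):
            return instr                        # claim 5: execute directly
        if self.pending is None:
            self.pending = instr                # claim 4: store as history
            return None
        merged = f"{self.pending} {instr}"      # claim 2: merge with history
        if self.is_complete(merged):
            self.pending = None
            return merged                       # claim 3: execute merged
        self.pending = merged                   # claim 3: store merged as new history
        return None

# Toy completeness test: an instruction is complete when it names both an
# action and an object.
complete = lambda s: "open" in s and "window" in s
merger = InstructionMerger(complete)
first = merger.on_instruction("open")         # incomplete, stored as history
second = merger.on_instruction("the window")  # merged -> "open the window"
```

Here the first utterance yields no executable instruction, while the second completes the pending one and returns the merged result.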
6. The method as recited in claim 1, further comprising:
acquiring the wake-up voice sent by the user;
determining a second source voice zone corresponding to the wake-up voice;
and determining the target voice zone based on the second source voice zone and the adjacent voice zones of the second source voice zone.
7. The method of claim 6, wherein the determining the target voice zone based on the second source voice zone and the adjacent voice zones of the second source voice zone comprises:
receiving voice zone correction data;
determining a correction voice zone among the adjacent voice zones of the second source voice zone based on the voice zone correction data;
taking the second source voice zone and the correction voice zone as the target voice zone;
wherein the voice zone correction data includes at least one of image data, seat pressure data, and seat pose data within the cabin.
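One plausible reading of claim 7, sketched below with invented specifics: among the zones adjacent to the wake-up zone, keep as correction voice zones only those the correction data indicates are occupied. The seat-pressure representation (zone id mapped to an occupancy flag) is an assumption for the example, not the patent's data format.

```python
def determine_target_zones(second_source_zone, layout, seat_pressure):
    """Build the target voice zone set from the wake-up zone and those of its
    adjacent zones whose seats the correction data marks as occupied.

    seat_pressure maps zone id -> True when the seat is occupied."""
    adjacent = layout.get(second_source_zone, ())
    correction_zones = {z for z in adjacent if seat_pressure.get(z)}
    return {second_source_zone} | correction_zones

# Example: driver in zone 0 woke the assistant; only the zone-1 seat is occupied.
LAYOUT = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}
targets = determine_target_zones(0, LAYOUT, {1: True, 2: False})
# only the occupied neighbour (zone 1) joins the wake-up zone in the target set
```

Image data or seat pose data could feed the same occupancy map through a different detector; the zone-selection step itself is unchanged.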
8. A multi-voice-zone speech recognition device, the device comprising:
the first acquisition module is used for acquiring real-time interactive voice sent by a user in a cabin of the target vehicle;
the first determining module is used for determining a first source voice zone corresponding to the real-time interactive voice;
the recognition module is used for carrying out voice recognition on the real-time interactive voice under the condition that the first source voice zone is a target voice zone, so as to obtain a real-time control instruction corresponding to the real-time interactive voice, wherein the target voice zone comprises a second source voice zone corresponding to wake-up voice sent by the user and an adjacent voice zone of the second source voice zone.
9. A vehicle, characterized in that the vehicle comprises:
a memory and a processor, wherein the memory has stored therein a computer program which, when executed by the processor, implements the multi-voice-zone speech recognition method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the multi-voice-zone speech recognition method according to any one of claims 1-7.
CN202210258683.5A 2022-03-16 2022-03-16 Multi-voice-zone voice recognition method, device, vehicle and storage medium Pending CN116798421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210258683.5A CN116798421A (en) 2022-03-16 2022-03-16 Multi-voice-zone voice recognition method, device, vehicle and storage medium

Publications (1)

Publication Number Publication Date
CN116798421A true CN116798421A (en) 2023-09-22

Family

ID=88044385



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination