CN116978379A - Voice instruction generation method and device, readable storage medium and electronic equipment - Google Patents

Voice instruction generation method and device, readable storage medium and electronic equipment

Info

Publication number
CN116978379A
CN116978379A (application number CN202311005351.7A)
Authority
CN
China
Prior art keywords
user
information
voice
determining
user information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311005351.7A
Other languages
Chinese (zh)
Inventor
陶然 (Tao Ran)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Horizon Integrated Circuit Co., Ltd.
Original Assignee
Nanjing Horizon Integrated Circuit Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Horizon Integrated Circuit Co., Ltd.
Priority to CN202311005351.7A
Publication of CN116978379A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/08 - Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0861 - Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/10 - Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/105 - Multiple levels of security
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present disclosure disclose a voice instruction generation method and apparatus, a computer-readable storage medium, and an electronic device. The method includes: acquiring user sensing information and determining, based on the user sensing information, first user information of a user located in a target area; determining target user information of the user in the target area based on the first user information and a user information base; in response to receiving a voice control signal, determining a voice attribute of the voice control signal; determining a matching relationship between the voice attribute and the target user information; and, in response to the matching relationship being a match, generating a voice instruction corresponding to the voice control signal. In this way, the source of the voice control signal is matched automatically, the voice control authority of the target user is not taken over by other users, and the risk of misrecognizing the voice control signal is reduced. At the same time, the target user information of the user in the target area is recorded automatically, improving convenience for the user.

Description

Voice instruction generation method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a voice instruction generation method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the development of artificial intelligence technology, scenarios in which devices are controlled by voice have become increasingly common, but misrecognition and similar problems have come with them.
For example, a vehicle can be controlled by voice to adjust window height, control in-vehicle multimedia playback, control the navigation program, and so on. For driving safety, some application functions need to restrict the control authority of users; voice-controlled navigation, for instance, may be limited to the person in the driver's seat. The source of a voice instruction therefore needs to be located accurately, so that people not in the driver's seat do not conflict with the voice control authority of the person in the driver's seat and affect driving safety.
At present, a sound-zone localization method is generally used to locate the source of speech, and multiple microphones are required to collect multiple voice signals for sound-zone localization. However, multiple microphones cannot solve the problem that a person who is not in the driver's seat but leans toward it is misrecognized as the person in the driver's seat.
Disclosure of Invention
To solve the above technical problems, embodiments of the present disclosure provide a voice instruction generation method and apparatus, a computer-readable storage medium, and an electronic device, so as to reduce the risk of misrecognition in voice control and to record the user information of a target area automatically without requiring active registration by the user, thereby improving convenience of use.
An embodiment of the present disclosure provides a voice instruction generation method, including: acquiring user sensing information and determining, based on the user sensing information, first user information of a user located in a target area; determining target user information of the user in the target area based on the first user information and a user information base; in response to receiving a voice control signal including a control instruction, determining a voice attribute of the voice control signal; determining a matching relationship between the voice attribute and the target user information; and, in response to the matching relationship being a match, generating a voice instruction corresponding to the voice control signal.
According to another aspect of the embodiments of the present disclosure, there is provided a voice instruction generation apparatus, including: an acquisition module for acquiring user sensing information and determining, based on the user sensing information, first user information of a user located in a target area; a first determining module for determining target user information of the user in the target area based on the first user information and a user information base; a second determining module for determining, in response to receiving a voice control signal including a control instruction, a voice attribute of the voice control signal; a third determining module for determining a matching relationship between the voice attribute and the target user information; and a generating module for generating, in response to the matching relationship being a match, a voice instruction corresponding to the voice control signal.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program that, when executed by a processor, implements the voice instruction generation method described above.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute them to implement the voice instruction generation method described above.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product including computer program instructions that, when executed by a processor, perform the voice instruction generation method proposed by the present disclosure.
According to the voice instruction generation method and apparatus, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, user information is recorded in a user information base; target user information of the user is determined based on the recognized first user information and the user information base; when a voice control signal is received, the voice attribute of the voice control signal is determined; and if the voice attribute matches the target user information, a voice instruction is generated. The source of the voice control signal is thus matched automatically, the voice control authority of the target user is not taken over by other users, and the risk of misrecognizing the voice control signal is reduced. At the same time, the target user information of the user in the target area is recorded automatically, without requiring active registration by the user, which improves convenience of use.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps;
FIG. 1 is a system diagram to which the present disclosure is applicable;
FIG. 2 is a flow chart of a method for generating voice instructions according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of a method for generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 8 is a flow chart of a method of generating voice instructions provided by another exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of a method for generating voice instructions provided by another exemplary embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a voice command generating apparatus according to an exemplary embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a voice instruction generating apparatus provided in another exemplary embodiment of the present disclosure;
fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
For the purpose of illustrating the present disclosure, exemplary embodiments of the present disclosure are described in detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by these exemplary embodiments.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Summary of the application
To reduce confusion of the voice control authority of users in different positions, a sound-zone localization method is generally used to locate the source of speech, and multiple microphones are required to collect multiple voice signals for sound-zone localization. For example, in the field of voice-controlled vehicles, multiple microphones cannot solve the problem that a person who is not in the driver's seat but approaches it is misrecognized as the person in the driver's seat. If lip movement information recognized from images is combined with the acoustic method, the accuracy of sound-zone localization can be improved, but this still cannot solve the problem that the lip movement of a person who is not in the driver's seat but approaches it is misrecognized as the lip movement of the person in the driver's seat.
To solve the above problems, embodiments of the present disclosure use user sensing information to identify the user in a target area, determine the target user information of the user with the help of a pre-established user information base, and, when a voice control signal is acquired, match the voice control signal against the target user information. The speech of other users is thus prevented from being misrecognized as the speech of the user, which reduces the risk of misrecognizing the voice control signal; at the same time, the target user information can be recorded in the user information base automatically, without requiring active registration by the user, improving convenience of use.
Exemplary System
Fig. 1 illustrates an exemplary system architecture 100 of a voice instruction generation method or a voice instruction generation apparatus to which embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include a terminal device 101, a network 102, a server 103, and a sensing device 104. Network 102 is the medium used to provide a communication link between the terminal device 101 and the server 103. Network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The sensing device 104 may be disposed in various spaces, such as in a vehicle or a room. The sensing device 104 may include, but is not limited to, a camera, a microphone, and the like. The sensing device 104 can sense the user in the space, thereby obtaining user sensing information such as images and audio.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. The terminal device 101 may have various applications installed thereon, such as a multimedia application, an electronic map application, and the like.
The terminal device 101 may be various electronic devices including, but not limited to, mobile terminals such as in-vehicle terminals, mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and the like, and fixed terminals such as smart home appliances, desktop computers, and the like.
The server 103 may be a server providing various services, for example a background server that recognizes user sensing information uploaded by the terminal device 101. The background server may recognize the received user sensing information, generate a voice instruction, and feed information such as the voice instruction back to the terminal device 101.
It should be noted that, the method for generating a voice command provided in the embodiment of the present disclosure may be executed by the server 103 or may be executed by the terminal device 101, and accordingly, the device for generating a voice command may be provided in the server 103 or may be provided in the terminal device 101. When the method for generating the voice command is executed by the server 103, the user sensing information collected by the sensing device 104 may be directly transmitted to the server 103 through the network, or may be transmitted to the server 103 through the terminal device 101.
It should be understood that the number of terminal devices, networks, servers and sensing devices in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, servers, and sensing devices, as desired for implementation.
Exemplary method
Fig. 2 is a flowchart illustrating a method for generating a voice command according to an exemplary embodiment of the present disclosure. The present embodiment is applicable to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 2, the method includes the steps of:
step 201, obtaining user sensing information, and determining first user information of a user located in a target area based on the user sensing information.
The user sensing information may be information collected by the sensing device 104 shown in fig. 1 for a user within its sensing range. For example, a camera and a microphone array may be arranged in a target space such as a vehicle cabin; an image containing the user captured by the camera and a voice signal of the user collected by the microphone array can both be used as user sensing information. In addition, devices such as an infrared human-body detector, a pressure sensor, operation buttons, and a display screen can be arranged in the target space, and user information sensed by these devices can also be used as user sensing information.
The electronic device can determine, from the user sensing information, whether there is a user in the target area. The target area may be a pre-designated area, or an area in which the user sensing information indicates that a user is present. For example, when the target space is a vehicle cabin, the target area may be the driver's seat, the front passenger seat, the rear-left window seat, the rear-right window seat, the middle of the rear row, and so on. When the target area is the driver's seat of the vehicle, the user is the user located in the driver's seat. In addition, there may be multiple target areas; when the number of target areas is greater than 1, the method may be performed for each target area, that is, the user corresponding to each target area is determined. The first user information may include information obtained by recognizing the user sensing information. For example, when the user sensing information includes an image captured by the camera, the first user information may include facial feature information of the user; generally, the facial feature information of the user can be obtained by recognizing the facial image of the user in the target area of the image. As another example, when the user sensing information includes an audio signal collected by a microphone, the first user information may include a voice signal of the user; generally, a sound source localization method may be used to extract, from the multi-channel voice signals collected by multiple microphones, the voice signal corresponding to the driver's seat as the voice signal of the user.
In addition, the first user information may further include target location information indicating a location where the user is located, that is, a location of the target area. The target location information may be a code number indicating the target area, or may be a location coordinate where the user is located.
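As a purely illustrative sketch of step 201 (the data structure, encoder callables, and field names below are hypothetical and not part of this disclosure), the first user information can be assembled from whichever sensing modalities are available:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class FirstUserInfo:
    target_area: str                      # e.g. "driver_seat" (code for the target area)
    face_features: Optional[list] = None  # facial feature vector, if an image was sensed
    voiceprint: Optional[list] = None     # voiceprint vector, if a voice signal was sensed

def build_first_user_info(target_area: str,
                          face_image_seq: Optional[Sequence] = None,
                          speech_signal: Optional[Sequence] = None,
                          face_encoder=None,
                          voiceprint_encoder=None) -> FirstUserInfo:
    """Derive first user information from whatever sensing data is available."""
    info = FirstUserInfo(target_area=target_area)
    if face_image_seq is not None and face_encoder is not None:
        # Recognize the facial image of the user in the target area.
        info.face_features = face_encoder(face_image_seq)
    if speech_signal is not None and voiceprint_encoder is not None:
        # Recognize the voice signal attributed to the target area.
        info.voiceprint = voiceprint_encoder(speech_signal)
    return info
```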
Step 202, determining target user information of a user in a target area based on the first user information and the user information base.
The user information base is used for storing information representing user identities, that is, information registered by users, including facial feature information, voiceprint information, and the like. As an example, the first user information may be matched against the information in the user information base; if information matching the first user information exists in the user information base, the registered information of the matching user is used as the target user information, and if no information matching the first user information exists in the user information base, the first user information is used as the target user information. In the latter case, registration information for the user may also be created in the user information base based on the first user information.
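A minimal sketch of step 202 might look as follows, assuming a list-based user information base and an externally supplied similarity function (both hypothetical); an unmatched user is registered automatically, consistent with the automatic recording described above:

```python
def resolve_target_user_info(first_info, user_info_base, similarity, threshold=0.8):
    """Return registered information if a match exists; otherwise register first_info."""
    for registered in user_info_base:
        if similarity(first_info, registered) >= threshold:
            return registered              # previously registered identity found
    user_info_base.append(first_info)      # automatic registration, no user action needed
    return first_info
```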
Step 203, in response to receiving a voice control signal including a control instruction, determining a voice attribute of the voice control signal.
The voice control signal may be a voice signal emitted by any user and collected by a microphone or a microphone array, and it includes a control instruction for controlling a target device. The voice attribute may include, but is not limited to, the source position of the voice control signal, audio feature information, voiceprint information, and the like, and may also include text information obtained by recognizing the voice control signal, where the text information may include an operation target, instruction content, and the like. Methods for determining the voice attribute may include, but are not limited to, sound source localization methods, audio feature extraction methods, voiceprint recognition methods, and the like.
Optionally, when detecting the user sensing information, if it is determined that the user sensing information includes a voice control signal, the first user information may be generated based on the voice control signal, and a voice attribute of the voice control signal may be determined.
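The voice attribute of step 203 can be pictured as a simple record; the localization, voiceprint-embedding, and speech-recognition callables below are placeholders for whichever concrete methods are used and are not named by this disclosure:

```python
from dataclasses import dataclass

@dataclass
class VoiceAttribute:
    source_area: str   # result of sound source / sound-zone localization
    voiceprint: list   # voiceprint vector of the control signal
    text: str          # recognized text, e.g. "turn on the headlight"

def extract_voice_attribute(signal, localize, embed_voiceprint, transcribe) -> VoiceAttribute:
    """Determine the attributes of an incoming voice control signal."""
    return VoiceAttribute(
        source_area=localize(signal),
        voiceprint=embed_voiceprint(signal),
        text=transcribe(signal),
    )
```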
Step 204, determining the matching relationship between the voice attribute and the target user information.
As an example, if the source location included in the voice attribute is consistent with the target area and the voiceprint information included in the voice attribute is consistent with the voiceprint information included in the target user information, it may be determined that the voice attribute matches the target user information.
Optionally, the voice attribute may further include text information obtained by recognizing the voice control signal. If the operation target and instruction content contained in the text information match the operation authority of the source position, that is, if the user who is located at the source position and issued the voice control signal has operation authority over the operation target and is allowed to make the operation target execute the instruction content, the voice control signal is determined to be valid. The present disclosure does not specifically limit the correspondence between positions and operation authority.
When the voice control signal is judged to be valid, the matching relationship between the voice attribute and the target user information may be determined further using the other information included in the voice attribute, or it may be determined directly that the voice attribute matches the target user information. When the voice control signal is judged to be invalid, it can be determined directly that the voice attribute does not match the target user information. By first judging the validity of the voice control signal using the text information in the voice attribute and only then determining the matching relationship between the voice attribute and the target user information, it is unnecessary to check that all data included in the voice attribute match, which reduces the steps needed to judge the matching relationship and improves the efficiency of generating the voice instruction.
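As an illustration of step 204 (reusing the hypothetical records from the earlier sketches; the permission table, similarity function, threshold, and the simple substring check on the transcript are all assumptions), validity is judged from the text information first and the remaining attributes are compared only afterwards:

```python
def voice_matches_target_user(attr, target_info, permissions, similarity, threshold=0.8):
    """Match a voice attribute against the target user information.

    `permissions` maps a source area to the operation targets it may control.
    """
    # Validity check based on the recognized text: the speaker's position must
    # carry operation authority over the operation target named in the command.
    allowed_targets = permissions.get(attr.source_area, set())
    if not any(target in attr.text for target in allowed_targets):
        return False                              # invalid voice control signal
    if attr.source_area != target_info.target_area:
        return False                              # wrong sound zone
    if target_info.voiceprint is not None:
        if similarity(attr.voiceprint, target_info.voiceprint) < threshold:
            return False                          # voiceprint does not match
    return True
```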
Step 205, generating a voice instruction corresponding to the voice control signal in response to the matching relationship being a match.
Specifically, the voice control signal may be converted into text, and the voice instruction is generated from the text. For example, if the voice control signal is converted into the text "turn on the headlight", a voice instruction indicating that the headlight should be turned on may be generated. Optionally, the method described in the above alternative embodiment for judging whether the voice control signal is valid based on the text information in the voice attribute may be included in step 205. That is, after the parts of the voice attribute other than the text information are judged to match the target user information, the text information in the voice attribute may further be used to judge whether the voice control signal is valid, and the voice instruction is generated only if it is valid, which improves the accuracy of voice instruction generation.
Optionally, when the method is applied to the field of vehicle control, if any person in the vehicle issues a voice control signal, the voice control signal can be recognized and its voice attribute determined. When the target area is the driver's seat, if the voice control signal is recognized as having been issued by the driver, the voice attribute is determined to match the target user information, and a voice instruction is generated; when the target area is set to the front passenger seat (other seats are also possible), if the voice control signal is recognized as having been issued by the person in the front passenger seat, the voice attribute is determined to match the target user information, and a voice instruction is generated.
When the number of target areas is greater than 1, whether the voice attribute matches the target user information may be determined separately based on the correspondence between each target area and the user's operation authority. For example, the target areas may include the driver's seat and a rear-row window-side seat. When the user in the rear-row window-side seat issues the voice control signal "open the window beside the driver", it is judged that this user has no operation authority over the window beside the driver, that is, the voice control signal is invalid, and no control instruction is generated. When the user in the rear-row window-side seat issues the voice control signal "open the window beside me", it is judged that this user has operation authority over the window beside him or her, that is, the voice control signal is valid, and a control instruction is generated.
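The per-area authority in the example above could be captured by a simple table such as the following sketch (seat names and operation targets are illustrative only, not defined by this disclosure):

```python
# Hypothetical permission table for a cabin with several target areas.
PERMISSIONS = {
    "driver_seat":           {"navigation", "driver_window", "wipers"},
    "rear_left_window_seat": {"rear_left_window", "media"},
}

def generate_voice_instruction(source_area, operation_target, instruction_content):
    """Turn a validated voice control signal into an executable instruction, or None."""
    if operation_target not in PERMISSIONS.get(source_area, set()):
        return None                      # signal invalid for this seat: no instruction
    return {"target": operation_target, "action": instruction_content}
```

With such a table, the instruction "open the window beside the driver" issued from the rear-row window-side seat names an operation target outside that seat's set, so no instruction is generated.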
According to the method provided by this embodiment of the present disclosure, user information is recorded in a user information base; the target user information of the user in the target area is determined based on the recognized first user information and the user information base; when a voice control signal is received, the voice attribute of the voice control signal is determined; and if the voice attribute matches the target user information, a voice instruction is generated. The source of the voice control signal is thus matched automatically, the voice control authority of the target user is not taken over by other users, and the risk of misrecognizing the voice control signal is reduced. At the same time, the target user information of the user in the target area is recorded automatically, without requiring active registration by the user, which improves convenience of use.
In some alternative implementations, as shown in fig. 3, step 201 includes:
in step 2011, a first facial image sequence of the user and facial feature information of the user located in the target area are determined based on the image sequence included in the user sensing information.
The first facial image sequence may be an image sequence obtained by performing facial recognition on an image sequence included in the user sensing information, and the facial feature information may be information obtained by performing facial feature recognition on a facial image included in the first facial image sequence. Alternatively, the facial feature information may include Face identity information (Face ID) for identifying the identity of the user.
In step 2012, first voiceprint information of the user located in the target area is determined based on the voice signal included in the user-sensed information.
Specifically, voiceprint recognition may be performed on the voice signal to obtain first voiceprint information.
Step 2013, determining first user information based on facial feature information of the user and/or first voiceprint information of the user.
Specifically, the facial feature information may be determined as the first user information, the first voiceprint information may be determined as the first user information, and the facial feature information and the first voiceprint information may be determined together as the first user information.
It should be appreciated that steps 2011 and 2012 described above may be performed in whole or in part, i.e., step 2011 and/or step 2012 may be performed. For example, step 2012 may be skipped when the user sensing information includes only the image sequence, and step 2011 may be skipped when the user sensing information includes only the voice signal.
In this embodiment, by recognizing the image sequence and/or the voice signal included in the sensing information, the identity of the user is identified from richer information and can therefore be determined accurately.
In some alternative implementations, as shown in fig. 4, step 2012 includes:
in step 20121, a first sound source position and initial voiceprint information of the voice signal are determined based on the voice signal included in the user-sensed information.
Specifically, the electronic device may perform sound source localization on the voice signal based on a sound source localization method to determine a first sound source position representing a source position of the voice signal; and based on the voiceprint recognition method, voiceprint recognition is carried out on the voice signal so as to determine initial voiceprint information.
Step 20122, determining a matching relationship between the first sound source position and the position of the target area.
Specifically, if it is determined that the first sound source position is located within the target area, it is determined that the first sound source position matches the position of the target area.
In step 20123, in response to the matching relationship between the first sound source position and the position of the target area being a match, a first face image sequence of the user is obtained, and lip movement detection is performed on the first face image sequence, so as to obtain a first lip movement detection result of the user.
The lip movement detection result may indicate the meaning of the lip movement of the user. Optionally, the electronic device may detect in real time whether the lips of the user produce a speaking action, and generate a first lip movement detection result indicating whether to speak. Optionally, the electronic device may further detect semantics of the lip motion of the user in real time by using a lip recognition method based on a neural network, and generate a first lip motion detection result that indicates the semantics of the lip motion.
Step 20124, determining a matching relationship between the first lip movement detection result and the voice signal.
Optionally, when the first lip movement detection result indicates whether the user speaks, the speaking time included in the first lip movement detection result may be compared with the collection time of the voice signal, and if the two occur at the same time, it is determined that the first lip movement detection result is matched with the voice signal. Optionally, when the first lip movement detection result indicates the meaning of the lip movement of the user, the first lip movement detection result may be compared with the meaning of the voice signal, the similarity between the first lip movement detection result and the voice signal is determined, and if the similarity is greater than or equal to the threshold value, the first lip movement detection result is determined to be matched with the voice signal.
In response to the first lip movement detection result matching the voice signal, the initial voiceprint information is determined as first voiceprint information of the user of the target area, step 20125.
According to the embodiment, the lip movement of the user is detected to determine whether the lip movement state of the user is matched with the collected voice signal, so that whether the collected voice signal is sent by the user or not is accurately judged, and the voice control authority of the user is accurately controlled.
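A simple way to realize the time-based variant of step 20124 is to measure how much of the speech interval overlaps with detected lip movement; the interval representation and the 50% overlap threshold below are assumptions for illustration:

```python
def lip_movement_matches_speech(lip_intervals, speech_start, speech_end, min_overlap=0.5):
    """Check whether detected lip movement overlaps in time with the captured speech.

    `lip_intervals` is a list of (start, end) times during which the user's lips moved.
    """
    speech_len = max(speech_end - speech_start, 1e-6)
    overlap = 0.0
    for start, end in lip_intervals:
        overlap += max(0.0, min(end, speech_end) - max(start, speech_start))
    return overlap / speech_len >= min_overlap
```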
In some alternative implementations, as shown in fig. 5, step 202 includes:
step 2021, matching the facial feature information of the user and/or the first voiceprint information of the user included in the first user information with the user information in the user information base.
Specifically, if the first user information includes facial feature information and first voiceprint information of the user, the facial feature information of the user may be matched with facial feature information of a plurality of users stored in the user information base, and the first voiceprint information of the user may be matched with voiceprint information of a plurality of users stored in the user information base. If the first user information only includes facial feature information or first voiceprint information of the user, the facial feature information of the user may be matched with facial feature information stored in the user information base, or the first voiceprint information of the user may be matched with voiceprint information stored in the user information base.
In step 2022, responsive to the presence in the user information repository of matching user information matching the facial feature information of the user and/or the first voiceprint information of the user, target user information of the user in the target area is determined based on the matching user information and/or the first user information.
Specifically, if the user information base contains target facial feature information matching the facial feature information of the user and target voiceprint information matching the first voiceprint information of the user, the target facial feature information and the target voiceprint information may be determined as the matching user information. If only the target facial feature information or only the target voiceprint information exists in the user information base, the target facial feature information or the target voiceprint information is determined as the matching user information.
Further, if matching user information exists, the user has, before the current recognition, already registered information representing his or her identity; the stored matching user information may be used as the target user information, the first user information obtained by the current recognition may be used as the target user information, or the matching user information and the first user information may together be used as the target user information.
According to the embodiment, the matching user information matched with the first user information is determined from the user information base, so that the matching user information can be directly used when the identity of the user sending the voice control signal is judged later, and the efficiency of judging the voice control authority of the user can be improved.
In some alternative implementations, as shown in fig. 6, step 2022 includes:
in step 20221, in response to the matching user information including facial feature information of the user and excluding first voiceprint information of the user, target user information of the user in the target area is determined based on the matching user information, and voiceprint information in the matching user information is updated based on the first voiceprint information of the user.
Specifically, when the matching user information only includes facial feature information of the user, the stored matching user information can be used as target user information, and meanwhile, the first voiceprint information is stored in the user information base, so that updating of the matching user information is completed, namely, the updated matching user information includes facial feature information and voiceprint information of the user.
In step 20222, in response to the matching user information including voiceprint information of the user and not including facial feature information of the user, target user information of the user in the target area is determined based on the matching user information, and facial feature information in the matching user information is updated based on the facial feature information of the user.
Specifically, when the matching user information only includes the stored voiceprint information of the user, the stored matching user information can be used as target user information, and meanwhile, the facial feature information obtained by the recognition is stored in the user information base, so that the updating of the matching user information is completed.
According to the embodiment, when the matched user information only comprises facial feature information or voiceprint information, the matched user information is updated based on the facial feature information or the first voiceprint information obtained through the identification, so that the matched user information is timely supplemented, more types of information can be adopted when the identity of the user sending the voice control signal is judged later, and the accuracy of judging the voice control authority of the user is further improved.
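Steps 20221 and 20222 amount to filling in whichever modality the matched record still lacks; a sketch, reusing the hypothetical record fields from the earlier example, could be:

```python
def update_matching_record(record, face_features=None, voiceprint=None):
    """Supplement a matched record with the modality it is still missing."""
    if record.face_features is None and face_features is not None:
        record.face_features = face_features   # add the newly recognized facial features
    if record.voiceprint is None and voiceprint is not None:
        record.voiceprint = voiceprint         # add the newly recognized voiceprint
    return record
```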
In some alternative implementations, as shown in fig. 5, after step 2021, the method further includes:
in step 2023, in response to there being no matching user information in the user information repository that matches the facial feature information of the user or the first voiceprint information of the user, the first user information is registered in the user information repository based on the facial feature information of the user and/or the first voiceprint information of the user, and the first user information is determined as target user information.
The registered first user information may include facial feature information of the user and/or first voiceprint information of the user, i.e. the target user information is information obtained by this identification. The first user information may be used as new matching user information the next time the steps comprised in the method are performed.
The embodiment realizes timely registration of the identity of the user when the matched user information does not exist in the user information base, thereby providing a judgment basis for the subsequent judgment of the identity of the user sending the voice control signal again, and being beneficial to improving the efficiency of judging the voice control authority of the user.
In some alternative implementations, as shown in fig. 7, step 203 includes:
step 2031, determining a second sound source location and second voiceprint information of a voice control signal.
The method for determining the second sound source position may be the same as the method for determining the first sound source position, and the method for determining the second voiceprint information may be the same as the method for determining the first voiceprint information. Optionally, if the voice control signal is included in the user sensing information, the second sound source position is the same as the first sound source position, that is, the first sound source position may be directly obtained and used as the second sound source position; if the voice control signal is newly obtained after the user sensing information is obtained, the second sound source position at the moment is newly identified, and the second sound source position may be the same as or different from the first sound source position.
Step 2032, determining a voice attribute of the voice control signal based on the second sound source location and the second voice print information.
In particular, the voice attribute may include a second sound source location and second voice print information.
In this embodiment, when the voice control signal is received, the voice control signal is subjected to sound source localization and voiceprint recognition, which can be compared with the obtained target user information in many ways, so that the accuracy of judging the identity of the user sending the voice control signal is improved.
In some alternative implementations, based on the embodiment corresponding to fig. 7 and as shown in fig. 8, step 204 includes:
step 2041, determining a matching relationship between the second sound source position and the position of the target region, and a matching relationship between the second voiceprint information and the first voiceprint information.
Specifically, if it is determined that the second sound source position is located within the target area, it may be determined that the second sound source position matches the position of the target area. And the similarity of the first voiceprint information and the second voiceprint information can be determined, and if the similarity is greater than or equal to a preset similarity threshold value, the second voiceprint information is determined to be matched with the first voiceprint information.
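The similarity comparison in step 2041 could, for instance, use cosine similarity between voiceprint vectors; the 0.8 threshold below is an assumed value, not one specified by this disclosure:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def voiceprints_match(second_voiceprint, first_voiceprint, threshold=0.8):
    return cosine_similarity(second_voiceprint, first_voiceprint) >= threshold
```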
In step 2042, a second facial image sequence of the user is acquired in response to the second sound source position matching the position of the target area and the second voiceprint information matching the first voiceprint information.
The second sequence of facial images is a sequence of images containing the user's face extracted from the sequence of images currently taken on the user.
And 2043, performing lip movement detection on the second face image sequence to obtain a second lip movement detection result of the target user.
The method for detecting lip movement of the second face image sequence may be the same as the method for detecting lip movement of the first face image sequence. The second lip movement detection result may represent the same meaning type as the first lip movement detection result, i.e. may represent whether the lips of the user generate a speaking action or may represent the semantics of the lips action of the user.
Step 2044, based on the second lip movement detection result, determining a matching relationship between the voice attribute and the target user information.
Optionally, when the second lip movement detection result indicates whether the user is speaking, the speaking moment included in the second lip movement detection result may be compared with the collection moment of the voice control signal; if the two occur at the same time, the matching relationship between the voice attribute and the target user information is determined further. If the two occur at different moments, the voice control signal cannot have come from the user in the target area, and the matching relationship between the voice attribute and the target user information does not need to be determined further.

Optionally, when the second lip movement detection result indicates the semantics of the user's lip movement, the second lip movement detection result may be compared with the semantics of the voice control signal to determine their similarity; if the similarity is greater than or equal to the threshold, the matching relationship between the voice attribute and the target user information may be determined further. If the similarity is less than the threshold, the voice control signal did not come from the user in the target area, and the matching relationship between the voice attribute and the target user information does not need to be determined further.
In this embodiment, lip movement detection is performed on the user before the matching relationship between the voice attribute and the target user information is determined, so that voice control signals issued by other users in non-target areas can be filtered out accurately, which further improves the accuracy of judging the user's voice control authority.
In some alternative implementations, as shown in fig. 9, step 2044 includes:
in step 20441, a confidence level of the voice control signal is determined in response to the second lip movement detection result indicating that the user's lips are occluded.
Specifically, the confidence level of the voice control signal indicates the magnitude of the likelihood that the voice control signal is issued by the user of the target area.
In general, when lip movement detection is performed, if a complete lip image cannot be detected in the images included in the second facial image sequence, it is determined that the user's lips are occluded. In this case, the voice control signal can be recognized to obtain a voice instruction and a corresponding base confidence; alternatively, the voice attribute may include an operation target and instruction content, which may have been obtained by recognizing the voice control signal when step 203 was performed, and a base confidence corresponding to the operation target and instruction content can also be obtained. The base confidence may represent the probability that the voice instruction is consistent with the intention of the user who issued the voice control signal. The base confidence can be used directly as the confidence of the voice control signal, or it can be further adjusted in combination with the lip movement detection result and/or the voice attribute to obtain the confidence of the voice control signal.
For example, if the second lip movement detection result indicates that the voice control signal was issued by the user in the target area, a set confidence increment is added to the base confidence; if the second lip movement detection result indicates that the user's lips are occluded, the confidence is not increased, or the set confidence increment is subtracted. If the second sound source position is detected to be consistent with the target area, the set confidence increment is added to the base confidence; if it is inconsistent with the target area, the set confidence increment is subtracted. If the second voiceprint information is detected to be consistent with the first voiceprint information, the set confidence increment is added to the base confidence; if it is inconsistent, the set confidence increment is subtracted.
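The increment scheme just described might be sketched as follows; the label values and the 0.1 increment are illustrative assumptions:

```python
def adjust_confidence(base, lip_result, source_area, target_area,
                      voiceprint_consistent, increment=0.1):
    """Adjust the base confidence of a voice control signal.

    `lip_result` is assumed to be one of "speaking", "occluded", or "silent".
    """
    confidence = base
    if lip_result == "speaking":
        confidence += increment
    elif lip_result == "occluded":
        pass                         # lips hidden: neither add nor subtract
    else:
        confidence -= increment
    confidence += increment if source_area == target_area else -increment
    confidence += increment if voiceprint_consistent else -increment
    return max(0.0, min(1.0, confidence))
```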
In response to determining that the confidence level meets the preset confidence level condition, determining that the voice attribute matches the target user information, step 20442.
Optionally, a unified confidence threshold may be set, and if the confidence coefficient is greater than or equal to the confidence coefficient threshold, it may be determined that a confidence coefficient condition is satisfied, that is, it is determined that the voice attribute matches the target user information.
Alternatively, a confidence threshold may also be set separately for each voice command. The voice attribute may include a recognition result obtained by recognizing the voice control signal, where the recognition result includes an operation target and instruction content, and a corresponding confidence threshold is obtained according to the operation target and the instruction content, and if the confidence of the voice control signal is greater than the confidence threshold, it may be determined that a confidence condition is satisfied, that is, it is determined that the voice attribute matches with the target user information, and further the voice instruction is executed.
For example, a higher confidence threshold is set for voice instructions related to driving safety, and a lower confidence threshold for voice instructions unrelated to driving safety. If the voice control signal controls a device that is unrelated to driving safety (e.g., adjusting the air-conditioning temperature) or only weakly related to it (e.g., adjusting the wiper speed), the voice instruction can still be executed even if the user's lips are occluded or the voice control signal was not issued by the user in the target area (i.e., the driver).
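Per-instruction thresholds of the kind described above can be kept in a small lookup table; the entries and the default value below are illustrative only:

```python
# Hypothetical per-instruction thresholds: safety-related commands demand more confidence.
CONFIDENCE_THRESHOLDS = {
    ("navigation", "set_destination"):       0.9,  # related to driving safety
    ("wipers", "set_speed"):                 0.7,  # weakly related to driving safety
    ("air_conditioner", "set_temperature"):  0.5,  # unrelated to driving safety
}

def may_execute(operation_target, instruction_content, confidence, default_threshold=0.8):
    threshold = CONFIDENCE_THRESHOLDS.get((operation_target, instruction_content),
                                          default_threshold)
    return confidence >= threshold
```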
By setting confidence conditions and determining the confidence of the voice control signal, this embodiment can still judge, according to the actual situation of the voice control signal, whether the voice instruction can be executed when the user's lips are occluded, which improves the flexibility of voice control and the accuracy of judging the user's control authority for different voice control signals.
Exemplary apparatus
Fig. 10 is a schematic structural diagram of a voice command generating apparatus according to an exemplary embodiment of the present disclosure. The present embodiment may be applied to an electronic device, as shown in fig. 10, where the generating device of a voice command includes: the acquiring module 1001 is configured to acquire user sensing information, and determine first user information of a user located in a target area based on the user sensing information; a first determining module 1002, configured to determine target user information of a user in a target area based on the first user information and the user information base; a second determining module 1003, configured to determine a voice attribute of the voice control signal in response to receiving the voice control signal including the control instruction; a third determining module 1004, configured to determine a matching relationship between the voice attribute and the target user information; the generating module 1005 is configured to generate a voice command corresponding to the voice control signal in response to the matching relationship being a match.
In this embodiment, the obtaining module 1001 may obtain the user sensing information, and determine the first user information of the user located in the target area based on the user sensing information.
The user sensing information may be information collected by the sensing device 104 shown in fig. 1 for a user within its sensing range. For example, a camera and a microphone array may be arranged in a target space such as a vehicle cabin; an image containing the user captured by the camera and a voice signal of the user collected by the microphone array can both be used as user sensing information. In addition, devices such as an infrared human-body detector, a pressure sensor, operation buttons, and a display screen can be arranged in the target space, and user information sensed by these devices can also be used as user sensing information.

The electronic device can determine, from the user sensing information, whether there is a user in the target area. The target area may be a pre-designated area, or an area in which the user sensing information indicates that a user is present. For example, when the target space is a vehicle cabin, the target area may be the driver's seat, the front passenger seat, the rear-left window seat, the rear-right window seat, the middle of the rear row, and so on. When the target area is the driver's seat of the vehicle, the user is the user located in the driver's seat. In addition, there may be multiple target areas; when the number of target areas is greater than 1, the method may be performed for each target area, that is, the user corresponding to each target area is determined. The first user information may include information obtained by recognizing the user sensing information. For example, when the user sensing information includes an image captured by the camera, the first user information may include facial feature information of the user; generally, the facial feature information of the user can be obtained by recognizing the facial image of the user in the target area of the image. As another example, when the user sensing information includes an audio signal collected by a microphone, the first user information may include a voice signal of the user; generally, a sound source localization method may be used to extract, from the multi-channel voice signals collected by multiple microphones, the voice signal corresponding to the driver's seat as the voice signal of the user.
In addition, the first user information may further include target location information indicating where the user is located, that is, the location of the target area. The target location information may be a code identifying the target area, or the position coordinates of the user.
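As a purely illustrative sketch, and not a limitation of the embodiments above, the derivation of first user information for one target area can be organized as follows in Python. The helper callables locate_face_in_region, extract_face_features, and localize_speech are hypothetical placeholders for whatever face-recognition and sound-source-localization components an implementation actually uses.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FirstUserInfo:
    target_area: str                       # e.g. "driver_seat" (hypothetical area code)
    face_features: Optional[list] = None   # facial feature information, if a face was found
    voice_signal: Optional[list] = None    # speech samples attributed to this area

def build_first_user_info(target_area, camera_frame, mic_channels,
                          locate_face_in_region, extract_face_features,
                          localize_speech):
    """Assemble first user information for one target area.

    The three callables are injected so the sketch stays independent of
    any concrete vision or audio library; they are assumptions, not
    components named by this disclosure.
    """
    info = FirstUserInfo(target_area=target_area)

    # 1. Facial features: crop the image region corresponding to the target
    #    area (e.g. the driver's seat) and extract features if a face is seen.
    face_crop = locate_face_in_region(camera_frame, target_area)
    if face_crop is not None:
        info.face_features = extract_face_features(face_crop)

    # 2. Voice signal: use sound-source localization on the multi-channel
    #    microphone input and keep only speech coming from the target area.
    speech, source_area = localize_speech(mic_channels)
    if speech is not None and source_area == target_area:
        info.voice_signal = speech

    return info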
In this embodiment, the first determining module 1002 may determine the target user information of the user in the target area based on the first user information and the user information base.
The user information base stores information representing user identities, that is, information registered by users, such as facial image feature information and voiceprint information. As an example, the first user information may be matched against the information in the user information base; if the user information base contains information matching the first user information, the registered information of the matching user is taken as the target user information, and if no matching information exists, the first user information itself is taken as the target user information. Optionally, registration information for the user may also be created in the user information base from the first user information.
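A minimal sketch of this lookup, assuming the user information base is a simple in-memory list of dictionaries and similarity is a hypothetical scoring function, might look like this:

def resolve_target_user(first_user_info, user_info_base, similarity, threshold=0.8):
    """Return target user information for the user in the target area.

    If a registered entry matches the first user information, the registered
    entry is used; otherwise the first user information itself is used and,
    optionally, registered as a new entry. `first_user_info` is assumed to
    be a plain dict; the threshold is an arbitrary illustration.
    """
    best_entry, best_score = None, 0.0
    for entry in user_info_base:
        score = similarity(first_user_info, entry)   # e.g. face + voiceprint similarity
        if score > best_score:
            best_entry, best_score = entry, score

    if best_entry is not None and best_score >= threshold:
        return best_entry                  # registered information as target user info

    # No registered match: fall back to the first user information and
    # (optionally) create a new registration from it.
    user_info_base.append(dict(first_user_info))
    return first_user_info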
In this embodiment, the second determining module 1003 may determine a voice attribute of the voice control signal in response to receiving the voice control signal including the control instruction.
The voice control signal may be a voice signal, collected by a microphone or microphone array, uttered by any user and containing a control instruction for controlling a target device. The voice attributes may include, but are not limited to, the source position of the voice control signal, audio feature information, and voiceprint information, and may further include text information obtained by recognizing the voice control signal, where the text information may include an operation target, instruction content, and the like. Methods for determining the voice attributes may include, but are not limited to, sound source localization methods, audio feature extraction methods, and voiceprint recognition methods.
Optionally, when detecting the user sensing information, if it is determined that the user sensing information includes a voice control signal, the first user information may be generated based on the voice control signal, and a voice attribute of the voice control signal may be determined.
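The determination of voice attributes can likewise be pictured with a short, non-limiting sketch. Here locate_source, extract_voiceprint, and transcribe are hypothetical placeholders for a sound source localization method, a voiceprint extractor, and a speech recognizer, and the naive splitting of the recognized text stands in for real natural-language understanding:

def determine_voice_attributes(voice_signal, locate_source, extract_voiceprint, transcribe):
    """Derive a dictionary of voice attributes from a received control signal."""
    text = transcribe(voice_signal)              # e.g. "turn on the headlight"
    words = text.split() if text else []
    return {
        "source_position": locate_source(voice_signal),
        "voiceprint": extract_voiceprint(voice_signal),
        "text": text,
        # Crude stand-in for extracting the operation target from the text.
        "operation_target": words[-1] if words else None,
        "instruction_content": text,
    }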
In this embodiment, the third determining module 1004 may determine a matching relationship between the voice attribute and the target user information.
As an example, if the source location included in the voice attribute is consistent with the target area and the voiceprint information included in the voice attribute is consistent with the voiceprint information included in the target user information, it may be determined that the voice attribute matches the target user information.
Optionally, the voice attribute may further include text information obtained by recognizing the voice control signal. If the text information contains an operation target and instruction content that match the operation authority of the source location, that is, the user at the source location who uttered the voice control signal has authority over the operation target and is allowed to have the operation target execute the instruction content, the voice control signal is determined to be valid. The present disclosure does not specifically limit the correspondence between positions and operation authorities.
When the voice control signal is determined to be valid, the matching relationship between the voice attribute and the target user information may be determined further using the other information included in the voice attribute, or it may be directly determined that the voice attribute matches the target user information. When the voice control signal is determined to be invalid, it can be directly determined that the voice attribute does not match the target user information. By first judging the validity of the voice control signal from the text information in the voice attribute and only then determining the matching relationship between the voice attribute and the target user information, not all data in the voice attribute need to be compared, which reduces the number of matching steps and improves the efficiency of generating the voice command.
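This two-stage check (validity first, full matching only for valid signals) can be sketched as follows; authority_table, same_position, and same_voiceprint are hypothetical inputs rather than elements defined by this disclosure:

def match_voice_attribute(voice_attr, target_user_info, authority_table,
                          same_position, same_voiceprint):
    """Decide whether the voice attributes match the target user information.

    Validity is checked first from the recognized text and the source
    position's operation authority; only a valid signal goes on to the
    more expensive position / voiceprint comparison.
    """
    allowed = authority_table.get(voice_attr["source_position"], set())
    if voice_attr["operation_target"] not in allowed:
        return False          # invalid signal: no further checks, no instruction

    # Signal is valid; confirm it really comes from the target user.
    return (same_position(voice_attr["source_position"], target_user_info)
            and same_voiceprint(voice_attr["voiceprint"], target_user_info))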
In this embodiment, the generating module 1005 may generate the voice command corresponding to the voice control signal in response to the matching relationship being a match.
Specifically, the voice control signal may be converted into text, and the voice command is generated from that text. For example, if the voice control signal is converted into the text "turn on the headlight", a voice command instructing that the headlight be turned on may be generated.
Alternatively, the method described in the above alternative embodiment for judging, from the text information included in the voice attribute, whether the voice control signal is valid may be incorporated into step 205. That is, after the parts of the voice attribute other than the text information are judged to match the target user information, the text information may further be used to judge whether the voice control signal is valid; the voice command is generated only if the signal is valid, which improves the accuracy of voice command generation.
Optionally, when the method is applied to the field of vehicle control, if any person in the vehicle utters a voice control signal, the voice control signal can be recognized and its voice attribute determined. When the target area is the driver's seat, the voice attribute is determined to match the target user information, and a voice command is then generated, only if the voice control signal is recognized as having been uttered by the driver; when the target area is set to the front passenger seat (other seats are also possible), the voice attribute is determined to match the target user information, and a voice command is then generated, only if the voice control signal is recognized as having been uttered by the person in the front passenger seat.
When there is more than one target area, whether the voice attribute matches the target user information may be determined separately for each target area based on the correspondence between the target area and the user's operation authority. For example, the target areas may include the driver's seat and a rear-row window-side seat. When the user in the rear-row window-side seat utters the voice control signal "open the window beside the driver", it is determined that this user has no operation authority over the window beside the driver, that is, the voice control signal is invalid, and no control instruction is generated. When the same user utters a voice control signal to open the window beside their own seat, it is determined that the user does have operation authority over that window, that is, the voice control signal is valid, and a control instruction is then generated.
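For the in-vehicle example above, the per-seat operation authority could, purely for illustration, be expressed as a simple lookup table; the seat names and device names below are hypothetical:

# Hypothetical authority table for a cabin with several target areas.
# Keys are seat positions, values are the devices that seat may control.
AUTHORITY_TABLE = {
    "driver_seat":        {"driver_window", "headlight", "sunroof"},
    "front_passenger":    {"passenger_window", "sunroof"},
    "rear_left_window":   {"rear_left_window"},
    "rear_right_window":  {"rear_right_window"},
}

def is_signal_valid(source_seat, operation_target):
    """A rear passenger asking for the driver's window is rejected,
    while the same passenger asking for their own window is accepted."""
    return operation_target in AUTHORITY_TABLE.get(source_seat, set())

# Example:
#   is_signal_valid("rear_left_window", "driver_window")    -> False
#   is_signal_valid("rear_left_window", "rear_left_window") -> True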
Referring to fig. 11, fig. 11 is a schematic structural view of a voice command generating apparatus according to another exemplary embodiment of the present disclosure.
In some alternative implementations, the acquisition module 1001 includes: a first determining unit 10011, configured to determine, based on an image sequence included in the user sensing information, a first facial image sequence of the user located in the target area and facial feature information of the user; and/or, a second determining unit 10012, configured to determine, based on a voice signal included in the user sensing information, first voiceprint information of a user located in the target area; a third determining unit 10013 is configured to determine first user information based on facial feature information of the user and/or first voiceprint information of the user.
In some alternative implementations, the second determining unit 10012 includes: a first determining subunit 100121, configured to determine, based on the voice signal included in the user sensing information, a first sound source position of the voice signal and initial voiceprint information; a second determining subunit 100122 configured to determine a matching relationship between the first sound source position and the position of the target area; a detection subunit 100123, configured to obtain a first face image sequence of the user in response to the matching relationship between the first sound source position and the position of the target area being a match, and perform lip movement detection on the first face image sequence to obtain a first lip movement detection result of the user; a third determining subunit 100124, configured to determine a matching relationship between the first lip movement detection result and the voice signal; the fourth determination subunit 100125 is configured to determine the initial voiceprint information as the first voiceprint information of the user of the target area in response to the first lip movement detection result matching the voice signal.
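A non-limiting sketch of the lip-movement-gated voiceprint acquisition performed by these subunits is given below; all helper callables are hypothetical placeholders:

def first_voiceprint_for_area(voice_signal, target_area,
                              locate_source, extract_voiceprint,
                              get_face_sequence, detect_lip_movement,
                              lip_matches_speech):
    """The voiceprint extracted from the signal is attributed to the user in
    the target area only when (a) the sound source lies in that area and
    (b) the user's detected lip movement is consistent with the speech.
    """
    source_area = locate_source(voice_signal)
    voiceprint = extract_voiceprint(voice_signal)      # initial voiceprint information

    if source_area != target_area:
        return None                                    # someone else was speaking

    face_seq = get_face_sequence(target_area)          # first facial image sequence
    lip_result = detect_lip_movement(face_seq)         # first lip movement detection result

    if lip_matches_speech(lip_result, voice_signal):
        return voiceprint                              # accepted as first voiceprint information
    return None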
In some alternative implementations, the first determining module 1002 includes: a matching unit 10021, configured to match facial feature information of a user and/or first voiceprint information of the user, which are included in the first user information, with user information in a user information base; a fourth determining unit 10022 is configured to determine, in response to the presence of matching user information that matches facial feature information of the user and/or first voiceprint information of the user in the user information base, target user information of the user in the target area based on the matching user information and/or the first user information.
In some alternative implementations, the fourth determining unit 10022 includes: a first updating subunit 100221, configured to determine, in response to the matching user information including facial feature information of the user and not including first voiceprint information of the user, target user information of the user in the target area based on the matching user information, and update the voiceprint information in the matching user information based on the first voiceprint information of the user; a second updating subunit 100222, configured to determine, in response to the matching user information including voiceprint information of the user and not including facial feature information of the user, target user information of the user in the target area based on the matching user information, and update the facial feature information in the matching user information based on the facial feature information of the user.
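The updating behaviour of the two subunits above can be summarized in a small sketch, assuming the registered entry and the first user information are plain dictionaries with optional face_features and voiceprint keys:

def complete_registered_entry(matching_entry, first_user_info):
    """If the registered entry lacks one modality (face or voiceprint),
    fill it in from the freshly sensed first user information, then use the
    completed entry as the target user information."""
    if matching_entry.get("face_features") and not matching_entry.get("voiceprint"):
        matching_entry["voiceprint"] = first_user_info.get("voiceprint")
    elif matching_entry.get("voiceprint") and not matching_entry.get("face_features"):
        matching_entry["face_features"] = first_user_info.get("face_features")
    return matching_entry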
In some alternative implementations, the first determining module 1002 further includes: a registration unit 10023, configured to register, in response to no matching user information that matches facial feature information of the user or first voiceprint information of the user in the user information base, the first user information in the user information base based on the facial feature information of the user and/or the first voiceprint information of the user, and determine the first user information as target user information.
In some alternative implementations, the second determining module 1003 includes: a fifth determining unit 10031, configured to determine a second sound source position and second voice information of the voice control signal; a sixth determining unit 10032 is configured to determine a voice attribute of the voice control signal based on the second sound source position and the second voice information.
In some alternative implementations, the third determining module 1004 includes: a seventh determining unit 10041, configured to determine a matching relationship between the second sound source position and the position of the target area, and a matching relationship between the second voiceprint information and the first voiceprint information; an obtaining unit 10042, configured to obtain a second facial image sequence of the user in response to the second sound source position matching the position of the target area and the second voiceprint information matching the first voiceprint information; a detection unit 10043, configured to perform lip movement detection on the second facial image sequence to obtain a second lip movement detection result of the target user; an eighth determining unit 10044, configured to determine the matching relationship between the voice attribute and the target user information based on the second lip movement detection result.
In some alternative implementations, the eighth determining unit 10044 includes: a fifth determining subunit 100441, configured to determine a confidence level of the voice control signal in response to the second lip movement detection result indicating that the lip of the user is blocked; a sixth determining subunit 100442 is configured to determine that the voice attribute matches the target user information in response to determining that the confidence level satisfies the preset confidence level condition.
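The occlusion fallback can be sketched as a simple confidence threshold; both confidence_of and the threshold value are illustrative assumptions rather than values specified by this disclosure:

def match_when_lips_occluded(voice_attr, confidence_of, threshold=0.9):
    """When the lip region is occluded (e.g. by a mask), fall back to a
    confidence score computed from the remaining evidence, such as
    voiceprint similarity and sound-source position."""
    confidence = confidence_of(voice_attr)   # hypothetical scoring function
    return confidence >= threshold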
According to the voice command generating device provided by the embodiments of the present disclosure, user information is recorded in a user information base; target user information of the user in the target area is determined for user identification based on the first user information and the user information base; when a voice control signal is received, the voice attribute of the voice control signal is determined; and if the voice attribute matches the target user information, a voice command is generated. This automatically matches the source of the voice control signal, prevents the voice control authority of the target user from being taken over by other users, and reduces the risk of misrecognizing voice control signals. Meanwhile, the target user information of the user in the target area is recorded automatically, so the user does not need to actively register, which improves convenience for the user.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 12. The electronic device may be either or both of the terminal device 101 and the server 103 shown in fig. 1, or a stand-alone device independent of them that can communicate with the terminal device 101 and the server 103 to receive acquired input signals from them.
Fig. 12 shows a block diagram of an electronic device according to an embodiment of the disclosure.
As shown in fig. 12, the electronic device 1200 includes one or more processors 1201 and memory 1202.
The processor 1201 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 1200 to perform desired functions.
Memory 1202 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), hard disks, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1201 may execute the program instructions to implement the voice command generation methods of the various embodiments of the present disclosure described above and/or other desired functions. Various content, such as user sensing information, may also be stored in the computer-readable storage medium.
In one example, the electronic device 1200 may further include: an input device 1203 and an output device 1204, which are interconnected via a bus system and/or other forms of connection mechanism (not shown).
For example, when the electronic apparatus is the terminal apparatus 101 or the server 103, the input device 1203 may be a camera, a microphone, or the like for inputting an image, audio, or the like. When the electronic apparatus is a stand-alone apparatus, the input device 1203 may be a communication network connector for receiving an inputted image, audio, or the like from the terminal apparatus 101 and the server 103.
The output device 1204 may output various information to the outside, including a voice instruction. The output device 1204 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device 1200 that are relevant to the present disclosure are shown in fig. 12; components such as buses and input/output interfaces are omitted for simplicity. In addition, the electronic device 1200 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also provide a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the method of generating speech instructions of the various embodiments of the present disclosure described in the "exemplary methods" section above.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in the method of generating speech instructions of various embodiments of the present disclosure described in the "exemplary method" section above.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), optical fiber, portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but the advantages, benefits, effects, etc. mentioned in this disclosure are merely examples and are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
Various modifications and alterations to this disclosure may be made by those skilled in the art without departing from the spirit and scope of the application. Thus, the present disclosure is intended to include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (12)

1. A voice instruction generation method comprises the following steps:
acquiring user sensing information, and determining first user information of a user positioned in a target area based on the user sensing information;
determining target user information of the user in the target area based on the first user information and a user information base;
responsive to receiving a voice control signal comprising a control instruction, determining a voice attribute of the voice control signal;
determining a matching relationship between the voice attribute and the target user information;
and generating a voice command corresponding to the voice control signal in response to the matching relation as matching.
2. The method of claim 1, wherein the determining first user information for the user located in the target area based on the user-sensed information comprises:
determining a first facial image sequence of a user positioned in the target area and facial feature information of the user based on an image sequence included in the user sensing information; and/or,
determining first voiceprint information of a user positioned in the target area based on a voice signal included in the user sensing information;
the first user information is determined based on facial feature information of the user and/or first voiceprint information of the user.
3. The method of claim 2, wherein the determining the first voiceprint information of the user located in the target area based on the voice signal included in the user-sensed information comprises:
determining a first sound source position and initial voiceprint information of a voice signal based on the voice signal included in the user sensing information;
determining a matching relationship between the first sound source position and the position of the target area;
in response to the matching relationship between the first sound source position and the position of the target area being a match, acquiring a first facial image sequence of the user, and performing lip movement detection on the first facial image sequence to obtain a first lip movement detection result of the user;
determining a matching relationship between the first lip movement detection result and the voice signal;
and determining the initial voiceprint information as first voiceprint information of a user of the target area in response to the first lip movement detection result matching the voice signal.
4. The method of claim 2, wherein the determining target user information for users in the target area based on the first user information and a user information base comprises:
matching facial feature information of the user and/or first voiceprint information of the user, which are included in the first user information, with user information in the user information base;
in response to there being matching user information in the user information base that matches facial feature information of the user and/or first voiceprint information of the user, determining target user information of the user based on the matching user information and/or the first user information.
5. The method of claim 4, wherein the determining target user information for the user based on the matching user information and/or the first user information comprises:
in response to the matching user information including facial feature information of the user and excluding first voiceprint information of the user, determining target user information of the user based on the matching user information, and updating voiceprint information in the matching user information based on the first voiceprint information of the user;
And in response to the matching user information including voiceprint information of the user and not including facial feature information of the user, determining target user information of the user based on the matching user information, and updating facial feature information in the matching user information based on the facial feature information of the user.
6. The method of claim 4, wherein the method further comprises:
in response to there being no matching user information in the user information base that matches the facial feature information of the user or the first voiceprint information of the user, registering first user information in the user information base based on the facial feature information of the user and/or the first voiceprint information of the user, and determining the first user information as target user information.
7. The method of claim 2, wherein the determining the voice attribute of the voice control signal comprises:
determining a second sound source position and second voice information of the voice control signal;
based on the second sound source location and the second voice information, a voice attribute of the voice control signal is determined.
8. The method of claim 7, wherein the determining the matching relationship of the voice attribute and the target user information comprises:
determining a matching relationship between the second sound source position and the position of the target area and a matching relationship between the second voiceprint information and the first voiceprint information;
acquiring a second facial image sequence of the user in response to the second sound source position matching the position of the target area and the second voiceprint information matching the first voiceprint information;
performing lip movement detection on the second face image sequence to obtain a second lip movement detection result of the target user;
and determining the matching relation between the voice attribute and the target user information based on the second lip movement detection result.
9. The method of claim 8, wherein the determining a matching relationship of the voice attribute and the target user information based on the second lip movement detection result comprises:
determining a confidence level of the voice control signal in response to the second lip movement detection result indicating that the lip of the user is occluded;
and determining that the voice attribute is matched with the target user information in response to determining that the confidence level meets a preset confidence level condition.
10. A voice command generating apparatus comprising:
the acquisition module is used for acquiring user sensing information and determining first user information of a user positioned in a target area based on the user sensing information;
The first determining module is used for determining target user information of the user in the target area based on the first user information and the user information base;
a second determining module for determining a voice attribute of a voice control signal in response to receiving the voice control signal including a control instruction;
a third determining module, configured to determine a matching relationship between the voice attribute and the target user information;
and the generating module is used for responding to the matching relation as matching and generating a voice instruction corresponding to the voice control signal.
11. A computer readable storage medium storing a computer program for execution by a processor to implement the method of any one of claims 1-9.
12. An electronic device, the electronic device comprising:
a processor;
a memory for storing executable instructions of the processor;
the processor being configured to read the executable instructions from the memory and execute the instructions to implement the method of any of the preceding claims 1-9.
CN202311005351.7A 2023-08-09 2023-08-09 Voice instruction generation method and device, readable storage medium and electronic equipment Pending CN116978379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311005351.7A CN116978379A (en) 2023-08-09 2023-08-09 Voice instruction generation method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311005351.7A CN116978379A (en) 2023-08-09 2023-08-09 Voice instruction generation method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116978379A true CN116978379A (en) 2023-10-31

Family

ID=88481334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311005351.7A Pending CN116978379A (en) 2023-08-09 2023-08-09 Voice instruction generation method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116978379A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination