Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. It should be further noted that, for convenience of description, only some structures relating to the embodiments of the present invention, not all structures, are shown in the drawings.
Example one
Fig. 1A is a flowchart of a voice-based control method according to a first embodiment of the present invention. The method is suitable for accurately determining a user's control instruction to a terminal, and is particularly suitable for doing so when external interference such as multiple voices exists inside a vehicle whose seats are fixed. The method can be executed by a voice-based control device provided by the embodiment of the invention, and the device can be implemented in software and/or hardware. Referring to fig. 1A, the method specifically includes:
S110, collecting at least two voice signals.
The voice signal is a signal containing a voice instruction of a user, and can be acquired by acquisition equipment such as a microphone. For example, a pre-constructed voice acquisition system can be used to acquire the at least two voice signals; optionally, the voice acquisition system consists of a plurality of microphones or a microphone array.
In an environment where the vehicle seats are fixed, the intention of any person in the vehicle, for example the driver or a passenger, can be accurately recognized even when interference such as other in-vehicle voices exists, that is, when there are multiple voice signals. The voice acquisition system may be constructed according to the seat configuration in the vehicle; optionally, it includes at least two pairs of dual-microphone units, each pair composed of two microphones, and the position of each pair is determined according to the position of its corresponding sounding point.
Two microphones are regarded as one pair of dual-microphone units, and a sounding point is the mouth of a person in the vehicle. The sounding point lies on the perpendicular bisector of the line connecting the two microphones; that is, the perpendicular bisector of the line between the two microphones of each pair of dual-microphone units passes through the sounding point, and each sounding point corresponds to one pair of dual-microphone units.
For example, the position of each pair of dual-microphone units may be determined by a user or by a position determination model, where the position determination model is a pre-trained model for determining the position of each pair of dual-microphone units: the position of the sounding point, a preset installation plane and the position of a center point are input into the model, and the model outputs the installation position of the pair of dual-microphone units according to its parameters.
A. Determining a projection point of the sounding point on the installation plane according to the position of the sounding point and a preset installation plane.
The preset installation plane refers to a preset plane for installing microphones, such as a center console. Owing to the structure of the seats in the vehicle, different sounding points may correspond to different installation planes or to the same installation plane. In addition, because persons of different heights have differently positioned sounding points, the microphone position would otherwise not be fixed; therefore, to fix the microphones, the position of the sounding point is set using a standard height or an average height within a controllable range such as 3-5 degrees.
For example, referring to fig. 1B, since both the driver and the co-driver are located in the front row of the vehicle, a dual-microphone unit may be provided for each of them in the same installation plane. Specifically, the two sounding point positions are the position S1 of the driver's mouth and the position S2 of the co-driver's mouth, and the installation plane is M1. In the vertical plane, a perpendicular is drawn from the position of the sounding point to the installation plane; the intersection of this perpendicular with the installation plane is the projection point of the sounding point on the installation plane. For example, in fig. 1B, a perpendicular from point S1 to the installation plane M1 intersects M1 at the projection point S1′, and a perpendicular from point S2 to M1 intersects M1 at the projection point S2′.
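The projection step above can be sketched in a few lines. This is an illustrative sketch rather than part of the patent: the function name and sample coordinates are assumptions, and the mounting plane is described generically by a point on it and its unit normal.

```python
import numpy as np

def project_point_onto_plane(point, plane_point, plane_normal):
    """Orthogonal projection of `point` onto the plane through `plane_point`
    with normal `plane_normal`: the foot of the perpendicular dropped from
    the sounding point, i.e. the projection point S' described in the text."""
    p = np.asarray(point, dtype=float)
    q = np.asarray(plane_point, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)           # ensure unit normal
    return p - np.dot(p - q, n) * n

# Hypothetical example: a mouth position S1 above a horizontal plane z = 0.9
s1_proj = project_point_onto_plane((0.4, -0.3, 1.2), (0.0, 0.0, 0.9), (0.0, 0.0, 1.0))
```

For a horizontal installation plane the projection simply replaces the height coordinate, but the general form also covers tilted planes such as an angled console.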
B. Determining the installation position of the dual-microphone unit corresponding to the sounding point according to a first distance between the position of the projection point and the position of the center point and a linear relation between the first distance and a second distance.
The position of the center point is preset according to the sounding point, and each sounding point corresponds to one center point; specifically, the center point is the position on the installation plane directly facing the person in the vehicle. For example, in fig. 1B, the center point corresponding to S1 is the position O1 on the installation plane directly facing S1, and the center point corresponding to S2 is the position O2 on the installation plane directly facing S2; the three microphones are MIC1, MIC0 and MIC2.
For each microphone, the distance between the position of the microphone and the position of the center point is the second distance, for example the distance between MIC1 and O1 in fig. 1B. The distance between the position of the projection point and the position of the center point is the first distance, for example the distance between S1′ and O1 in fig. 1B. Optionally, the first distance is 50 times the second distance; for example, in fig. 1B, the distance S1′O1 between S1′ and O1 is 50 times the distance between MIC1 and O1.
Specifically, after the projection point corresponding to the sound production point is determined and the position of the central point is predetermined according to the sound production point, the position of each microphone in each pair of double-microphone units can be uniquely determined according to the linear relationship between the first distance between the position of the projection point and the position of the central point and the second distance between the position of the microphone and the position of the central point.
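Under the stated 50-to-1 linear relation, the placement of one microphone pair can be sketched as below. The function name, the caller-supplied in-plane `axis` (chosen so that the pair's perpendicular bisector passes through the center point and hence the sounding point) and the default ratio are illustrative assumptions.

```python
import numpy as np

def dual_mic_positions(projection_point, center_point, axis, ratio=50.0):
    """Place the two microphones of a pair symmetrically about the center
    point, along the in-plane direction `axis`, so that each mic-to-center
    distance (the second distance) is 1/ratio of the projection-to-center
    distance (the first distance), per the 50x relation in the text."""
    s = np.asarray(projection_point, dtype=float)
    o = np.asarray(center_point, dtype=float)
    u = np.asarray(axis, dtype=float)
    u = u / np.linalg.norm(u)                    # unit direction in the plane
    d2 = np.linalg.norm(s - o) / ratio           # second distance
    return o + d2 * u, o - d2 * u                # the two microphone positions

# Hypothetical example: projection point 1 m from the center point
mic_a, mic_b = dual_mic_positions((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Because the two microphones are equidistant from the center point, the sounding point stays on the perpendicular bisector of the pair, as required in step A.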
It should be noted that, normally, one sounding point corresponds to one pair of dual-microphone units. If the installation positions of the two pairs corresponding to two sounding points overlap, a shared microphone may be used to construct the voice acquisition system. As shown in fig. 1B, three microphones form two pairs of dual-microphone units: MIC1 and MIC0 corresponding to the driver, and MIC2 and MIC0 corresponding to the co-driver.
For example, the number of pairs may equal the number of seats, arranged according to the seat configuration in the vehicle. For a five-seat vehicle, two pairs of dual microphones can be arranged on the vehicle console to correspond to the driver and the co-driver respectively, and three pairs can be arranged behind the front seats to correspond to the three persons in the rear row respectively.
It should be noted that, with the voice acquisition system configured in this construction manner, the voice signals acquired by the pair of dual-microphone units are one voice signal, that is, the voices acquired by the two microphones corresponding to the pair of dual microphones are synthesized into one voice signal. If a common voice acquisition system is adopted, the voice acquired by one microphone is a voice signal.
And S120, determining a target control instruction for the terminal according to the time for acquiring the at least two voice signals and the matching degree of the at least two voice signals and the current scene.
The matching degree refers to the degree of correlation between the semantics of the voice signal and the current environment, and can be determined by performing semantic analysis on the text content of the voice signal. For example, if the current scene is driving, voice control of the vehicle's operating system, such as opening navigation, closing the air conditioner or opening a window, is related to the driving scene, while other chat voices are not. The terminal is an intelligent device; optionally, in this embodiment, the terminal is a vehicle-mounted terminal. The target control instruction refers to a voice instruction capable of controlling the terminal to execute a series of operations.
Specifically, voice signals irrelevant to the current scene can be eliminated according to the matching degrees of the at least two voice signals; the remaining voice signals are sorted by acquisition time, and the earliest-ranked voice signal is selected; a corresponding control instruction is obtained from that voice signal; and the user's intention is executed according to the control instruction.
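The eliminate-then-sort procedure above can be sketched as a few lines of Python. The dictionary field names and the relevance threshold are hypothetical, since the text does not fix a concrete data format.

```python
def select_target_instruction(signals, relevance_threshold=0.5):
    """Pick the control instruction to execute from several captured signals.
    Each signal is assumed to be a dict with 'time' (acquisition time),
    'match' (degree of match with the current scene, 0..1) and 'instruction'.
    Scene-irrelevant signals are discarded, then the earliest one wins."""
    relevant = [s for s in signals if s["match"] >= relevance_threshold]
    if not relevant:
        return None                      # nothing matches the scene: do nothing
    earliest = min(relevant, key=lambda s: s["time"])
    return earliest["instruction"]

# Hypothetical example: earlier chat is dropped, later scene-relevant command wins
signals = [
    {"time": 1.0, "match": 0.1, "instruction": "chat"},
    {"time": 2.0, "match": 0.9, "instruction": "open_navigation"},
]
```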
In the technical scheme provided by this embodiment, the control instruction for the terminal is determined, for each voice signal acquired by the voice acquisition system, according to its acquisition time and its degree of match with the current scene. The user's control instruction to the terminal can thus be determined accurately even when multiple voice signals and external interference exist, which improves the user experience.
Example two
Fig. 2 is a flowchart of a voice-based control method according to a second embodiment of the present invention, which is further optimized based on the first embodiment. Referring to fig. 2, the method specifically includes:
S210, collecting at least two voice signals.
The collecting of the at least two voice signals may be collecting the at least two voice signals by using a voice collecting system, where the voice collecting system includes at least two pairs of dual-microphone units formed by two microphones, and the position of each pair of dual-microphone units is determined according to the position of the sounding point.
S220, processing the at least two voice signals by adopting a preset rule to obtain the text content corresponding to the target signal in each of the at least two voice signals and the start time of each target signal.
The preset rule is a preset rule for processing the voice signals collected by each pair of double microphones, and the rule can be used for separating the voice signals, keeping the voices in a specific angle range, inhibiting the voices in other ranges and performing voice recognition on the separated voice signals.
The target signal is a voice signal, namely a human voice part, obtained by processing each voice signal by adopting a preset rule and removing a non-voice signal part. Correspondingly, the starting time of the target signal is the starting time of the human voice part; the text content corresponding to the target signal is obtained by converting a voice signal corresponding to the human voice part into a text. Optionally, the text content may be a text itself corresponding to the target signal, or may be a keyword or the like.
Specifically, the voice signals collected by each pair of the double-microphone units are processed by adopting a preset rule, so that the target signals of the voice signals, the text content corresponding to the target signals and the starting time of the target signals can be obtained.
And S230, inputting each text content into the semantic understanding engine to obtain the matching degree of each target signal and the current scene.
The semantic understanding engine can be a pre-trained semantic analysis model, and can perform processing such as sentence segmentation, word segmentation, keyword extraction and semantic analysis on the input text content. The matching degree refers to the degree of correlation between the semantics of the target signal and the current environment, and can be determined by performing semantic analysis on the text content of the target signal. For example, if the current scene is driving, voice control of the vehicle's operating system, such as opening navigation, closing the air conditioner or opening a window, is related to the driving scene, while other chat voices are not. If the text content is a keyword, the semantic understanding engine directly performs semantic analysis on the keyword to determine the matching degree with the current scene.
The matching degree can be determined by judging the priority or the number of the control commands contained in the target signal. For example, a control command corresponding to the open navigation is prioritized over a control command for opening a window. If the control instruction contained in the target signal A is to open the navigation and the control instruction contained in the target signal B is to open the window, the matching degree of the target signal A and the current scene is higher than that of the target signal B and the current scene.
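A minimal sketch of priority-based matching follows, assuming a hypothetical priority table in which navigation outranks window control as in the example above; the command names and numeric priorities are illustrative only.

```python
# Hypothetical priority table; a higher value means a higher priority.
COMMAND_PRIORITY = {
    "open_navigation": 3,
    "close_air_conditioner": 2,
    "open_window": 1,
}

def matching_degree(commands):
    """Score a target signal by the highest-priority recognized control
    command it contains; unrecognized (scene-irrelevant) commands score 0."""
    return max((COMMAND_PRIORITY.get(c, 0) for c in commands), default=0)
```

With this table, a target signal containing "open_navigation" scores higher than one containing only "open_window", matching the comparison of target signals A and B above.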
S240, determining a target control instruction for the terminal according to each matching degree and the starting time of each target signal.
Specifically, if several target signals exist at the same time, the target signals irrelevant to the current scene can be removed according to the matching degree of each target signal; the remaining target signals are sorted by start time, and the earliest-ranked target signal is selected; a corresponding control instruction is obtained from the text content of that target signal; and the user's intention is executed according to the control instruction.
For example, as shown in fig. 1B, after the two pairs of dual-microphone units in the directions of the driver and the co-driver collect voice signals, and the two voice signals are processed to obtain the corresponding target signals and their start times, the semantic understanding engine determines which direction's target signal matches the current scene more closely, and the target signal in that direction is processed preferentially.
For example, if the start time of one target signal is prior, but the semantic understanding engine determines that the target signal is not related to the current scene, other target signals may be processed.
Optionally, if only one target signal exists and the semantic understanding engine determines that the target signal is related to the current scene, a corresponding control instruction may be obtained according to text content corresponding to the target signal; and executing the user intention according to the control instruction. If only one target signal exists and the semantic understanding engine judges that the target signal is irrelevant to the current scene, no processing is performed.
According to the technical scheme provided by the embodiment of the invention, the voice signals acquired by the voice acquisition system are processed by adopting a preset rule to obtain the text content corresponding to the target signal in each voice signal and the starting time of each target signal; the text content is input into a semantic understanding engine to obtain the matching degree of each target signal and the current scene, and a target control instruction to the terminal is determined according to the starting time and the matching degree of each target signal.
Example three
Fig. 3 is a flowchart of a voice-based control method according to a third embodiment of the present invention, and this embodiment is further optimized based on the foregoing embodiments. Referring to fig. 3, the method specifically includes:
S310, collecting at least two voice signals.
And S320, processing the at least two voice signals by adopting a beam forming algorithm to obtain a preliminary voice signal corresponding to each voice signal in the at least two voice signals.
The beamforming algorithm is a method for reducing the dimension of a signal or acquiring a signal in a specific range, and is also a method for separating signals: it can retain voice within a specific angle range, such as within 10 degrees of the perpendicular bisector corresponding to each pair of dual-microphone units, and suppress voice in other ranges. The preliminary voice signal is the voice signal obtained by processing the acquired voice signal with the beamforming algorithm.
Specifically, for the voice signal acquired by each pair of dual microphones, a beamforming algorithm is adopted to obtain a corresponding preliminary voice signal. If the voice acquisition system is constructed in the manner shown in fig. 1B, the beamforming algorithm yields a preliminary voice signal containing the voice of the driver or of the co-driver; that is, the position of the voice source is known from the pair that produced it.
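A minimal broadside delay-and-sum beamformer illustrates the idea for one dual-microphone pair. This is a sketch of the general technique, not the patent's specific algorithm: because the sounding point lies on the perpendicular bisector of the pair, its voice reaches both microphones simultaneously, so averaging the channels reinforces it while off-axis sources, which arrive with a relative delay, are attenuated.

```python
import numpy as np

def broadside_delay_and_sum(x_left, x_right):
    """Sum the two channels of one dual-microphone pair with equal weights.
    An on-axis source (on the perpendicular bisector) adds coherently; an
    off-axis source is partially cancelled by its inter-channel delay.
    Real systems add steering delays, fractional-delay filters and
    adaptive weighting on top of this."""
    x_left = np.asarray(x_left, dtype=float)
    x_right = np.asarray(x_right, dtype=float)
    return 0.5 * (x_left + x_right)
```

An on-axis signal passes through unchanged; a signal arriving in antiphase at the two microphones cancels completely, which is the extreme case of off-axis suppression.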
S330, voice endpoint detection is carried out on each preliminary voice signal to obtain a target signal corresponding to each preliminary voice signal and the starting time of each target signal.
The voice endpoint detection means that the starting point and the ending point of various paragraphs in the voice signal are accurately judged from background noise and environmental noise of the input voice signal. In other words, in a signal stream under a complex application environment, a speech signal and a non-speech signal are distinguished, and the beginning and the end of the speech signal are determined.
The target signal is a voice signal obtained by removing a non-voice signal part from the preliminary voice signal, namely a human voice part. Correspondingly, the starting time of the target signal is the starting time of the voice part.
Specifically, for each pair of preliminary voice signals corresponding to the voice signals collected by the twin microphones, voice endpoint detection is performed to obtain target signals corresponding to the preliminary voice signals and start times of the target signals.
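A toy energy-based detector illustrates the endpoint-detection step; the frame length and threshold are illustrative assumptions, and production systems use far more robust features than raw frame energy.

```python
import numpy as np

def detect_endpoints(signal, frame_len=400, threshold=0.01):
    """Toy voice endpoint detection: mark frames whose mean squared
    amplitude exceeds `threshold` as speech, and return the sample indices
    of the start and end of the speech region (None if no speech found).
    The start index corresponds to the target signal's start time."""
    n_frames = len(signal) // frame_len
    voiced = [
        i for i in range(n_frames)
        if np.mean(np.square(signal[i * frame_len:(i + 1) * frame_len])) > threshold
    ]
    if not voiced:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
```

Applied to each preliminary voice signal, the returned start index (divided by the sampling rate) gives the start time used for ranking the target signals in step S360.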
And S340, performing voice recognition on each target signal to obtain text content corresponding to each target signal.
The text content corresponding to the target signal is obtained by converting a voice signal corresponding to the human voice part into a text. Specifically, a speech recognition technology may be used to recognize each target signal, so as to obtain text content corresponding to each target signal. Optionally, the text content may be a text itself corresponding to the target signal, and may also be a keyword or the like.
And S350, inputting each text content into the semantic understanding engine to obtain the matching degree of each target signal and the current scene.
And S360, determining a target control instruction for the terminal according to each matching degree and the starting time of each target signal.
According to the technical scheme provided by the embodiment of the invention, each voice signal acquired by a voice acquisition system is processed by adopting a beam forming algorithm and voice endpoint detection to obtain text content corresponding to a target signal in each voice signal and the starting time of each target signal; the text content is input into a semantic understanding engine to obtain the matching degree of each target signal and the current scene, and a control instruction to the terminal is determined according to the starting time and the matching degree of each target signal.
Example four
Fig. 4 is a block diagram of a voice-based control apparatus according to a fourth embodiment of the present invention. The apparatus can execute the voice-based control method of any embodiment of the present invention, and includes functional modules corresponding to that method, with the same beneficial effects. As shown in fig. 4, the apparatus may include:
an acquisition module 410, configured to acquire at least two voice signals;
and the target instruction determining module 420 is configured to determine a target control instruction for the terminal according to the time for acquiring the at least two voice signals and the matching degree between the at least two voice signals and the current scene.
According to the technical scheme provided by the embodiment of the invention, the control instruction of the terminal is determined according to the acquisition time and the matching with the current scene for each voice signal acquired by the voice acquisition system, and the control instruction of the user on the terminal can be accurately determined under the condition that multiple paths of voice signals and external environment interference exist, so that the user experience is improved.
For example, the target instruction determining module may include:
the signal time determining unit is used for processing the at least two voice signals by adopting a preset rule to obtain text contents corresponding to target signals in the voice signals in the at least two voice signals and the starting time of the target signals;
the matching degree determining unit is used for inputting each text content to the semantic understanding engine to obtain the matching degree of each target signal and the current scene;
and the target instruction determining unit is used for determining a target control instruction for the terminal according to the matching degree and the starting time of each target signal.
Illustratively, the signal time determination unit is specifically configured to:
processing the at least two voice signals by adopting a beam forming algorithm to obtain a preliminary voice signal corresponding to each voice signal in the at least two voice signals;
performing voice endpoint detection on each preliminary voice signal to obtain a target signal corresponding to each preliminary voice signal and the starting time of each target signal;
and carrying out voice recognition on each target signal to obtain text content corresponding to each target signal.
Optionally, the acquisition module 410 may be configured to: the voice acquisition system is used for acquiring at least two voice signals, and it should be noted that the voice acquisition system in this embodiment includes at least two pairs of dual microphone units formed by two microphones, and the position of each pair of dual microphone units is determined according to the position of the corresponding sounding point.
Illustratively, the position of each pair of twin units is determined by:
determining a projection point of the sounding point on the installation plane according to the position of the sounding point and a preset installation plane;
and determining the installation position of the double-microphone unit corresponding to the sounding point according to the distance between the position of the projection point and the position of the central point and the linear relation between the first distance and the second distance, wherein the second distance is the distance between the position of the microphone and the position of the central point, and the position of the central point is preset according to the sounding point.
Example five
Fig. 5 is a schematic structural diagram of an apparatus provided in the fifth embodiment of the present invention, and fig. 5 shows a block diagram of an exemplary apparatus suitable for implementing the embodiment of the present invention. The device 12 shown in fig. 5 is only an example and should not impose any limitation on the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, device 12 is in the form of a general purpose computing device. The components of device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including but not limited to an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments described herein.
Device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with device 12, and/or with any devices (e.g., network card, modem, etc.) that enable device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the device 12 over the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, such as implementing a voice-based control method provided by an embodiment of the present invention, by running a program stored in the system memory 28.
Example six
A sixth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program (or referred to as computer-executable instructions) is stored, where the computer program, when executed by a processor, can implement the voice-based control method according to any of the above embodiments.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of embodiments of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing is only a description of exemplary embodiments of the invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the embodiments of the present invention have been described in detail above, the invention is not limited to the above embodiments and may include many other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the scope of the appended claims.