CN110188179B - Voice directional recognition interaction method, device, equipment and medium - Google Patents

Voice directional recognition interaction method, device, equipment and medium

Info

Publication number
CN110188179B
CN110188179B (application CN201910466749.8A)
Authority
CN
China
Prior art keywords
face
voice
image
angle
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910466749.8A
Other languages
Chinese (zh)
Other versions
CN110188179A (en)
Inventor
嵇望
汪斌
林达
李林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xiaoyuan Robot Co ltd
Original Assignee
Zhejiang Utry Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Utry Information Technology Co ltd filed Critical Zhejiang Utry Information Technology Co ltd
Priority to CN201910466749.8A
Publication of CN110188179A
Application granted
Publication of CN110188179B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Abstract

The present application relates to the field of man-machine voice interaction and discloses a voice directional recognition interaction method comprising the following steps: directionally picking up the sound signal directly in front and recognizing it to obtain voice text content; acquiring a face image that simultaneously satisfies an image acquisition angle and an acquisition distance; and judging whether to reply according to the voice text content and the face image, wherein the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m. The application also discloses a voice directional recognition interaction device, an electronic device and a computer storage medium. The method conforms to daily communication habits, effectively excludes the voices of bystanders and of the environment, and realizes effective anthropomorphic communication with the user interacting directly in front.

Description

Voice directional recognition interaction method, device, equipment and medium
Technical Field
The invention relates to the field of man-machine voice interaction, and in particular to a voice directional recognition interaction method, device, equipment and medium.
Background
At present, robots and voice assistants are generally deployed in complex environments, such as noisy conference rooms, outdoor spaces and shopping malls, where problems of noise, reverberation, voice interference and echo arise. During man-machine voice interaction, the array microphone used for sound reception picks up sound from the full 360-degree surroundings. To avoid misrecognition of environmental sound, voice interaction systems adopt a "wake-up word" technique: in practice, the robot or voice assistant recognizes speech content only after it receives the wake-up word, and otherwise performs no recognition.
The wake-up word technique is currently the main trigger mechanism for human-computer interaction with mainstream robots and smart devices. It has two drawbacks. First, the person initiating the interaction must learn to use the wake-up word; on encountering a robot in an unfamiliar place, the person has no way of knowing which word wakes it up, and without the wake-up word cannot communicate with it at all. Second, the user must speak the wake-up word before every exchange with the robot, which mechanizes the interaction and disturbs its rhythm; users forget to say the wake-up word or have to repeat it constantly, so a speaker may finish a long utterance only to find the robot was not listening.
Generally a user interacts with a robot by standing directly in front of it. However, because the omnidirectional array microphone receives sound from all directions, the voices of bystanders and of the environment are mixed into the received signal; human voices and noise behind or around the robot are also collected and recognized. This reduces speech recognition accuracy, and even when recognition is correct the robot may respond mistakenly, so effective communication with the user interacting in front cannot be achieved.
To address these problems, Chinese patent CN105204628A discloses a voice control method based on visual wake-up: after receiving at least part of a voice signal, the voice control device starts its image receiving unit, which acquires images and passes them to an image recognition unit; speech recognition is performed only when a human face whose line of sight faces the device is recognized. However, that patent still does not resolve interference from environmental noise: when multiple sound sources are present within the 360-degree range of the device, for example when the image receiving unit recognizes a face while the device simultaneously receives several surrounding voice signals, the recognition result is still disturbed by external environmental noise.
Disclosure of Invention
In order to overcome the defects of the prior art, one objective of the present invention is to provide a voice directional recognition interaction method that combines a face image with a voice signal to determine the specific interaction object and then makes a targeted reply, thereby conforming to daily communication habits.
One of the purposes of the invention is realized by adopting the following technical scheme:
a voice directional recognition interaction method is characterized by comprising the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply or not according to the voice text content and the face image;
the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m, and the voice text content is acquired as follows: the sound signal directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed.
Further, the face image is acquired as follows: features are extracted from the collected image data, and a face detection algorithm judges whether the image contains a face; if not, the image data is not processed. If the image contains a face, a face angle estimation algorithm and a face distance estimation algorithm calculate the 3D angle information and distance information of the face in the image; if these satisfy the conditions, the image data is retained as the face image, and otherwise it is not collected.
Further, when the voice text content and the face image are acquired simultaneously, a reply is made to the voice text content, otherwise, no reply is made.
Further, the face angle estimation algorithm adopts a face detection algorithm based on a convolutional neural network, and comprises the following steps:
establishing a face picture library, extracting and analyzing the features of the face picture library, extracting the forms and positions of the five sense organs, and counting to obtain a statistical analysis result;
training the statistical analysis result with a deep convolutional neural network method to establish a part classifier; the characteristic parts of the face in the image data are scored with this part classifier, rule analysis is then performed on the scores of the characteristic parts to obtain face candidate regions, and finally a boundary regression algorithm is combined to obtain the final face detection result.
Furthermore, the face angle estimation algorithm uses the LVQ algorithm to train in advance a model of 90 face angles in the lens, and obtains the 3D angle information of the face by matching the input eye features of the face to the corresponding angle.
Further, after the directional pickup of the sound signal, signal enhancement is performed by adopting a generalized sidelobe canceller algorithm, specifically: the energy normalization of the sound signal is carried out, then a forward voice reference signal on a main lobe is generated through a fixed beam former, a noise reference signal is generated through a side lobe canceller, and finally a noise component on the main lobe signal is eliminated through a noise canceller.
The second purpose of the present invention is to provide a voice directional recognition interactive device, which is realized by adopting the following technical scheme:
speech orientation recognition interaction device, comprising:
the voice pickup equipment is used for directionally picking up the sound signal right in front and carrying out voice recognition to obtain voice text content;
the image acquisition equipment is provided with an image acquisition angle and an acquisition distance in advance and acquires a human face image which meets the image acquisition angle and the acquisition distance simultaneously;
and the processing unit is used for acquiring the voice text content and the face image and judging whether to reply or not.
Further, the sound reception range of the directional pickup of the voice pickup device is as follows: the sound receiving angle is 60-70 degrees, and the sound receiving distance is less than or equal to 1 m.
It is a further object of the invention to provide an electronic device comprising a processor, a storage medium and a computer program stored in the storage medium, which computer program, when executed by the processor, performs the above-mentioned speech orientation recognition interaction method.
It is a further object of the present invention to provide a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above voice directional recognition interaction method.
Compared with the prior art, the invention has the beneficial effects that:
sound signals from the front are directionally recognized: the angle and distance of the picked-up sound signals are limited and the picked-up signals are enhanced, so that interference from environmental noise is eliminated and the recognition effect is no longer degraded by picking up multiple surrounding sound signals. The angle and distance at which the image acquisition device collects the face image are likewise limited, and a corresponding reply is made only when a voice signal and a face image within the specified range and distance are acquired simultaneously. This matches the way people communicate daily, facilitates effective communication, and improves the anthropomorphic effect of man-machine communication.
Drawings
FIG. 1 is a flow chart of a voice directional recognition interaction method according to an embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a speech orientation recognition interaction device according to embodiment 2 of the present invention;
fig. 3 is a block diagram of the electronic device according to embodiment 3 of the present invention.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings, in which the description of the invention is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form other embodiments not shown in the following description.
Example 1
The voice directional recognition interaction method directionally obtains a voice signal and a face image meeting the requirements, and performs voice interaction on that basis. As shown in fig. 1, the method comprises the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply according to the acquired voice text content and the acquired face image;
the image acquisition angle is 60-70 degrees, and the acquisition distance is less than or equal to 1 m.
The voice text content is acquired as follows: the sound signal directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed.
A reply is made according to the voice text content only when the voice text content and the face image are acquired simultaneously; otherwise, no reply is made.
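As a minimal sketch of this decision rule (the function name and argument types are illustrative, not from the patent):

```python
def should_reply(voice_text, face_image):
    """Reply only when non-empty voice text content and a qualifying
    face image have been acquired at the same time; otherwise stay
    silent. A sketch of the gating rule, not the patent's actual
    implementation."""
    return bool(voice_text) and face_image is not None
```

Either input missing means the device treats the sound as not addressed to it and gives no answer.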
In this embodiment, a face detection algorithm, a face angle estimation algorithm and a distance detection algorithm determine whether a face appears within the 60-70 degree angle range and within the 1 m distance range. The face image is collected as follows: features are first extracted from the collected image data, and the face detection algorithm judges whether the image contains a face; if not, the image data is not processed.
If the image contains a face, the face angle estimation algorithm and the distance detection algorithm calculate the 3D angle information and distance information of the face in the image; if these satisfy the conditions, the image data is retained as the face image, and if not, no processing is performed.
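The angle and distance check can be sketched as follows (a minimal illustration; the function and parameter names are not from the patent, and reading the 60-70 degree window as an inclusive range is an assumption):

```python
def accept_face(face_angle_deg, distance_m,
                angle_min=60.0, angle_max=70.0, max_distance_m=1.0):
    """Retain the image data as a face image only when the estimated
    face angle lies within the 60-70 degree acquisition window and the
    estimated distance does not exceed 1 m (thresholds taken from the
    text; the gating logic itself is an illustrative sketch)."""
    return angle_min <= face_angle_deg <= angle_max and distance_m <= max_distance_m
```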
The face detection algorithm is based on a convolutional neural network method, and specifically comprises the following steps:
establishing a face picture library, extracting and analyzing the features of the face picture library, extracting the forms and positions of the five sense organs, and counting to obtain a statistical analysis result;
training the statistical analysis result with a deep convolutional neural network method to establish a part classifier; the characteristic parts of the face in the image data are scored with this part classifier, rule analysis is then performed on the scores of the characteristic parts to obtain face candidate regions, and finally a boundary regression algorithm is combined to obtain the final face detection result.
After a face detection result is obtained, a face head portrait is extracted from original image data, head portrait characteristics are extracted from the face head portrait, and 3D angle information of the face head portrait is obtained by calculating the characteristics through a face angle estimation algorithm.
The face angle estimation algorithm of the present invention adopts a forward neural network based on learning vector quantization (LVQ). First, a set of images with different face angles is prepared: the images come from 100 different persons, 90 images per person, with the face angle sweeping from left to right in sequence, one face image per 1-degree step.
Feature vectors describing the eye positions in the pictures are extracted as the input of the LVQ neural network, and the 90 angles are represented by 1, 2, 3, ..., 90 respectively. By training on the images of the training set, a network with predictive capability is obtained, which can judge the angle of any given face image.
The eye-position feature vector is extracted as follows: the 9000 collected images are preprocessed, the face region around the eyes is cropped to 320 × 360, and each cropped image is named in the format "person number _ face angle" and converted to a binary gray-scale image. The image is then divided into 6 rows and 8 columns; the 8 sub-matrices of the 2nd row describe the position information of the eyes. After edge detection with the Sobel operator, the number of pixels with value "1" in these 8 sub-matrices relates directly to the face angle, so only the "1" pixels in the 8 sub-matrices of the 2nd row are counted.
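A rough sketch of this feature extraction, under the assumption that the binarized edge image is stored as a row-first NumPy array (the function name and orientation convention are illustrative):

```python
import numpy as np

def eye_feature_vector(edge_img):
    """Divide the binary edge image (e.g. 360 rows x 320 columns) into
    a 6-row, 8-column grid and count the pixels with value 1 in each of
    the 8 sub-matrices of the 2nd grid row, giving the 8-dimensional
    eye-position feature described in the text."""
    h, w = edge_img.shape
    rh, cw = h // 6, w // 8        # sub-matrix height and width
    r = 1                          # the 2nd grid row, zero-indexed
    return np.array([int(edge_img[r*rh:(r+1)*rh, c*cw:(c+1)*cw].sum())
                     for c in range(8)])
```

Stacking these 8-dimensional vectors over the sample images would then form the input matrix fed to the LVQ network.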
Edge detection uses the Sobel operator, a set of directional operators that detect edges along different directions. The Sobel operator strengthens the weights of the pixels above, below, left and right of the central pixel, and its output is an edge image. The operator is computed according to formulas (1), (2) and (3):
f′x(x,y)=f(x-1,y+1)+2f(x,y+1)+f(x+1,y+1)-f(x-1,y-1)-2f(x,y-1)-f(x+1,y-1) (1)
f′y(x,y)=f(x-1,y-1)+2f(x-1,y)+f(x-1,y+1)-f(x+1,y-1)-2f(x+1,y)-f(x+1,y+1) (2)
G[f(x,y)]=|f′x(x,y)|+|f′y(x,y)| (3)
In the formulas, f′x(x, y) and f′y(x, y) denote the first differences in the x and y directions respectively, G[f(x, y)] is the Sobel gradient, and f(x, y) is the input image at integer pixel coordinates (x, y). After the gradient is obtained, a constant T is set: when G[f(x, y)] is greater than T, the point is marked as a boundary point and set to 0 in the image, while all other points are set to 255; the value of T is adjusted appropriately to achieve the best effect. After the edge detection result is obtained for an input image, the image information at the eye positions is extracted and the number of pixels with value 1 in the 8 sub-matrices of the 2nd row of the divided grid is counted; the extracted pixel counts are represented as a 100 × 8 matrix serving as the input layer of the LVQ neural network. Feature vectors extracted from the 9000 prepared samples with different face angles serve as the training set, and the test set consists of the feature vectors of 200 randomly selected pictures with different face angles. A neural network with 10 hidden-layer neurons is created, the training set and test set are fed into it for training and learning, and finally a neural network model capable of predicting the face angle is obtained.
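Formulas (1)-(3), together with the thresholding against T, can be sketched directly (a naive per-pixel loop for clarity; the boundary-point value 0 and background value 255 follow the text, and T defaults to an arbitrary illustrative value):

```python
import numpy as np

def sobel_edge_image(f, T=100):
    """Apply formulas (1)-(3): the two directional first differences,
    the gradient as the sum of their absolute values, then mark points
    with G > T as boundary points (value 0) and all other points as
    255. The image f is indexed as f[y, x]."""
    h, w = f.shape
    out = np.full((h, w), 255, dtype=np.uint8)
    fi = f.astype(np.int64)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # formula (1): first difference in the x direction
            fx = (fi[y+1, x-1] + 2*fi[y+1, x] + fi[y+1, x+1]
                  - fi[y-1, x-1] - 2*fi[y-1, x] - fi[y-1, x+1])
            # formula (2): first difference in the y direction
            fy = (fi[y-1, x-1] + 2*fi[y, x-1] + fi[y+1, x-1]
                  - fi[y-1, x+1] - 2*fi[y, x+1] - fi[y+1, x+1])
            # formula (3): gradient magnitude, then threshold at T
            if abs(fx) + abs(fy) > T:
                out[y, x] = 0
    return out
```

In practice a vectorized convolution would replace the explicit loops, but the loop form mirrors the formulas term by term.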
The face angle estimation algorithm is characterized in that a 90-degree model of a face in a lens is trained in advance by utilizing an LVQ algorithm, and 3D angle information of the face is obtained finally by inputting eye features of the face to match corresponding angles.
In this embodiment, the distance detection algorithm adopts a known monocular distance measurement algorithm, which is not described here again.
The voice directional recognition interaction method provided by the invention can be applied to intelligent voice interaction equipment; the equipment may be a robot capable of motion or rotation, or a non-mobile robot (similar to a video telephone). To speak with the robot, one must stand within its visual area: voice outside the visual area is not received, and the robot gives no reply to it.
In this embodiment, before speech recognition, the sound signal from the front is directionally picked up and signal-enhanced, and a generalized sidelobe canceller (GSC) algorithm is selected for voice signal enhancement. Specifically: the energy of the sound signal is normalized, a fixed beamformer then generates the forward speech reference signal on the main lobe, a sidelobe canceller generates the noise reference signal, and finally a noise canceller removes the noise component from the main-lobe signal.
The energy normalization is implemented by an energy normalization module, specifically according to formula (4):
[Formula (4) appears as an image in the original publication and is not reproduced here.]
the fixed beamformer forms a fixed beam by summing the signals of all array elements at the same acquisition point and dividing the sum by the number of array elements, thereby generating the forward speech reference signal on the main lobe, as implemented by formula (5):
[Formula (5) appears as an image in the original publication and is not reproduced here.]
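As described in the text, this step reduces to a sample-by-sample average over the array elements; a sketch, with illustrative names (the original formula (5) is given only as an image, so this follows the prose description rather than the formula itself):

```python
import numpy as np

def fixed_beamformer(element_frames):
    """Sum the time-aligned signals of all array elements at each
    sample and divide by the number of elements, producing the forward
    speech reference signal on the main lobe."""
    x = np.asarray(element_frames, dtype=float)  # shape: (L elements, N samples)
    return x.sum(axis=0) / x.shape[0]
```

Averaging preserves a signal common to all elements at full amplitude while uncorrelated noise is attenuated, roughly by the square root of the element count.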
the generalized sidelobe canceller algorithm adopted in this embodiment introduces the sidelobe canceller to perform adaptive noise cancellation and thereby further enhance the main-lobe signal. The sidelobe canceller processes frame data with a length of 512 points; the processing procedure is given in formulas (6), (7) and (8):
[Formula (6) appears as an image in the original publication and is not reproduced here.]
H_L(k) = [h_L,0(k), h_L,1(k), ..., h_L,511(k)]^T   (7)
D(k) = [d(k), d(k-1), ..., d(k-511)]^T   (8)
where H_L(k) is the adaptively adjusted and constrained parameter vector of the L-th array element, used to obtain a pure noise signal.
Finally, the noise canceller eliminates the noise on the main-lobe signal, i.e., the synthesized noise signal is subtracted from the main-lobe signal to further enhance the forward sound signal. The specific algorithm is given in formulas (9), (10) and (11):
[Formula (9) appears as an image in the original publication and is not reproduced here.]
where
W_L(k) = [W_L,0(k), W_L,1(k), ..., W_L,511(k)]^T   (10)
Y_L(k) = [Y_L(k), Y_L(k-1), ..., Y_L(k-511)]^T   (11)
and W_L(k) is the cancellation parameter vector in the noise canceller.
In the above algorithm, W_L(k) and H_L(k) are adaptively adjusted by the normalized minimum mean square error algorithm, which is common knowledge and is not described here again.
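A compact sketch of the noise canceller with normalized-LMS adaptation (the tap count is shortened from the 512 points of the text for illustration; names, step size and regularization constant are assumptions, not from the patent):

```python
import numpy as np

def nlms_noise_canceller(main_lobe, noise_ref, taps=8, mu=0.5, eps=1e-8):
    """An adaptive FIR filter W estimates the noise component of the
    main-lobe signal from the noise reference and subtracts it; W is
    updated by the normalized LMS rule, as the text describes for
    W_L(k). Returns the enhanced forward signal."""
    w = np.zeros(taps)
    buf = np.zeros(taps)                 # most recent noise-reference samples
    out = np.zeros(len(main_lobe))
    for k in range(len(main_lobe)):
        buf = np.roll(buf, 1)
        buf[0] = noise_ref[k]
        e = main_lobe[k] - w @ buf       # noise estimate subtracted
        out[k] = e
        w += mu * e * buf / (buf @ buf + eps)  # NLMS update
    return out
```

With a noise reference that is correlated with the noise leaking into the main lobe, the filter converges so that the residual output contains mainly the desired forward speech.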
In a more preferred embodiment of the present invention, if the distance from the face to the image capture device calculated by the distance detection algorithm is greater than 1 m, the robot is controlled to move forward until within the 1 m range.
Example 2
This embodiment discloses a device corresponding to the voice directional recognition interaction method of embodiment 1. Referring to fig. 2, the device includes:
the voice pickup device 210 is configured to directionally pick up a sound signal right in front, and perform voice recognition to obtain a voice text content;
the image acquisition equipment 220 is preset with an image acquisition angle and an acquisition distance, and acquires a face image which simultaneously meets the image acquisition angle and the acquisition distance;
and the processing unit 230 is configured to acquire the voice text content and the face image, and determine whether to reply.
In the present embodiment, the voice pickup device 210 is a fixedly installed, non-steerable array microphone configured so that its directional beam pickup range faces directly forward, with the angle controlled to 60-70 degrees and a maximum pickup distance of 1 m.
The image capturing device 220 is implemented by an installed camera. In a preferred embodiment of the present invention, if the distance from the face to the image capturing device 220 calculated by the above distance detection algorithm is greater than 1 m, the apparatus of this embodiment is controlled to move forward until within the 1 m range.
Example 3
Fig. 3 is a schematic diagram of an electronic device according to embodiment 3 of the present invention, as shown in fig. 3, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340; the number of the processors 310 in the computer device may be one or more, and one processor 310 is taken as an example in fig. 3; the processor 310, the memory 320, the input device 330 and the output device 340 in the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 3.
The memory 320 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules (e.g., the processing unit 230 in the speech orientation recognition interaction device) corresponding to the speech orientation recognition interaction method in the embodiment of the present invention. The processor 310 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 320, namely, implements the voice direction recognition interaction method of embodiment 1.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 is used in the present embodiment to receive data such as voice text content and face images. The output device 340 may include a display device such as a display screen, and in this embodiment, is used for outputting voice responses.
Example 4
Embodiment 4 of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for voice-oriented recognition interaction, the method including:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
and judging whether to reply or not according to the voice text content and the face image.
Of course, the storage medium provided by this embodiment contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the voice directional recognition interaction method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above device embodiment, the included units and modules are divided only according to functional logic; the division is not limited thereto as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (7)

1. A voice directional recognition interaction method is characterized by comprising the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply or not according to the voice text content and the face image;
when the voice text content and the face image are acquired simultaneously, a reply is made to the voice text content, otherwise, no reply is made;
the image acquisition angle is 60-70 degrees, the acquisition distance is less than or equal to 1m, and the acquisition method of the voice text content comprises the following steps: after directional pickup and signal enhancement are carried out on the sound signals in front, voice recognition is carried out;
the human face image acquisition steps are as follows: extracting features of the collected image data, judging whether the image contains a human face through a human face detection algorithm, and if not, not processing the image data; if the image contains the face, calculating 3D angle information and distance information of the face in the image by using a face angle estimation algorithm and a face distance estimation algorithm, and if the 3D angle information and the distance information of the face meet the conditions, reserving the image data as the face image; and if the condition is not met, not collecting.
2. The voice directional recognition interaction method according to claim 1, wherein the face angle estimation algorithm uses an LVQ (learning vector quantization) algorithm to pre-train a model of face orientations within 90 degrees in the frame, and the 3D angle information of the face is obtained by inputting the eye features of the face and matching them to the corresponding angle.
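The matching step in claim 2 can be illustrated with a toy LVQ1 classifier: prototype vectors are trained per orientation class, and a new eye-feature vector is assigned the angle of its nearest prototype. The feature vectors, angle labels, and one-prototype-per-class setup below are illustrative assumptions, not the patent's training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def lvq1_train(X, y, prototypes, proto_labels, lr=0.1, epochs=20):
    """LVQ1 rule: move the winning prototype toward the sample when the
    labels match, and away from it when they do not."""
    P = prototypes.copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            i = int(np.argmin(np.linalg.norm(P - x, axis=1)))
            step = lr * (x - P[i])
            P[i] += step if proto_labels[i] == label else -step
    return P

def lvq_predict(x, prototypes, proto_labels):
    """Return the angle label of the nearest prototype."""
    return proto_labels[int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))]

# Toy "eye feature" vectors for three face-orientation classes;
# the degree labels are illustrative stand-ins.
angles = [-45, 0, 45]
X = np.vstack([rng.normal(loc=a / 45.0, scale=0.1, size=(30, 2))
               for a in angles])
y = np.repeat(angles, 30)
protos = np.array([[-1.2, -1.2], [0.2, 0.2], [1.2, 1.2]])
protos = lvq1_train(X, y, protos, angles)
```

After training, a feature vector near (1.0, 1.0) is matched to the 45-degree prototype; in the claimed method the matched class would supply the face's 3D angle information.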
3. The voice directional recognition interaction method according to claim 1, wherein signal enhancement after directional pickup of the sound signal is performed with a generalized sidelobe canceller algorithm, specifically: the energy of the sound signal is first normalized; a fixed beamformer then generates a forward voice reference signal on the main lobe while a sidelobe canceller generates a noise reference signal; finally, a noise canceller eliminates the noise component in the main-lobe signal.
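The three stages of claim 3 map onto the classic generalized sidelobe canceller structure. The sketch below is a toy time-domain version under simplifying assumptions (time-aligned channels, a channel-average fixed beamformer, adjacent-channel differences as the blocking matrix, and a single-tap LMS filter as the noise canceller); a real implementation would operate on filtered, multi-tap signals.

```python
import numpy as np

def gsc(mics, mu=0.01):
    """Toy generalized sidelobe canceller for M time-aligned channels.
    mics: array of shape (M, N) holding M microphone signals."""
    M, N = mics.shape
    # Step 1: energy-normalize each channel.
    mics = mics / (np.sqrt(np.mean(mics**2, axis=1, keepdims=True)) + 1e-12)
    # Step 2a: fixed beamformer -> forward voice reference on the main lobe.
    d = mics.mean(axis=0)
    # Step 2b: blocking matrix (adjacent differences) -> noise references;
    # a signal identical on all channels is cancelled here.
    B = mics[:-1] - mics[1:]
    # Step 3: LMS adaptive noise canceller removes the noise component
    # that leaks into the main-lobe signal.
    w = np.zeros(M - 1)          # one weight per noise reference
    out = np.empty(N)
    for n in range(N):
        u = B[:, n]
        e = d[n] - w @ u         # cleaned sample = beam - estimated noise
        w += mu * e * u          # LMS weight update
        out[n] = e
    return out
```

For a source directly in front (identical on all channels), the blocking matrix outputs are zero, so the beamformer output passes through unchanged; off-axis noise appears in the references and is adaptively subtracted.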
4. A voice directional recognition interaction device, characterized by comprising:
a voice pickup device, configured to directionally pick up the sound signal from directly in front and perform voice recognition to obtain voice text content;
an image acquisition device, preset with an image acquisition angle and an acquisition distance, configured to acquire a face image that simultaneously satisfies the image acquisition angle and the acquisition distance;
wherein the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m, and the voice text content is acquired as follows: the sound signal from directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed;
features are extracted from the collected image data, and a face detection algorithm judges whether the image contains a face; if not, the image data is not processed; if the image contains a face, the 3D angle information and distance information of the face in the image are calculated with a face angle estimation algorithm and a face distance estimation algorithm; if the 3D angle information and distance information of the face satisfy the preset conditions, the image data is retained as the face image; if the conditions are not satisfied, the image is not collected;
a processing unit, configured to acquire the voice text content and the face image and judge whether to reply; when the voice text content and the face image are acquired simultaneously, a reply is made according to the voice text content; otherwise, no reply is made.
5. The voice directional recognition interaction device according to claim 4, wherein the sound pickup range of the voice pickup device for directional pickup is: a sound receiving angle of 60-70 degrees and a sound receiving distance of less than or equal to 1 m.
6. An electronic device comprising a processor, a storage medium, and a computer program stored in the storage medium, wherein the computer program, when executed by the processor, implements the voice directional recognition interaction method of any one of claims 1 to 3.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice directional recognition interaction method of any one of claims 1 to 3.
CN201910466749.8A 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium Active CN110188179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910466749.8A CN110188179B (en) 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN110188179A CN110188179A (en) 2019-08-30
CN110188179B true CN110188179B (en) 2020-06-19

Family

ID=67719234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910466749.8A Active CN110188179B (en) 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110188179B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619895A (en) * 2019-09-06 2019-12-27 Oppo广东移动通信有限公司 Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN112908334A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Hearing aid method, device and equipment based on directional pickup
CN114699777A (en) * 2022-04-13 2022-07-05 南京晓庄学院 Control method and system of toy dancing robot

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6393136B1 (en) * 1999-01-04 2002-05-21 International Business Machines Corporation Method and apparatus for determining eye contact
CN106024003A (en) * 2016-05-10 2016-10-12 北京地平线信息技术有限公司 Voice positioning and enhancement system and method combining images
CN107679506A (en) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 Awakening method, intelligent artifact and the computer-readable recording medium of intelligent artifact
CN108733420A (en) * 2018-03-21 2018-11-02 北京猎户星空科技有限公司 Awakening method, device, smart machine and the storage medium of smart machine
CN109640224A (en) * 2018-12-26 2019-04-16 北京猎户星空科技有限公司 A kind of sound pick-up method and device
CN109754814A (en) * 2017-11-08 2019-05-14 阿里巴巴集团控股有限公司 A kind of sound processing method, interactive device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820556A (en) * 2015-05-06 2015-08-05 广州视源电子科技股份有限公司 Method and device for waking up voice assistant


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech enhancement based on an improved generalized sidelobe canceller; Wang Xiaorong et al.; Journal of Hangzhou Dianzi University; 2007-10-15; vol. 27, no. 05; pp. 88-91 *
Application of a learning vector quantization neural network to face orientation recognition; Li Ping; Journal of Xinzhou Teachers University; 2018-04-28; vol. 34, no. 02; pp. 59-61 *


Similar Documents

Publication Publication Date Title
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
CN107534725B (en) Voice signal processing method and device
US20230081645A1 (en) Detecting forged facial images using frequency domain information and local correlation
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
KR100754385B1 (en) Apparatus and method for object localization, tracking, and separation using audio and video sensors
AU2022200439B2 (en) Multi-modal speech separation method and system
Schauerte et al. Multimodal saliency-based attention for object-based scene analysis
WO2019080551A1 (en) Target voice detection method and apparatus
US10582117B1 (en) Automatic camera control in a video conference system
CN108877787A (en) Audio recognition method, device, server and storage medium
US10964326B2 (en) System and method for audio-visual speech recognition
CN110837758B (en) Keyword input method and device and electronic equipment
CN110136162B (en) Unmanned aerial vehicle visual angle remote sensing target tracking method and device
CN110718227A (en) Multi-mode interaction based distributed Internet of things equipment cooperation method and system
CN111930336A (en) Volume adjusting method and device of audio device and storage medium
CN110287848A (en) The generation method and device of video
CN110110666A (en) Object detection method and device
CN115775564A (en) Audio processing method and device, storage medium and intelligent glasses
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN114120984A (en) Voice interaction method, electronic device and storage medium
CN114417908A (en) Multi-mode fusion-based unmanned aerial vehicle detection system and method
CN112507829B (en) Multi-person video sign language translation method and system
CN109522865A (en) A kind of characteristic weighing fusion face identification method based on deep neural network
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN110679586B (en) Bird repelling method and system for power transmission network and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220621

Address after: 310051 4f, Yuanyuan building, No. 528, Liye Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG XIAOYUAN ROBOT Co.,Ltd.

Address before: 23 / F, World Trade Center, 857 Xincheng Road, Binjiang District, Hangzhou City, Zhejiang Province, 310051

Patentee before: ZHEJIANG UTRY INFORMATION TECHNOLOGY Co.,Ltd.
