CN110188179B - Voice directional recognition interaction method, device, equipment and medium - Google Patents

Voice directional recognition interaction method, device, equipment and medium

Info

Publication number
CN110188179B
CN110188179B (application CN201910466749.8A)
Authority
CN
China
Prior art keywords
face
voice
image
angle
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910466749.8A
Other languages
Chinese (zh)
Other versions
CN110188179A (en)
Inventor
嵇望
汪斌
林达
李林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xiaoyuan Robot Co ltd
Original Assignee
Zhejiang Utry Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Utry Information Technology Co ltd filed Critical Zhejiang Utry Information Technology Co ltd
Priority to CN201910466749.8A
Publication of CN110188179A
Application granted
Publication of CN110188179B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Abstract

The present application relates to the field of man-machine voice interaction and discloses a voice directional recognition interaction method comprising the following steps: directionally picking up the sound signal directly in front and recognizing it to obtain voice text content; acquiring a face image that simultaneously satisfies an image acquisition angle and an acquisition distance; and judging whether to reply according to the voice text content and the face image, wherein the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m. The application also discloses a voice directional recognition interaction device, an electronic device and a computer storage medium. The method conforms to daily communication habits, effectively excludes the voices of bystanders and of the environment, and realizes effective anthropomorphic communication with the user interacting directly in front.

Description

Voice directional recognition interaction method, device, equipment and medium
Technical Field
The invention relates to the field of man-machine voice interaction, and in particular to a voice directional recognition interaction method, device, equipment and medium.
Background
At present, robots and voice assistants are generally deployed in complex environments, such as noisy conference rooms, outdoor spaces and shopping malls, where problems of noise, reverberation, voice interference and echo arise. During man-machine voice interaction, the array microphone used for sound reception picks up sound from the full 360-degree surroundings. To avoid misrecognition of environmental sound, voice interaction systems adopt a "wake-up word" technique: in practice, the robot or voice assistant recognizes speech content only after it receives the wake-up word, and otherwise performs no recognition.
The wake-up word technique is currently the main trigger mechanism for human-computer interaction with mainstream robots and smart devices. It has two drawbacks. First, the person initiating the interaction must learn to use the wake-up word; on encountering a robot in an unfamiliar place, the person has no way of knowing which word wakes it up, and without the wake-up word cannot communicate with it at all. Second, the user must speak the wake-up word before every exchange with the robot, which mechanizes the interaction and disturbs its rhythm; users forget to say the wake-up word or have to repeat it constantly, so a speaker may finish a long utterance only to find the robot was not listening.
Generally a user interacts with a robot by standing directly in front of it. However, because the omnidirectional array microphone receives sound from all directions, the voices of bystanders and of the environment are mixed into the received signal; human voices and noise behind or around the robot are also collected and recognized. This reduces speech recognition accuracy, and even when recognition is correct the robot may respond mistakenly, so effective communication with the user interacting in front cannot be achieved.
To address these problems, Chinese patent CN105204628A discloses a voice control method based on visual wake-up: after receiving at least part of a voice signal, the voice control device starts its image receiving unit, which acquires images and passes them to an image recognition unit; speech recognition is performed only when a human face whose line of sight faces the device is recognized. However, that patent still does not resolve interference from environmental noise: when multiple sound sources are present within the 360-degree range of the device, for example when the image receiving unit recognizes a face while the device simultaneously receives several surrounding voice signals, the recognition result is still disturbed by external environmental noise.
Disclosure of Invention
In order to overcome the defects of the prior art, one objective of the present invention is to provide a voice directional recognition interaction method that combines a face image with a voice signal to determine the specific interaction object and then makes a targeted reply, thereby conforming to daily communication habits.
One of the purposes of the invention is realized by adopting the following technical scheme:
a voice directional recognition interaction method is characterized by comprising the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply or not according to the voice text content and the face image;
the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m, and the voice text content is acquired as follows: the sound signal directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed.
Further, the face image is acquired as follows: features are extracted from the collected image data, and a face detection algorithm judges whether the image contains a face; if not, the image data is not processed. If the image contains a face, a face angle estimation algorithm and a face distance estimation algorithm calculate the 3D angle information and distance information of the face in the image; if these satisfy the conditions, the image data is retained as the face image, and otherwise it is not collected.
Further, when the voice text content and the face image are acquired simultaneously, a reply is made to the voice text content, otherwise, no reply is made.
Further, the face angle estimation algorithm adopts a face detection algorithm based on a convolutional neural network, and comprises the following steps:
establishing a face picture library, extracting and analyzing the features of the face picture library, extracting the forms and positions of the five sense organs, and counting to obtain a statistical analysis result;
training the statistical analysis result with a deep convolutional neural network method to establish a part classifier; the characteristic parts of the face in the image data are scored with this part classifier, rule analysis is then performed on the scores of the characteristic parts to obtain face candidate regions, and finally a boundary regression algorithm is combined to obtain the final face detection result.
Furthermore, the face angle estimation algorithm uses the LVQ algorithm to train in advance a model of 90 face angles in the lens, and obtains the 3D angle information of the face by matching the input eye features of the face to the corresponding angle.
Further, after the directional pickup of the sound signal, signal enhancement is performed by adopting a generalized sidelobe canceller algorithm, specifically: the energy normalization of the sound signal is carried out, then a forward voice reference signal on a main lobe is generated through a fixed beam former, a noise reference signal is generated through a side lobe canceller, and finally a noise component on the main lobe signal is eliminated through a noise canceller.
The second purpose of the present invention is to provide a voice directional recognition interactive device, which is realized by adopting the following technical scheme:
speech orientation recognition interaction device, comprising:
the voice pickup equipment is used for directionally picking up the sound signal right in front and carrying out voice recognition to obtain voice text content;
the image acquisition equipment is provided with an image acquisition angle and an acquisition distance in advance and acquires a human face image which meets the image acquisition angle and the acquisition distance simultaneously;
and the processing unit is used for acquiring the voice text content and the face image and judging whether to reply or not.
Further, the sound reception range of the directional pickup of the voice pickup device is as follows: the sound receiving angle is 60-70 degrees, and the sound receiving distance is less than or equal to 1 m.
It is a further object of the invention to provide an electronic device comprising a processor, a storage medium and a computer program stored in the storage medium, which computer program, when executed by the processor, performs the above-mentioned speech orientation recognition interaction method.
It is a further object of the present invention to provide a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above voice directional recognition interaction method.
Compared with the prior art, the invention has the beneficial effects that:
sound signals from the front are directionally recognized: the angle and distance of the picked-up sound signals are limited and the picked-up signals are enhanced, so that interference from environmental noise is eliminated and the recognition effect is no longer degraded by picking up multiple surrounding sound signals. The angle and distance at which the image acquisition device collects the face image are likewise limited, and a corresponding reply is made only when a voice signal and a face image within the specified range and distance are acquired simultaneously. This matches the way people communicate daily, facilitates effective communication, and improves the anthropomorphic effect of man-machine communication.
Drawings
FIG. 1 is a flow chart of a voice directional recognition interaction method according to an embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a speech orientation recognition interaction device according to embodiment 2 of the present invention;
fig. 3 is a block diagram of the electronic device according to embodiment 3 of the present invention.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings, in which the description of the invention is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form other embodiments not shown in the following description.
Example 1
The voice directional recognition interaction method directionally obtains a voice signal and a face image meeting the requirements, and performs voice interaction on that basis. As shown in fig. 1, the method comprises the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply according to the acquired voice text content and the acquired face image;
the image acquisition angle is 60-70 degrees, and the acquisition distance is less than or equal to 1 m.
The voice text content is acquired as follows: the sound signal directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed.
A reply is made according to the voice text content only when the voice text content and the face image are acquired simultaneously; otherwise, no reply is made.
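As a minimal sketch of this decision rule (the function name and argument types are illustrative, not from the patent):

```python
def should_reply(voice_text, face_image):
    """Reply only when non-empty voice text content and a qualifying
    face image have been acquired at the same time; otherwise stay
    silent. A sketch of the gating rule, not the patent's actual
    implementation."""
    return bool(voice_text) and face_image is not None
```

Either input missing means the device treats the sound as not addressed to it and gives no answer.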
In this embodiment, a face detection algorithm, a face angle estimation algorithm and a distance detection algorithm determine whether a face appears within the 60-70 degree angle range and within the 1 m distance range. The face image is collected as follows: features are first extracted from the collected image data, and the face detection algorithm judges whether the image contains a face; if not, the image data is not processed.
If the image contains a face, the face angle estimation algorithm and the distance detection algorithm calculate the 3D angle information and distance information of the face in the image; if these satisfy the conditions, the image data is retained as the face image, and if not, no processing is performed.
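The angle and distance check can be sketched as follows (a minimal illustration; the function and parameter names are not from the patent, and reading the 60-70 degree window as an inclusive range is an assumption):

```python
def accept_face(face_angle_deg, distance_m,
                angle_min=60.0, angle_max=70.0, max_distance_m=1.0):
    """Retain the image data as a face image only when the estimated
    face angle lies within the 60-70 degree acquisition window and the
    estimated distance does not exceed 1 m (thresholds taken from the
    text; the gating logic itself is an illustrative sketch)."""
    return angle_min <= face_angle_deg <= angle_max and distance_m <= max_distance_m
```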
The face detection algorithm is based on a convolutional neural network method, and specifically comprises the following steps:
establishing a face picture library, extracting and analyzing the features of the face picture library, extracting the forms and positions of the five sense organs, and counting to obtain a statistical analysis result;
training the statistical analysis result with a deep convolutional neural network method to establish a part classifier; the characteristic parts of the face in the image data are scored with this part classifier, rule analysis is then performed on the scores of the characteristic parts to obtain face candidate regions, and finally a boundary regression algorithm is combined to obtain the final face detection result.
After a face detection result is obtained, a face head portrait is extracted from original image data, head portrait characteristics are extracted from the face head portrait, and 3D angle information of the face head portrait is obtained by calculating the characteristics through a face angle estimation algorithm.
The face angle estimation algorithm of the present invention adopts a forward neural network based on learning vector quantization (LVQ). First, a set of images with different face angles is prepared: the images come from 100 different persons, 90 images per person, with the face angle sweeping from left to right in sequence, one face image per 1-degree step.
Feature vectors describing the eye positions in the pictures are extracted as the input of the LVQ neural network, and the 90 angles are represented by 1, 2, 3, ..., 90 respectively. By training on the images of the training set, a network with predictive capability is obtained, which can judge the angle of any given face image.
The eye-position feature vector is extracted as follows: the 9000 collected images are preprocessed, the face region around the eyes is cropped to 320 × 360, and each cropped image is named in the format "person number _ face angle" and converted to a binary gray-scale image. The image is then divided into 6 rows and 8 columns; the 8 sub-matrices of the 2nd row describe the position information of the eyes. After edge detection with the Sobel operator, the number of pixels with value "1" in these 8 sub-matrices relates directly to the face angle, so only the "1" pixels in the 8 sub-matrices of the 2nd row are counted.
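A rough sketch of this feature extraction, under the assumption that the binarized edge image is stored as a row-first NumPy array (the function name and orientation convention are illustrative):

```python
import numpy as np

def eye_feature_vector(edge_img):
    """Divide the binary edge image (e.g. 360 rows x 320 columns) into
    a 6-row, 8-column grid and count the pixels with value 1 in each of
    the 8 sub-matrices of the 2nd grid row, giving the 8-dimensional
    eye-position feature described in the text."""
    h, w = edge_img.shape
    rh, cw = h // 6, w // 8        # sub-matrix height and width
    r = 1                          # the 2nd grid row, zero-indexed
    return np.array([int(edge_img[r*rh:(r+1)*rh, c*cw:(c+1)*cw].sum())
                     for c in range(8)])
```

Stacking these 8-dimensional vectors over the sample images would then form the input matrix fed to the LVQ network.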
Edge detection uses the Sobel operator, a set of directional operators that detect edges along different directions. The Sobel operator strengthens the weights of the pixels above, below, left and right of the central pixel, and its output is an edge image. The operator is computed according to formulas (1), (2) and (3):
f′x(x,y)=f(x-1,y+1)+2f(x,y+1)+f(x+1,y+1)-f(x-1,y-1)-2f(x,y-1)-f(x+1,y-1) (1)
f′y(x,y)=f(x-1,y-1)+2f(x-1,y)+f(x-1,y+1)-f(x+1,y-1)-2f(x+1,y)-f(x+1,y+1) (2)
G[f(x,y)]=|f′x(x,y)|+|f′y(x,y)| (3)
In the formulas, f′x(x, y) and f′y(x, y) denote the first differences in the x and y directions respectively, G[f(x, y)] is the Sobel gradient, and f(x, y) is the input image at integer pixel coordinates (x, y). After the gradient is obtained, a constant T is set: when G[f(x, y)] is greater than T, the point is marked as a boundary point and set to 0 in the image, while all other points are set to 255; the value of T is adjusted appropriately to achieve the best effect. After the edge detection result is obtained for an input image, the image information at the eye positions is extracted and the number of pixels with value 1 in the 8 sub-matrices of the 2nd row of the divided grid is counted; the extracted pixel counts are represented as a 100 × 8 matrix serving as the input layer of the LVQ neural network. Feature vectors extracted from the 9000 prepared samples with different face angles serve as the training set, and the test set consists of the feature vectors of 200 randomly selected pictures with different face angles. A neural network with 10 hidden-layer neurons is created, the training set and test set are fed into it for training and learning, and finally a neural network model capable of predicting the face angle is obtained.
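Formulas (1)-(3), together with the thresholding against T, can be sketched directly (a naive per-pixel loop for clarity; the boundary-point value 0 and background value 255 follow the text, and T defaults to an arbitrary illustrative value):

```python
import numpy as np

def sobel_edge_image(f, T=100):
    """Apply formulas (1)-(3): the two directional first differences,
    the gradient as the sum of their absolute values, then mark points
    with G > T as boundary points (value 0) and all other points as
    255. The image f is indexed as f[y, x]."""
    h, w = f.shape
    out = np.full((h, w), 255, dtype=np.uint8)
    fi = f.astype(np.int64)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # formula (1): first difference in the x direction
            fx = (fi[y+1, x-1] + 2*fi[y+1, x] + fi[y+1, x+1]
                  - fi[y-1, x-1] - 2*fi[y-1, x] - fi[y-1, x+1])
            # formula (2): first difference in the y direction
            fy = (fi[y-1, x-1] + 2*fi[y, x-1] + fi[y+1, x-1]
                  - fi[y-1, x+1] - 2*fi[y, x+1] - fi[y+1, x+1])
            # formula (3): gradient magnitude, then threshold at T
            if abs(fx) + abs(fy) > T:
                out[y, x] = 0
    return out
```

In practice a vectorized convolution would replace the explicit loops, but the loop form mirrors the formulas term by term.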
The face angle estimation algorithm is characterized in that a 90-degree model of a face in a lens is trained in advance by utilizing an LVQ algorithm, and 3D angle information of the face is obtained finally by inputting eye features of the face to match corresponding angles.
In this embodiment, the distance detection algorithm adopts a known monocular distance measurement algorithm, which is not described here again.
The voice directional recognition interaction method provided by the invention can be applied to intelligent voice interaction equipment; the equipment may be a robot capable of motion or rotation, or a non-mobile robot (similar to a video telephone). To speak with the robot, one must stand within its visual area: voice outside the visual area is not received, and the robot gives no reply to it.
In this embodiment, before speech recognition, the sound signal from the front is directionally picked up and signal-enhanced, and a generalized sidelobe canceller (GSC) algorithm is selected for voice signal enhancement. Specifically: the energy of the sound signal is normalized, a fixed beamformer then generates the forward speech reference signal on the main lobe, a sidelobe canceller generates the noise reference signal, and finally a noise canceller removes the noise component from the main-lobe signal.
The energy normalization is implemented by an energy normalization module, specifically according to formula (4):
[Formula (4) appears as an image in the original publication and is not reproduced here.]
the fixed beamformer forms a fixed beam by summing the signals of all array elements at the same acquisition point and dividing the sum by the number of array elements, thereby generating the forward speech reference signal on the main lobe, as implemented by formula (5):
[Formula (5) appears as an image in the original publication and is not reproduced here.]
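As described in the text, this step reduces to a sample-by-sample average over the array elements; a sketch, with illustrative names (the original formula (5) is given only as an image, so this follows the prose description rather than the formula itself):

```python
import numpy as np

def fixed_beamformer(element_frames):
    """Sum the time-aligned signals of all array elements at each
    sample and divide by the number of elements, producing the forward
    speech reference signal on the main lobe."""
    x = np.asarray(element_frames, dtype=float)  # shape: (L elements, N samples)
    return x.sum(axis=0) / x.shape[0]
```

Averaging preserves a signal common to all elements at full amplitude while uncorrelated noise is attenuated, roughly by the square root of the element count.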
the generalized sidelobe canceller algorithm adopted in this embodiment introduces the sidelobe canceller to perform adaptive noise cancellation and thereby further enhance the main-lobe signal. The sidelobe canceller processes frame data with a length of 512 points; the processing procedure is given in formulas (6), (7) and (8):
[Formula (6) appears as an image in the original publication and is not reproduced here.]
H_L(k) = [h_L,0(k), h_L,1(k), ..., h_L,511(k)]^T   (7)
D(k) = [d(k), d(k-1), ..., d(k-511)]^T   (8)
where H_L(k) is the adaptively adjusted and constrained parameter vector of the L-th array element, used to obtain a pure noise signal.
Finally, the noise canceller eliminates the noise on the main-lobe signal, i.e., the synthesized noise signal is subtracted from the main-lobe signal to further enhance the forward sound signal. The specific algorithm is given in formulas (9), (10) and (11):
[Formula (9) appears as an image in the original publication and is not reproduced here.]
where
W_L(k) = [W_L,0(k), W_L,1(k), ..., W_L,511(k)]^T   (10)
Y_L(k) = [Y_L(k), Y_L(k-1), ..., Y_L(k-511)]^T   (11)
and W_L(k) is the cancellation parameter vector in the noise canceller.
In the above algorithm, W_L(k) and H_L(k) are adaptively adjusted by the normalized minimum mean square error algorithm, which is common knowledge and is not described here again.
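A compact sketch of the noise canceller with normalized-LMS adaptation (the tap count is shortened from the 512 points of the text for illustration; names, step size and regularization constant are assumptions, not from the patent):

```python
import numpy as np

def nlms_noise_canceller(main_lobe, noise_ref, taps=8, mu=0.5, eps=1e-8):
    """An adaptive FIR filter W estimates the noise component of the
    main-lobe signal from the noise reference and subtracts it; W is
    updated by the normalized LMS rule, as the text describes for
    W_L(k). Returns the enhanced forward signal."""
    w = np.zeros(taps)
    buf = np.zeros(taps)                 # most recent noise-reference samples
    out = np.zeros(len(main_lobe))
    for k in range(len(main_lobe)):
        buf = np.roll(buf, 1)
        buf[0] = noise_ref[k]
        e = main_lobe[k] - w @ buf       # noise estimate subtracted
        out[k] = e
        w += mu * e * buf / (buf @ buf + eps)  # NLMS update
    return out
```

With a noise reference that is correlated with the noise leaking into the main lobe, the filter converges so that the residual output contains mainly the desired forward speech.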
In a more preferred embodiment of the present invention, if the distance from the face to the image capture device calculated by the distance detection algorithm is greater than 1 m, the robot is controlled to move forward until within the 1 m range.
Example 2
This embodiment discloses a device corresponding to the voice directional recognition interaction method of embodiment 1. Referring to fig. 2, the device includes:
the voice pickup device 210 is configured to directionally pick up a sound signal right in front, and perform voice recognition to obtain a voice text content;
the image acquisition equipment 220 is preset with an image acquisition angle and an acquisition distance, and acquires a face image which simultaneously meets the image acquisition angle and the acquisition distance;
and the processing unit 230 is configured to acquire the voice text content and the face image, and determine whether to reply.
In the present embodiment, the voice pickup device 210 is a fixedly installed, non-steerable array microphone configured so that its directional beam pickup range faces directly forward, with the angle controlled to 60-70 degrees and a maximum pickup distance of 1 m.
The image capturing device 220 is implemented by an installed camera. In a preferred embodiment of the present invention, if the distance from the face to the image capturing device 220 calculated by the above distance detection algorithm is greater than 1 m, the apparatus of this embodiment is controlled to move forward until within the 1 m range.
Example 3
Fig. 3 is a schematic diagram of an electronic device according to embodiment 3 of the present invention, as shown in fig. 3, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340; the number of the processors 310 in the computer device may be one or more, and one processor 310 is taken as an example in fig. 3; the processor 310, the memory 320, the input device 330 and the output device 340 in the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 3.
The memory 320 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules (e.g., the processing unit 230 in the speech orientation recognition interaction device) corresponding to the speech orientation recognition interaction method in the embodiment of the present invention. The processor 310 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 320, namely, implements the voice direction recognition interaction method of embodiment 1.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 is used in the present embodiment to receive data such as voice text content and face images. The output device 340 may include a display device such as a display screen, and in this embodiment, is used for outputting voice responses.
Example 4
Embodiment 4 of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for voice-oriented recognition interaction, the method including:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
and judging whether to reply or not according to the voice text content and the face image.
Of course, the storage medium provided by this embodiment contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the voice directional recognition interaction method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above device embodiment, the included units and modules are divided only according to functional logic; the division is not limited thereto as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (7)

1. A voice directional recognition interaction method is characterized by comprising the following steps:
acquiring collected voice text content;
acquiring a face image which simultaneously meets the image acquisition angle and the acquisition distance;
judging whether to reply or not according to the voice text content and the face image;
when the voice text content and the face image are acquired simultaneously, a reply is made to the voice text content, otherwise, no reply is made;
the image acquisition angle is 60-70 degrees, the acquisition distance is less than or equal to 1m, and the acquisition method of the voice text content comprises the following steps: after directional pickup and signal enhancement are carried out on the sound signals in front, voice recognition is carried out;
the human face image acquisition steps are as follows: extracting features of the collected image data, judging whether the image contains a human face through a human face detection algorithm, and if not, not processing the image data; if the image contains the face, calculating 3D angle information and distance information of the face in the image by using a face angle estimation algorithm and a face distance estimation algorithm, and if the 3D angle information and the distance information of the face meet the conditions, reserving the image data as the face image; and if the condition is not met, not collecting.
2. The voice directional recognition interaction method according to claim 1, wherein the face angle estimation algorithm uses an LVQ (learning vector quantization) algorithm to pre-train a model of face orientations within 90 degrees in the frame, and the 3D angle information of the face is obtained by inputting the eye features of the face and matching them to the corresponding angle.
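The matching step in claim 2 can be illustrated with a toy LVQ1 classifier: prototype vectors are trained per orientation class, and a new eye-feature vector is assigned the angle of its nearest prototype. The feature vectors, angle labels, and one-prototype-per-class setup below are illustrative assumptions, not the patent's training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def lvq1_train(X, y, prototypes, proto_labels, lr=0.1, epochs=20):
    """LVQ1 rule: move the winning prototype toward the sample when the
    labels match, and away from it when they do not."""
    P = prototypes.copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            i = int(np.argmin(np.linalg.norm(P - x, axis=1)))
            step = lr * (x - P[i])
            P[i] += step if proto_labels[i] == label else -step
    return P

def lvq_predict(x, prototypes, proto_labels):
    """Return the angle label of the nearest prototype."""
    return proto_labels[int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))]

# Toy "eye feature" vectors for three face-orientation classes;
# the degree labels are illustrative stand-ins.
angles = [-45, 0, 45]
X = np.vstack([rng.normal(loc=a / 45.0, scale=0.1, size=(30, 2))
               for a in angles])
y = np.repeat(angles, 30)
protos = np.array([[-1.2, -1.2], [0.2, 0.2], [1.2, 1.2]])
protos = lvq1_train(X, y, protos, angles)
```

After training, a feature vector near (1.0, 1.0) is matched to the 45-degree prototype; in the claimed method the matched class would supply the face's 3D angle information.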
3. The voice directional recognition interaction method according to claim 1, wherein signal enhancement after directional pickup of the sound signal is performed with a generalized sidelobe canceller algorithm, specifically: the energy of the sound signal is first normalized; a fixed beamformer then generates a forward voice reference signal on the main lobe while a sidelobe canceller generates a noise reference signal; finally, a noise canceller eliminates the noise component in the main-lobe signal.
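The three stages of claim 3 map onto the classic generalized sidelobe canceller structure. The sketch below is a toy time-domain version under simplifying assumptions (time-aligned channels, a channel-average fixed beamformer, adjacent-channel differences as the blocking matrix, and a single-tap LMS filter as the noise canceller); a real implementation would operate on filtered, multi-tap signals.

```python
import numpy as np

def gsc(mics, mu=0.01):
    """Toy generalized sidelobe canceller for M time-aligned channels.
    mics: array of shape (M, N) holding M microphone signals."""
    M, N = mics.shape
    # Step 1: energy-normalize each channel.
    mics = mics / (np.sqrt(np.mean(mics**2, axis=1, keepdims=True)) + 1e-12)
    # Step 2a: fixed beamformer -> forward voice reference on the main lobe.
    d = mics.mean(axis=0)
    # Step 2b: blocking matrix (adjacent differences) -> noise references;
    # a signal identical on all channels is cancelled here.
    B = mics[:-1] - mics[1:]
    # Step 3: LMS adaptive noise canceller removes the noise component
    # that leaks into the main-lobe signal.
    w = np.zeros(M - 1)          # one weight per noise reference
    out = np.empty(N)
    for n in range(N):
        u = B[:, n]
        e = d[n] - w @ u         # cleaned sample = beam - estimated noise
        w += mu * e * u          # LMS weight update
        out[n] = e
    return out
```

For a source directly in front (identical on all channels), the blocking matrix outputs are zero, so the beamformer output passes through unchanged; off-axis noise appears in the references and is adaptively subtracted.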
4. A voice directional recognition interaction device, characterized by comprising:
a voice pickup device, configured to directionally pick up the sound signal from directly in front and perform voice recognition to obtain voice text content;
an image acquisition device, preset with an image acquisition angle and an acquisition distance, configured to acquire a face image that simultaneously satisfies the image acquisition angle and the acquisition distance;
wherein the image acquisition angle is 60-70 degrees and the acquisition distance is less than or equal to 1 m, and the voice text content is acquired as follows: the sound signal from directly in front is directionally picked up and signal-enhanced, and voice recognition is then performed;
features are extracted from the collected image data, and a face detection algorithm judges whether the image contains a face; if not, the image data is not processed; if the image contains a face, the 3D angle information and distance information of the face in the image are calculated with a face angle estimation algorithm and a face distance estimation algorithm; if the 3D angle information and distance information of the face satisfy the preset conditions, the image data is retained as the face image; if the conditions are not satisfied, the image is not collected;
a processing unit, configured to acquire the voice text content and the face image and judge whether to reply; when the voice text content and the face image are acquired simultaneously, a reply is made according to the voice text content; otherwise, no reply is made.
5. The voice directional recognition interaction device according to claim 4, wherein the sound pickup range of the voice pickup device for directional pickup is: a sound receiving angle of 60-70 degrees and a sound receiving distance of less than or equal to 1 m.
6. An electronic device comprising a processor, a storage medium, and a computer program stored in the storage medium, wherein the computer program, when executed by the processor, implements the voice directional recognition interaction method of any one of claims 1 to 3.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice directional recognition interaction method of any one of claims 1 to 3.
CN201910466749.8A 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium Active CN110188179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910466749.8A CN110188179B (en) 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN110188179A CN110188179A (en) 2019-08-30
CN110188179B true CN110188179B (en) 2020-06-19

Family

ID=67719234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910466749.8A Active CN110188179B (en) 2019-05-30 2019-05-30 Voice directional recognition interaction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110188179B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619895A (en) * 2019-09-06 2019-12-27 Oppo广东移动通信有限公司 Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN112908334A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Hearing aid method, device and equipment based on directional pickup
CN114699777A (en) * 2022-04-13 2022-07-05 南京晓庄学院 Control method and system of toy dancing robot

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6393136B1 (en) * 1999-01-04 2002-05-21 International Business Machines Corporation Method and apparatus for determining eye contact
CN106024003A (en) * 2016-05-10 2016-10-12 北京地平线信息技术有限公司 Voice positioning and enhancement system and method combining images
CN107679506A (en) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 Awakening method, intelligent artifact and the computer-readable recording medium of intelligent artifact
CN108733420A (en) * 2018-03-21 2018-11-02 北京猎户星空科技有限公司 Awakening method, device, smart machine and the storage medium of smart machine
CN109640224A (en) * 2018-12-26 2019-04-16 北京猎户星空科技有限公司 A kind of sound pick-up method and device
CN109754814A (en) * 2017-11-08 2019-05-14 阿里巴巴集团控股有限公司 A kind of sound processing method, interactive device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820556A (en) * 2015-05-06 2015-08-05 广州视源电子科技股份有限公司 Method and device for waking up voice assistant


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech enhancement based on an improved generalized sidelobe canceller; Wang Xiaorong et al.; Journal of Hangzhou Dianzi University; 2007-10-15; vol. 27, no. 05; pp. 88-91 *
Application of a learning vector quantization neural network to face orientation recognition; Li Ping; Journal of Xinzhou Teachers University; 2018-04-28; vol. 34, no. 02; pp. 59-61 *


Similar Documents

Publication Publication Date Title
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
CN107534725B (en) Voice signal processing method and device
US20230081645A1 (en) Detecting forged facial images using frequency domain information and local correlation
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
KR100754385B1 (en) Apparatus and method for object localization, tracking, and separation using audio and video sensors
AU2022200439B2 (en) Multi-modal speech separation method and system
Schauerte et al. Multimodal saliency-based attention for object-based scene analysis
WO2019080551A1 (en) Target voice detection method and apparatus
US10582117B1 (en) Automatic camera control in a video conference system
CN108877787A (en) Audio recognition method, device, server and storage medium
US10964326B2 (en) System and method for audio-visual speech recognition
CN110837758B (en) Keyword input method and device and electronic equipment
CN110136162B (en) Unmanned aerial vehicle visual angle remote sensing target tracking method and device
CN110718227A (en) Multi-mode interaction based distributed Internet of things equipment cooperation method and system
CN111930336A (en) Volume adjusting method and device of audio device and storage medium
CN110287848A (en) The generation method and device of video
CN110110666A (en) Object detection method and device
CN115775564A (en) Audio processing method and device, storage medium and intelligent glasses
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN114120984A (en) Voice interaction method, electronic device and storage medium
CN114417908A (en) Multi-mode fusion-based unmanned aerial vehicle detection system and method
CN112507829B (en) Multi-person video sign language translation method and system
CN109522865A (en) A kind of characteristic weighing fusion face identification method based on deep neural network
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN110679586B (en) Bird repelling method and system for power transmission network and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220621

Address after: 310051 4f, Yuanyuan building, No. 528, Liye Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG XIAOYUAN ROBOT Co.,Ltd.

Address before: 23 / F, World Trade Center, 857 Xincheng Road, Binjiang District, Hangzhou City, Zhejiang Province, 310051

Patentee before: ZHEJIANG UTRY INFORMATION TECHNOLOGY Co.,Ltd.
