CN114253502A - Dynamic volume adjusting method based on face characteristic point calculation - Google Patents

Dynamic volume adjusting method based on face characteristic point calculation

Info

Publication number
CN114253502A
CN114253502A (application CN202111546566.0A)
Authority
CN
China
Prior art keywords
face
user
module
volume
tracking
Prior art date
Legal status
Pending
Application number
CN202111546566.0A
Other languages
Chinese (zh)
Inventor
陈再蝶
朱晓秋
刘明锋
樊伟东
Current Assignee
Zhejiang Kangxu Technology Co ltd
Original Assignee
Zhejiang Kangxu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Kangxu Technology Co ltd
Priority to CN202111546566.0A
Publication of CN114253502A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a dynamic volume adjustment method based on face feature point calculation, which comprises the following steps: S1, entering the face of the user through a user face entry module, and storing the face in a user database as the base data for subsequent pedestrian tracking and face recognition; S2, tracking the user in real time through a user tracking module; and S3, acquiring the position and size of the face image through a face frame detection module. The invention comprehensively applies mature face recognition, face key point detection and pedestrian tracking technologies: an internal system of the speaker detects the user's face key points in real time, intelligently adjusts the playback volume according to the key point information, and finally fine-tunes the volume with personal information entered by the user, such as gender and age, freeing the user's hands and attention from the speaker.

Description

Dynamic volume adjusting method based on face characteristic point calculation
Technical Field
The invention relates to the technical field of intelligent voice, and in particular to a dynamic volume adjustment method based on face feature point calculation.
Background
With the rapid development of artificial intelligence in recent years, and especially of the three core intelligent-voice technologies of speech recognition, natural language processing and speech synthesis, intelligent voice systems can realize human-machine dialogue, human-machine interaction, intelligent judgment and decision-making through key links such as speech acquisition, speech recognition, natural language understanding and speech synthesis. The application fields of intelligent voice continue to expand, for example smart homes, intelligent vehicle-mounted systems, intelligent robots, AI education and intelligent customer service. The core of intelligent voice is human-machine interaction, and it has become an application field of key attention for science and technology enterprises at home and abroad.
Volume adjustment in human-computer interaction generally goes through three steps: (1) the smart device converts the human voice signal into text; (2) the machine understands the real semantics of the text, such as a volume-up command; (3) after understanding, the speaker reacts according to the instruction and automatically synthesizes the output result into speech returned to the user. These three processes correspond to speech recognition, natural language processing and speech synthesis. Although smart speakers using human-machine voice interaction have many application examples, and a good smart speaker can meet the basic voice-interaction capability required for the user to adjust the volume, the control process still has the following disadvantages:
first, the natural language processing model relied upon needs abundant data and sufficient computing power for model training; when untrained speech interacts with the smart speaker, the natural language processing model cannot understand it accurately, so the volume adjustment fails, and the user's human-machine interaction content is limited at that point;
second, with the speech recognition technology used, user differences in timbre, sound quality, volume and so on also affect the generalization of the speech recognition model, and the training data are mostly standard pronunciations, leaving such specialized data missing; for example, when a niche dialect with a unique regional pronunciation is used for human-machine interaction, volume adjustment during the speaker interaction fails;
apart from the above volume control that relies solely on deep learning models, volume control in traditional smart speaker human-machine interaction relies more on additional sensors built into the speaker, such as infrared sensors and signal-strength receivers, which receive the user's instruction and then react and adjust according to it. However, traditional sensor-based volume control of the speaker has the following shortcomings:
(1) the sensors are easily interfered with by external signals, to varying degrees, which seriously affects the system's judgment of signal strength;
(2) the signals acquired by the sensors are only rough estimates and differ considerably from the true values;
(3) the user must stop and attend to the volume control process and cannot take care of other work at the same time, which defeats the original purpose of developing smart devices.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a dynamic volume adjustment method based on face feature point calculation is proposed.
In order to achieve the purpose, the invention adopts the following technical scheme:
a dynamic volume adjustment method based on face feature point calculation comprises the following steps:
s1, inputting the face of the user through a user face input module, and storing the face in a user database as the basic data of later pedestrian tracking and face recognition;
s2, carrying out real-time pedestrian tracking on the user through a user tracking module;
s3, obtaining the position and size of the face image through the face frame detection module, and giving the user the choice "whether to recognize as a designated face? "is the user selected" is recognized as a designated face? If yes, the human face frame detection module returns all different human face areas for pedestrian tracking after detecting the human faces in the data stream;
s4, tracking the face of the user in all face areas output by the face frame detection module, comparing the registered face of the user in the user database by using the face recognition module, judging whether the registered face is the registered user, if so, entering a face key point detection process, and if not, entering a new user registration link;
s5, cutting the user face in the real-time data stream detected by the face detection module, inputting the cut face partial image into the face key point detection module, positioning key area information of the face by the face key point detection module, and outputting key point coordinates and rotation angles of the face in the face partial image;
s6, the volume adjusting module estimates the distance range between the user and the sound box by obtaining the key point coordinates and the rotation angle of the face, the volume adjusting range of the volume adjusting module is set to eight levels and corresponds to eight different distance ranges, when the real-time distance of the user is unchanged in the corresponding level range, the volume is kept unchanged, and when the real-time distance of the user is changed in the corresponding level range, the volume is automatically adjusted to the volume corresponding to the corresponding distance range.
As a further description of the above technical solution:
when the face of the user is entered through the user face entry module, the user tracking module tracks the largest face target in the data stream, and stores the face of the registered user in the user database to complete the registration of the new user;
when the user tracking module does not perform pedestrian-tracking volume adjustment for the user: in step S3, if the user selects "no" for "recognize a designated face?", the face of the user is not registered, and after detecting the faces in the data stream the face frame detection module returns the face region with the largest area for pedestrian tracking and key point detection.
As a further description of the above technical solution:
the human face key point detection module is used for positioning key region positions of the human face, including eyebrows, eyes, a nose, a mouth and face contours, and the detection method used by the human face key point detection module is one of model-based ASM/AAM, CPR (shape regression) or deep learning-based methods.
As a further description of the above technical solution:
the method also comprises the step of inputting unnecessary information of the user, the gender and the age of the user are input through a user information input module and are stored in a user database corresponding to a registered user, and before the volume is adjusted through a volume adjusting module, whether the user inputs the information of the user is inquired, and the volume is finely adjusted.
As a further description of the above technical solution:
the volume adjusting module comprises a buffer adjusting mechanism, and a delay parameter is added into the buffer adjusting mechanism, so that the volume is not immediately reduced or increased in the volume adjusting process, and the volume is adjusted after the delay parameter.
As a further description of the above technical solution:
the face frame detection module is trained through a training data set, the labeling mode of the training data set is rectangular labeling, the labeled region is a rectangle containing a face part, and the face frame detection module outputs face rectangular region coordinate values after a target detection algorithm based on deep learning is utilized.
As a further description of the above technical solution:
the sound box is internally provided with a computer system, a high-definition camera and a sound input and output module, the computer system is used for data operation and flow support, the sound input and output module is used for interacting with a user, and the high-definition camera is used for image acquisition.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The invention comprehensively applies mature face recognition, face key point detection and pedestrian tracking technologies: the speaker's internal system detects the user's face key points in real time and then intelligently adjusts the playback volume according to the key point information, freeing the user's hands and attention from the speaker. To achieve the expected effect, the pedestrian tracking technology tracks the user in the image stream, the user's partial face image is then determined, key point detection is performed on the face image at the corresponding moment, the relationships between the face key points are analyzed to estimate the real-time distance range between the user and the speaker, and finally the volume is adjusted in combination with personal information entered by the user, such as gender and age.
2. When a new user uses the invention, since pedestrian tracking and face recognition are involved, a voice prompt "recognize a designated face?" must be provided, and selecting "yes" registers and stores the new user's face; the invention can also choose not to track the user for volume adjustment, in which case "no" is selected at the voice prompt "recognize a designated face?" and the system performs pedestrian tracking on the largest detected face and adjusts the volume accordingly.
3. The invention requires a buffer adjustment mechanism in the volume adjustment link, so as to avoid affecting the playback effect during adjustment and avoid damaging the hearing health of a user at close range.
Drawings
Fig. 1 is a schematic workflow diagram of a dynamic volume adjustment method based on face feature point calculation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present invention provides a technical solution: a dynamic volume adjustment method based on face feature point calculation, comprising the following steps:
s1, inputting the face of the user through a user face input module, and storing the face in a user database as the basic data of later pedestrian tracking and face recognition;
s2, carrying out real-time pedestrian tracking on the user through the user tracking module, wherein the use of the user tracking module enables the invention to achieve better real-time capability and reduces the time for face recognition;
s3, obtaining the position and size of the face image through the face frame detection module, and giving the user the choice "whether to recognize as a designated face? "is the user selected" is recognized as a designated face? If yes, the human face frame detection module returns all different human face areas for pedestrian tracking after detecting the human faces in the data stream;
specifically, face frame detection is regarded as a special case of object detection: the task in object detection is to find the positions and sizes of all objects of a given class in an image, and face frame detection detects the single class of faces. In the invention, the face frame detection module is trained on a training data set annotated with rectangles, each annotated region being a rectangle containing a face, and, using a deep-learning-based object detection algorithm, the module outputs the coordinate values of the face rectangle, as in the sketch below;
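By way of illustration only, the following is a minimal sketch of the face frame detection step. OpenCV's bundled Haar cascade stands in here for the deep-learning detector described above; the cascade file and API are OpenCV's own, not part of the patent:

```python
# Illustrative sketch: OpenCV's stock Haar cascade stands in for the
# deep-learning face frame detector described in the patent.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_rectangles(frame):
    """Return every face rectangle (x, y, w, h) found in one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```

The largest-area rectangle from this list is the one used when the user declines designated-face recognition.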
S4, tracking the face of the user in all face regions output by the face frame detection module, comparing against the user faces registered in the user database by using the face recognition module, and judging whether the user is a registered user; if so, entering the face key point detection process, and if not, entering the new-user registration link;
specifically, face recognition can be summarized as follows: a face image to be verified is input; the face features of the image, including global features, local facial features and so on, are extracted and compared against the N face image features in a comparison library to find the feature with the highest similarity; that similarity is then compared with a preset threshold, and the identity information corresponding to the feature is output, as sketched below;
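As a hedged sketch of that comparison step, assuming faces have already been encoded as fixed-length feature vectors (the encoder, the 0.6 threshold and the database layout are illustrative assumptions, not specified by the patent):

```python
# Illustrative sketch: match one face feature vector against the
# registered-user database by cosine similarity; the threshold is assumed.
import numpy as np

def identify_user(query_vec, user_db, threshold=0.6):
    """user_db maps user_id -> feature vector; return best user_id or None."""
    best_id, best_sim = None, -1.0
    for user_id, ref_vec in user_db.items():
        sim = float(np.dot(query_vec, ref_vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(ref_vec)))
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    # Accept only if the best match clears the preset similarity threshold;
    # otherwise the caller falls through to the new-user registration link.
    return best_id if best_sim >= threshold else None
```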
s5, cutting the user face in the real-time data stream detected by the face detection module, inputting the cut face partial image into the face key point detection module, positioning key area information of the face by the face key point detection module, and outputting key point coordinates and rotation angles of the face in the face partial image;
specifically, the face key point detection module locates the positions of key regions of the face, including the eyebrows, eyes, nose, mouth, face contour and so on; the detection method used is one of the model-based ASM/AAM methods, the cascaded-shape-regression-based CPR method, or deep-learning-based methods; as these detection methods are prior art, they are not expanded upon here beyond the sketch below;
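For illustration, a minimal landmark sketch using dlib's 68-point shape predictor as one concrete instance of the prior-art methods listed above; dlib and its separately downloaded model file are assumptions, not named in the patent:

```python
# Illustrative sketch: dlib's 68-point predictor as one instance of the
# prior-art key point detectors named above (ASM/AAM, CPR, deep learning).
import dlib

detector = dlib.get_frontal_face_detector()
# The .dat model file must be downloaded separately from dlib.net.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_keypoints(gray_image):
    """Return a list of 68 (x, y) key point coordinates per detected face."""
    return [[(p.x, p.y) for p in predictor(gray_image, box).parts()]
            for box in detector(gray_image)]
```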
s6, the volume adjusting module estimates the distance range between the user and the sound box by obtaining the key point coordinates and the rotation angle of the face, the volume adjusting range of the volume adjusting module is set to eight levels and corresponds to eight different distance ranges, when the real-time distance of the user is unchanged in the corresponding level range, the volume is kept unchanged, and when the real-time distance of the user is changed in the corresponding level range, the volume is automatically adjusted to the volume corresponding to the corresponding distance range;
specifically, first, different speaker hardware devices correspond to different volume-distance relationships; the hardware selected for the corresponding speaker is tested in an acoustics laboratory, the corresponding volume-distance relationship is obtained, and that relationship is divided into eight levels so as to guarantee a stable speaker volume during adjustment; second, the volume adjustment module comprises a buffer adjustment mechanism to which a delay parameter is added, so that during volume adjustment the volume is not decreased or increased immediately but is adjusted according to the delay parameter; this slow adjustment of the volume avoids affecting the playback effect during adjustment and avoids damaging the hearing of a user at close range, as the sketch below illustrates.
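The sketch below illustrates a pinhole-model distance estimate from the eye key points, the eight-level mapping, and the delayed (buffered) adjustment. Every constant here is an assumption, since the patent derives the real volume-distance table from acoustics-lab tests of the chosen hardware:

```python
# Illustrative sketch: estimate user distance from face key points, map it
# to one of eight volume levels, and buffer changes behind a delay.
# All constants are assumptions; the patent's volume-distance relationship
# comes from laboratory testing of the specific speaker hardware.
import time

FOCAL_PX = 600.0    # assumed camera focal length, in pixels
EYE_GAP_M = 0.09    # assumed real-world gap between the tracked eye points

# Eight assumed distance ranges (upper bound in metres) -> volume percent.
LEVELS = [(0.5, 20), (1.0, 30), (1.5, 40), (2.0, 50),
          (3.0, 60), (4.0, 70), (5.0, 85), (float("inf"), 100)]

def estimate_distance(left_eye, right_eye):
    """Pinhole approximation: distance grows as the eye pixel gap shrinks."""
    gap = ((left_eye[0] - right_eye[0]) ** 2 +
           (left_eye[1] - right_eye[1]) ** 2) ** 0.5
    return FOCAL_PX * EYE_GAP_M / max(gap, 1e-6)

def volume_for(distance):
    for upper, volume in LEVELS:
        if distance <= upper:
            return volume

class BufferedVolume:
    """Commit a new volume only once it has been pending for `delay` seconds."""
    def __init__(self, delay=2.0):
        self.delay, self.current = delay, None
        self.pending, self.pending_since = None, 0.0

    def update(self, distance):
        target = volume_for(distance)
        if target == self.current:
            self.pending = None            # back inside the current level
        elif target != self.pending:
            self.pending, self.pending_since = target, time.time()
        elif time.time() - self.pending_since >= self.delay:
            self.current = target          # delay elapsed: apply the change
        return self.current
```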
Referring to fig. 1, when the face of the user is entered through the user face entry module, the user tracking module tracks the largest face target in the data stream, and stores the face of the registered user in the user database to complete the registration of the new user;
when the user tracking module does not perform pedestrian-tracking volume adjustment for the user: in step S3, if the user selects "no" for "recognize a designated face?", the face of the user is not registered, and after detecting the faces in the data stream the face frame detection module returns the face region with the largest area for pedestrian tracking and key point detection.
Referring to fig. 1, the method further comprises entering optional user information: the gender and age of the user are entered through the user information entry module and stored in the user database against the registered user; before the volume is adjusted through the volume adjustment module, the system inquires whether the user has entered this information and fine-tunes the volume accordingly, thereby assisting the volume adjustment module.
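As a sketch of that optional fine-tuning, with a purely assumed rule (the patent does not specify how gender and age modify the volume):

```python
# Illustrative sketch: optional fine-tuning of the level-mapped volume.
# The +15% boost for users over 60 is an assumed rule, not the patent's.
def fine_tune_volume(base_volume, profile):
    """profile is an optional dict such as {"gender": "F", "age": 67}."""
    if not profile:                      # user skipped the optional entry
        return base_volume
    if profile.get("age", 0) >= 60:     # assume older users prefer it louder
        return min(100, round(base_volume * 1.15))
    return base_volume
```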
Referring to fig. 1, the speaker is provided with a computer system, a high-definition camera and a sound input/output module; the computer system has sufficient computing power for data computation and process support, the sound input/output module is used for interacting with the user, and the high-definition camera is used for image acquisition.
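Pulling the sketched pieces together, a hedged end-to-end sketch of the main loop on the speaker's internal computer system, reusing the illustrative helpers above (face_keypoints, estimate_distance, BufferedVolume); the camera index, the eye landmark indices and set_speaker_volume are all hypothetical:

```python
# Illustrative end-to-end loop tying the sketches above together.
# Camera index 0, landmark indices 36/45 (outer eye corners in dlib's
# 68-point scheme) and set_speaker_volume() are assumptions.
import cv2

def run():
    cam = cv2.VideoCapture(0)         # high-definition camera
    vol = BufferedVolume(delay=2.0)   # buffered adjustment mechanism
    while True:
        ok, frame = cam.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for points in face_keypoints(gray):
            distance = estimate_distance(points[36], points[45])
            level = vol.update(distance)
            if level is not None:
                set_speaker_volume(level)   # hypothetical hardware hook
    cam.release()
```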
Firstly, the invention comprehensively applies mature face recognition, face key point detection and pedestrian tracking technologies: the speaker's internal system detects the user's face key points in real time and then intelligently adjusts the playback volume according to the key point information, freeing the user's hands and attention from the speaker. To achieve the expected effect, the pedestrian tracking technology tracks the user in the image stream, the user's partial face image is then determined, key point detection is performed on the face image at the corresponding moment, the relationships between the face key points are analyzed to estimate the real-time distance range between the user and the speaker, and finally the volume is adjusted with the assistance of personal information entered by the user, such as gender and age;
second, when a new user uses the invention, since pedestrian tracking and face recognition are involved, a voice prompt "recognize a designated face?" must be provided, and selecting "yes" registers and stores the new user's face; the invention can also choose not to track the user for volume adjustment, in which case "no" is selected at the voice prompt "recognize a designated face?" and the system performs pedestrian tracking on the largest detected face and adjusts the volume accordingly;
finally, the invention requires a buffer adjustment mechanism in the volume adjustment link, so as to avoid affecting the playback effect during adjustment and avoid damaging the hearing health of a user at close range.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or change of the technical solutions and inventive concepts of the present invention that can be readily conceived by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention.

Claims (7)

1. A dynamic volume adjustment method based on face feature point calculation is characterized by comprising the following steps:
s1, inputting the face of the user through a user face input module, and storing the face in a user database as the basic data of later pedestrian tracking and face recognition;
s2, carrying out real-time pedestrian tracking on the user through a user tracking module;
s3, obtaining the position and size of the face image through the face frame detection module, and giving the user the choice "whether to recognize as a designated face? "is the user selected" is recognized as a designated face? If yes, the human face frame detection module returns all different human face areas for pedestrian tracking after detecting the human faces in the data stream;
s4, tracking the face of the user in all face areas output by the face frame detection module, comparing the registered face of the user in the user database by using the face recognition module, judging whether the registered face is the registered user, if so, entering a face key point detection process, and if not, entering a new user registration link;
s5, cutting the user face in the real-time data stream detected by the face detection module, inputting the cut face partial image into the face key point detection module, positioning key area information of the face by the face key point detection module, and outputting key point coordinates and rotation angles of the face in the face partial image;
s6, the volume adjusting module estimates the distance range between the user and the sound box by obtaining the key point coordinates and the rotation angle of the face, the volume adjusting range of the volume adjusting module is set to eight levels and corresponds to eight different distance ranges, when the real-time distance of the user is unchanged in the corresponding level range, the volume is kept unchanged, and when the real-time distance of the user is changed in the corresponding level range, the volume is automatically adjusted to the volume corresponding to the corresponding distance range.
2. The method of claim 1, wherein when the face of the user is entered via the user face entry module, the user tracking module tracks the largest face target in the data stream, and stores the face of the registered user in the user database to complete the registration of the new user;
when the user tracking module does not perform pedestrian-tracking volume adjustment for the user: in step S3, if the user selects "no" for "recognize a designated face?", the face of the user is not registered, and after detecting the faces in the data stream the face frame detection module returns the face region with the largest area for pedestrian tracking and key point detection.
3. The method of claim 1, wherein the face key point detection module locates the positions of the key regions of the face, including the eyebrows, eyes, nose, mouth and face contour, using a detection method that is one of the model-based ASM/AAM methods, the cascaded-shape-regression-based CPR method, and deep-learning-based methods.
4. The method of claim 1, further comprising entering optional user information: the gender and age of the user are entered through the user information entry module and stored in the user database against the registered user, and before the volume is adjusted through the volume adjustment module the system inquires whether the user has entered this information and fine-tunes the volume accordingly.
5. The method of claim 1, wherein the volume adjustment module comprises a buffer adjustment mechanism to which a delay parameter is added, so that the volume is not immediately decreased or increased during volume adjustment but is adjusted according to the delay parameter.
6. The method of claim 1, wherein the face frame detection module is trained on a training data set annotated with rectangles, each annotated region being a rectangle containing a face portion, and the face frame detection module outputs the coordinates of the face rectangular region using a deep-learning-based object detection algorithm.
7. The method of claim 1, wherein the speaker is internally provided with a computer system, a high-definition camera and a sound input/output module, the computer system being used for data computation and process support, the sound input/output module for interaction with the user, and the high-definition camera for image capture.
CN202111546566.0A (filed 2021-12-16, priority 2021-12-16) Dynamic volume adjusting method based on face characteristic point calculation, Pending, published as CN114253502A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111546566.0A | 2021-12-16 | 2021-12-16 | Dynamic volume adjusting method based on face characteristic point calculation

Publications (1)

Publication Number | Publication Date
CN114253502A (en) | 2022-03-29

Family

ID=80792723

Family Applications (1)

Application Number | Status | Publication | Title
CN202111546566.0A | Pending | CN114253502A (en) | Dynamic volume adjusting method based on face characteristic point calculation

Country Status (1)

Country Link
CN (1) CN114253502A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105430501A (en) * 2015-12-03 2016-03-23 青岛海信电器股份有限公司 Volume adjustment method and system
CN112380972A (en) * 2020-11-12 2021-02-19 四川长虹电器股份有限公司 Volume adjusting method applied to television scene

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US20190333507A1 (en) Query endpointing based on lip detection
JP5323770B2 (en) User instruction acquisition device, user instruction acquisition program, and television receiver
Shin et al. Real-time lip reading system for isolated Korean word recognition
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
CN108648760B (en) Real-time voiceprint identification system and method
Tao et al. End-to-end audiovisual speech activity detection with bimodal recurrent neural models
KR20080050994A (en) System and method for integrating gesture and voice
Yargıç et al. A lip reading application on MS Kinect camera
CN111326152A (en) Voice control method and device
Grzeszick et al. Temporal acoustic words for online acoustic event detection
CN113177531B (en) Speech recognition method, system, equipment and medium based on video analysis
KR101950721B1 (en) Safety speaker with multiple AI module
KR20210066774A (en) Method and Apparatus for Distinguishing User based on Multimodal
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
CN114253502A (en) Dynamic volume adjusting method based on face characteristic point calculation
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
Lombardi A survey of automatic lip reading approaches
Yau et al. Visual speech recognition using motion features and hidden markov models
CN114911449A (en) Volume control method and device, storage medium and electronic equipment
Zhang The algorithm of voiceprint recognition model based DNN-RELIANCE
Neti et al. Joint processing of audio and visual information for multimedia indexing and human-computer interaction.
CN114466179A (en) Method and device for measuring synchronism of voice and image
Pooventhiran et al. Speaker-independent speech recognition using visual features
Petsatodis et al. Voice activity detection using audio-visual information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination