WO2018018906A1 - Voice access control and quiet environment monitoring method and system - Google Patents

Voice access control and quiet environment monitoring method and system

Info

Publication number
WO2018018906A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speaker
turning point
recognition
information
Prior art date
Application number
PCT/CN2017/077792
Other languages
English (en)
French (fr)
Inventor
全小虎
李明
蔡泽鑫
Original Assignee
深圳市鹰硕音频科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市鹰硕音频科技有限公司
Publication of WO2018018906A1 publication Critical patent/WO2018018906A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Definitions

  • The invention relates to a voice access control and quiet environment monitoring method and system, used for voice-based identity verification at the entrance to a closed environment and for monitoring quietness within that closed environment, in particular a method and system for monitoring a student dormitory environment during sleep hours, when visual observation is not possible.
  • Voice has gradually become the most natural, most convenient and most effective tool for people to communicate with the outside world, and is also one of the main information carriers of people's daily lives.
  • With the development of the mobile Internet and smart devices, human society is progressively entering the information age.
  • Intelligent speech technology has gradually stood out from among the many pattern recognition technologies and plays an increasingly important role.
  • Speech-related technologies are gradually being integrated into social platforms, e-commerce, smart wearables, smart homes and even the financial industry, where they play an important role. This makes it possible to use speech technology to ease the pressure of dormitory management.
  • CN102708867A (published October 3, 2012) discloses a voiceprint- and speech-based identity recognition method and system that resists recording-based impersonation, usable in the field of identity authentication. Specifically, a fixed text carrying a user ID is generated and concatenated with a random text into a prompt text; the user's speech reading the prompt text is collected; the user's voiceprint model and speech model are built; and the fixed text with the user ID, the voiceprint model and the speech model are stored.
  • For example, the fixed text with the user ID is 4-7 Chinese characters.
  • CN204791241U (published November 18, 2015) discloses a voice-interactive access control system mounted on a door, comprising an access controller and an electronic lock.
  • The access controller includes a microphone, a wireless network module, a camera and the like, and runs an Android or Windows operating system.
  • The access controller periodically reads an ultrasonic sensor and the door's magnetic-contact state. When the sensor detects someone lingering in front of the door, the system automatically lights up the touch display and plays a greeting through the loudspeaker.
  • The microphone then waits to receive the user's speech and sends it to the voiceprint recognition module.
  • CN102760434A (published October 31, 2012) discloses a method and a terminal for updating a voiceprint feature model: an original audio stream containing at least one speaker is acquired; according to a preset speaker segmentation and clustering algorithm, the separate audio stream of each of the at least one speaker in the original audio stream is obtained; each speaker's audio stream is matched against the original voiceprint feature model; and the successfully matched audio streams are obtained.
  • CN104376619A discloses a monitoring method applied to a first device, which is mounted on or outside the door and has a first acquisition unit. First, the image and sound information outside the door are collected by the first device.
  • The first acquisition unit may be an image or sound acquisition device. When a visitor enters a certain area in front of the security door and the first acquisition unit captures the visitor's arrival, it records audio and shoots in real time, and transmits the image and sound information to an information processing apparatus installed in the first device, which judges the visitor's identity.
  • The ease of signal acquisition often determines a product's cost, its usability and the user's immediate experience.
  • Given the ubiquity of microphones, voice is comparatively the easiest signal to acquire and transmit.
  • The acquisition process is also very simple, and in practical applications the cost of a sound-card microphone is extremely low.
  • Endpoint detection of active speech signals is already widely used, while speaker segmentation-clustering and speaker recognition, as the most effective speech analysis technologies, enable labor-saving, highly reliable automatic monitoring of a quiet dormitory environment.
  • For example, in nighttime care of hospital in-patients, when other means are inconvenient, a patient's direct call is most effective; with voice recognition and monitoring, the patient can be identified from the calling voice alone, giving medical staff quick guidance.
  • The invention is mainly applied to monitoring the quiet environment of school boarding students' dormitories (rest periods such as sleep time), but its application scenarios are not limited to this: the scenarios and methods of the present invention apply to any closed environment that requires identity verification for entry and whose quietness needs to be monitored.
  • Because each user (student) reads a different prompt text every time he or she passes the access control system's speech recognition process, the method and system of the present invention collect the user's voiceprint information and gradually build each user's voiceprint model, without dedicated voiceprint model training, which improves efficiency and saves labor costs.
  • The invention also improves the segmentation-clustering method, raising the efficiency and accuracy of clustering.
  • The invention also provides improvements in other related respects.
  • The invention further improves recognition efficiency and accuracy by managing the information of the people in a fixed space.
  • The invention provides a method for intelligent voice access control and quiet environment monitoring of a student dormitory based on speech recognition and voiceprint recognition, comprising the following steps:
  • a voice access recognition step for performing voice verification at the door, performing speech recognition and then voiceprint recognition on the collected audio of the person to be verified;
  • the voice access recognition step further includes:
  • ad) recording the read-aloud audio; speech recognition first determines whether the correct character string was spoken, and voiceprint verification then determines whether the speaker is a valid registrant, thereby deciding whether to open the door.
  • the quiet environment monitoring step further includes:
  • the speaker segmentation cluster analysis includes a speaker segmentation step, a speaker clustering step, and a voiceprint recognition step;
  • the speaker segmentation step is used to find a turning point of the speaker switching, including detection of a single turning point and detection of multiple turning points;
  • the single turning point detection includes distance-based sequential detection, cross detection, and turning point confirmation;
  • The multiple-turning-point detection is used to find the multiple speaker turning points in the whole speech and is performed on the basis of single-turning-point detection, as follows:
  • Step 1) First set a relatively large time window, 5-15 seconds long, and perform single-turning-point detection within the window;
  • Step 2) If no speaker turning point was found in the previous step, move the window 1-3 seconds to the right and repeat step 1, until a speaker turning point is found or the speech segment ends;
  • Step 3) If a speaker turning point is found, record it, set the window's starting point to this turning point, and repeat steps 1)-2).
  • In the turning point confirmation formula, sign(·) is the sign function and d_cross is the distance value at the intersection of the two distance curves;
  • d(i) in the formula is the distance computed within the end region (from the start of the speaker's distance curve to the crossing point). If the final result is positive, the point is accepted as a speaker turning point; if negative, it is rejected.
  • The pop-up verification string is a randomly generated multi-character string, so the information to be read aloud differs each time.
  • the endpoint detection is implemented by a 360 degree ring microphone array to ensure the sensitivity of the audio acquisition and the quality of the acquired audio.
  • The voice access control step further includes step ae):
  • each read-aloud audio is saved as voiceprint-model training audio for the verified person, until that person's voiceprint model is successfully built.
  • The voiceprint model of step be) is trained on the audio data saved in step ae).
  • When voiceprint verification is triggered, facial image acquisition starts at the same time and captures the facial image of the person to be verified; once obtained, the facial image is compared in the central processing step to obtain the person's information, and the collected voice signal is associated with the registration information to form an association database.
  • After the person to be verified enters the closed environment, his or her information is activated.
  • For those who have registered but did not enter the dormitory, the system does not activate their information but sends it to the manager.
  • If no match is found among the activated people, the comparison is extended to all registered people; if that comparison succeeds, a prompt indicating unauthorized entry or failure to check in properly is generated;
  • In each unit of the closed environment there are provided: at least one ring microphone array;
  • an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and
  • a sound playback device for communicating with people in the monitored environment.
  • The central processing step sends and displays the identity information, the corresponding audio data and the time of occurrence to the manager, and transmits them to a monitoring device associated with the system backend or the central processing step, so that the monitor can manage matters intuitively and conveniently and readily take corresponding measures.
  • a voice access control and quiet environment monitoring system including a voice access control module, a quiet environment monitoring module, and a central processing module.
  • The voice access recognition module is configured to perform voice verification at the door, performing speech recognition and then voiceprint recognition on the collected audio of the person to be verified;
  • the quiet environment monitoring module is configured to perform voice monitoring in a quiet environment, and includes endpoint detection, speaker segmentation clustering, and voiceprint recognition in sequence;
  • the voice access recognition module and the quiet environment monitoring module are both connected to the central processing module.
  • the quiet environment monitoring module further includes a speaker segmentation module, a speaker clustering module, and a voiceprint recognition module;
  • the speaker segmentation module is configured to find a turning point of speaker switching, including detection of a single turning point and detection of multiple turning points;
  • the single turning point detection includes distance-based sequential detection, cross detection, and turning point confirmation;
  • The multiple-turning-point detection is used to find the multiple speaker turning points in the whole speech and is performed on the basis of single-turning-point detection, as follows:
  • Step 1) First set a relatively large time window, 5-15 seconds long, and perform single-turning-point detection within the window;
  • Step 2) If no speaker turning point was found in the previous step, move the window 1-3 seconds to the right and repeat step 1, until a speaker turning point is found or the speech segment ends;
  • Step 3) If a speaker turning point is found, record it, set the window's starting point to this turning point, and repeat steps 1)-2).
  • In the turning point confirmation formula, sign(·) is the sign function and d_cross is the distance value at the intersection of the two distance curves;
  • d(i) in the formula is the distance computed within the end region. If the final result is positive, the point is accepted as a speaker turning point; if negative, it is rejected.
  • the voice access recognition module is disposed outside the door of the closed environment, and includes a microphone for collecting audio, a button for triggering the door recognition, and a display device for displaying a character string.
  • the voice access recognition module further includes a voice playback device that interacts with the to-be-verified person;
  • the infrared detection unit is used in place of the button so that system verification is automatically turned on when the person to be verified approaches.
  • The voice access recognition module further includes a facial image acquisition device for capturing the facial image of the person to be verified.
  • The voice access recognition module further includes an interface for connecting a mobile terminal; after the mobile terminal is connected through the interface, the functions of the microphone, the button, the display device and the facial image acquisition device are implemented by the mobile terminal's microphone, on-screen virtual buttons, display screen and camera.
  • the mobile terminal is installed with an APP or PC software client that implements a voice access control function.
  • The mobile terminal is connected to the door opening and closing system by wire or wirelessly, to decide whether to open or close the door according to the result of the verification.
  • Before entering, the person to be verified presses the access recognition button to start speech recognition; the facial image acquisition device switches on simultaneously and captures the person's facial image, which is sent to the central processing module for comparison to obtain the person's registration information, and the collected voice signal is associated with the registration information to form an association database.
  • After a person to be verified enters the closed environment, the system activates that person's information. For those who have registered but did not enter the dormitory, the system does not activate their information but sends it to the system administrator.
  • When comparing, the system first compares against this activated information;
  • if no match is found among the activated people, the comparison is extended to all registered people; if that comparison succeeds, a prompt indicating unauthorized entry or failure to check in properly is generated;
  • if no comparison succeeds, a warning of illegal intrusion is generated, and the administrator can confirm the information through voice interaction.
  • The quiet environment monitoring module is arranged in each unit of the closed environment and includes at least one ring microphone array;
  • an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and
  • a sound playback device for communicating with people in the monitored environment.
  • The central processing module may be arranged separately in the system backend, may be integrated with the voice access recognition module, or may be integrated with the quiet environment monitoring module, and is used to process and display the monitoring information obtained by the quiet environment monitoring module.
  • The central processing module sends and displays the identity information, the corresponding audio data and the time of occurrence to the manager, and transmits them to a monitoring device connected to the system backend or the central processing module, so that the monitor can manage matters intuitively and conveniently and readily take corresponding measures.
  • With the advanced technology of speech, the intelligent dormitory access control and automatic quiet environment monitoring system of the invention makes the collection of access and monitoring data safe, convenient and simple, makes the monitored indicators intuitive and effective, and helps make school dormitory management simple yet reliable and effective.
  • FIG. 1 is a schematic diagram of the system architecture in accordance with the present invention.
  • FIG. 2 is a schematic diagram of the voice access recognition step in accordance with the present invention.
  • FIG. 3 is a schematic diagram of the quiet environment monitoring step in accordance with the present invention.
  • FIG. 4 is a schematic diagram of another voice access recognition step in accordance with the present invention.
  • FIG. 5 is a schematic diagram of speech model training in accordance with the present invention.
  • FIG. 6 is a schematic diagram of speech model construction in accordance with the present invention.
  • FIG. 7 is a schematic diagram of speech model association in accordance with the present invention.
  • FIG. 8 is a schematic diagram of voice verification in accordance with the present invention.
  • FIG. 9 is a schematic diagram of the voiceprint model training steps in accordance with the present invention.
  • FIG. 10 is a schematic diagram of i-vector training in accordance with the present invention.
  • FIG. 11 is a schematic diagram of a conventional fixed beamforming system in the prior art.
  • FIG. 12 is a schematic diagram of the time intervals used to compute each channel's optimal delay in the beamforming method of the present invention.
  • FIG. 13 is a schematic diagram of the speaker segmentation-clustering workflow in accordance with the present invention.
  • FIG. 14 is a flowchart of single turning point detection in accordance with the present invention.
  • FIG. 15 is a schematic diagram of distance-based sequential detection in accordance with the present invention.
  • FIG. 16 is a sequential-detection distance curve in accordance with the present invention.
  • FIG. 17 is a schematic diagram of finding the second speaker's speech template in accordance with the present invention.
  • FIG. 18 is a schematic diagram of cross-detecting the speaker turning point in accordance with the present invention.
  • FIG. 19 is a schematic diagram of erroneous turning point detection in accordance with the present invention.
  • FIG. 20 is a schematic diagram of turning point confirmation in accordance with the present invention.
  • FIG. 21 is a block diagram of the IHC algorithm in accordance with the present invention.
  • The voice access control and quiet environment monitoring system of the present invention comprises a voice access recognition module, a quiet environment monitoring module and a central processing module, the voice access recognition module and the quiet environment monitoring module both being connected to the central processing module.
  • The central processing module can control the two modules; they may be connected by wire or wirelessly, over a wired or a wireless network.
  • the voice access recognition module is disposed outside the door of the closed environment, and includes a microphone for collecting audio, a button for triggering the door recognition, a display device for displaying a character string, and a face image collecting device.
  • the voice access recognition module may further comprise a voice playback device that interacts with the to-be-verified person.
  • the microphone may be a mono microphone, usually disposed outside the door to facilitate the collection of access control voice data, and the microphone may also be a microphone of other mobile devices such as a mobile phone.
  • the button may be a touch button or may be replaced with an infrared detection unit to automatically turn on system verification when the person to be verified approaches.
  • the display device may be a variety of commonly used displays or display screens, or a display screen of a mobile phone or other mobile device for displaying a character string and other various prompt information to the user.
  • The facial image acquisition device may be a camera or video camera; it may be installed separately, or the camera of a mobile phone or other mobile device may be used.
  • the voice playing device may be a separately set speaker, or may be a sound playing device of a mobile phone or other mobile device.
  • Control of the access control system can be realized through a networked mobile terminal such as a smartphone, without separately installing recognition and verification devices for the access control system.
  • A mobile device with a voice access control APP installed, such as a smartphone, can serve as the recognition and verification device: the smartphone's microphone, camera, screen, buttons and so on can be called upon to play the corresponding roles, and the smartphone connects to the central processing module over a network, such as a wireless network.
  • The mobile terminal, such as a mobile phone, connects to the door opening and closing system by wire or wirelessly, e.g. via Bluetooth, to decide whether to open or close the door according to the verification result.
  • Mobile terminals are especially suitable for temporarily enclosed environments, such as temporary dormitories, or for emergencies after the access control system has been damaged.
  • An interface can be reserved, even alongside a normal access control system, for connecting a mobile terminal such as a smartphone.
  • Before entering, the person to be verified presses the access recognition button to start speech recognition; the facial image acquisition device switches on simultaneously and captures the person's facial image, which is sent to the central processing module for comparison to obtain the person's registration information, and the collected voice signal is associated with the registration information to form an association database.
  • After a person to be verified enters the closed environment, the system activates that person's information. For those who have registered but did not enter the dormitory, the system does not activate their information but sends it to the system administrator.
  • The information of these entrants is activated so that voice information can be identified and compared more easily during the monitoring phase.
  • When comparing, the system first compares against this activated information.
  • Throughout the verification process, the loudspeaker can give the user various prompts or instructions.
  • Optionally, frequently used identity cards, such as common passes and employee cards, can be used for identification, replacing or supplementing the facial recognition device.
  • the quiet environment monitoring module is disposed in each unit of the enclosed environment, such as in each student dormitory, including at least one annular microphone array. Further, an ambient brightness recognition unit may be further included for detecting the brightness of the dormitory environment and automatically turning the monitoring on or off. Further, a sound playing device that communicates with a person in the monitored environment may also be included.
  • the circular microphone array may be a 360-degree circular microphone array, which may be disposed at a central position of the indoor ceiling or other suitable position, so as to conveniently and accurately collect and monitor the voice signal.
  • The quiet environment is a dormitory or other closed environment; monitoring is switched on mainly under non-visual or low-light conditions, though it can of course also be used during fixed daytime periods with good light.
  • the central processing module may be separately disposed in the background of the system, may be integrally configured with the voice access control module, or may be integrally configured with the quiet environment monitoring module, and may process and display the monitoring information obtained by the quiet environment monitoring module.
  • According to the source of the collected voice data, e.g. a particular dormitory, the central processing module retrieves the registered and activated voice models of that dormitory and performs a fast comparison, maximizing recognition speed and accuracy. If no matching person is found among the activated people, the comparison is extended to all registered people; if that comparison succeeds, a prompt indicating unauthorized entry or failure to check in properly is generated. If no comparison succeeds, a warning of illegal intrusion is generated, and the administrator can confirm the information through voice interaction.
  • Optionally, abnormal-sound models are stored in the system for handling abnormal, non-speech sounds, such as the sound of a broadcast football match, a basketball game, music playback, or calling sounds such as cries for help, shouts or fire alarms, so that safety protection can also be provided in an emergency.
  • The central processing module sends and displays the identity information, the corresponding audio data and the time of occurrence to the administrator. For example, the noisy time periods, the noise level and the identities of the noise-makers are transmitted to a monitoring device connected to the system backend or the central processing module, so that the monitor can manage matters intuitively and conveniently and readily take corresponding measures.
  • The administrator can receive this information through an APP client or a PC software client, or have it displayed on a configured display or monitoring screen.
  • In the system of the present invention, the voice access control module, the quiet environment monitoring module and the central processing module are integrated in an ARM-based embedded Linux system.
  • the voice access control and quiet environment monitoring method of the present invention includes the following steps:
  • a voice access control identification step for performing voice verification before the access control, and performing voice recognition and voiceprint recognition on the collected audio of the person to be verified;
  • Quiet environment monitoring steps for voice monitoring in a quiet environment including endpoint detection, speaker segmentation clustering, and voiceprint recognition.
  • the voice access control identification step further includes:
  • aa) The person to be verified triggers voiceprint verification, e.g. by pressing the access recognition button, by infrared automatic sensing, or by swiping a pass card;
  • ab) a verification string pops up; the verification string is a randomly generated multi-character string, so the information verified each time is not fixed;
  • ad) the read-aloud audio is recorded; speech recognition first determines whether the correct character string was spoken, and voiceprint verification then determines whether the speaker is a valid registrant, thereby deciding whether to open the door.
  • Optionally, the voiceprint models of the registrants may be trained in advance, in which case a valid verifier is judged to be one of the previously registered persons.
  • However, collecting voiceprints from, or registering, a large number of students centrally is usually time-consuming and labor-intensive, may be inaccurate, and requires repeated operation. The present invention therefore preferably builds each person's speech model step by step by collecting and saving the audio of the person to be verified each time the verification string is read. For each verified person, each read-aloud audio is saved as voiceprint-model training audio until that person's voiceprint model is successfully built.
  • the quiet environment monitoring step further includes:
  • the quiet environment monitoring module is automatically activated during the nighttime when the lights are turned off or any other student rest period, and the monitoring mode is turned on;
  • an indoor brightness detecting unit may be configured to automatically switch the monitoring module according to the brightness of the room;
  • bb) Endpoint detection is started to judge whether the environment is quiet, e.g. monitoring by voice endpoint detection whether anyone in the dormitory is talking or making noise; the endpoint detection is implemented with a 360-degree ring microphone array to guarantee the sensitivity of audio acquisition and the quality of the collected audio;
  • The voiceprint model is trained on the audio data saved in step ae);
  • bf) the identity information, the corresponding audio data and the time of occurrence are sent and displayed to the administrator; for example, the noisy time periods, the noise level and the identities of the noise-makers are transmitted to a monitoring device connected to the system backend or the central processing step, so that the monitor can manage matters intuitively and conveniently and readily take corresponding measures.
  • The monitoring method and system of the present invention can also be used for other related services, especially voice services in a non-visual environment, such as a call for help in a student dormitory emergency: the system can acquire and analyze the caller's audio and provide alarm or warning services to the manager.
  • The monitoring device can transmit this information through a transmitting device in the form of text messages, voice mail or pictures, e.g. by SMS, MMS or WeChat.
  • In the voice access recognition step ad), recognizing a random digit string, as opposed to a fixed text, helps prevent an impostor from passing the access verification with a recording.
  • The speech recognition is performed on read-aloud audio acquired by the same microphone as the voice access control, or collected directly by that microphone; using the same microphone for acquisition reduces the impact of channel differences on the recognition result.
  • The voiceprint recognition technique used in step be) of the quiet environment monitoring is the same as that employed in the voice access control, and includes the following steps:
  • The model training step uses a large amount of labeled speaker data in advance to train the global models of the text-independent speaker verification system. This step is done offline, before the registration and verification steps.
  • The speaker data can be obtained by collecting each valid read-aloud audio.
  • The system thus gradually and continuously refines the trained models, so the accuracy of recognition can keep improving.
  • The registration step uses the trained voiceprint model to add a new target voiceprint registrant to the model database.
  • The verification step compares the voice data of the speaker to be verified, processed as in the registration step, against the models of the students in the corresponding dormitory, determines whether the speaker is one of them, and then decides whether verification passes.
  • The relevant information is activated to facilitate its use during monitoring, which improves recognition speed and accuracy.
  • the present invention employs an i-vector/PLDA text-independent speaker confirmation method.
  • the voiceprint model training includes: (1) MFCC feature extraction, (2) GMM-UBM modeling, (3) i-vector extractor training, and (4) PLDA training.
  • The parameters shown in Fig. 9, namely the UBM (λ), the total variability matrix (T) and the PLDA parameters (Φ, Σ), are trained in this first training step; together they are also known as the voiceprint model.
  • The present invention adopts the MFCC (Mel-frequency cepstral coefficient) feature vector as the speech feature parameter.
  • The UBM is a universal background model trained on the speech feature parameters (MFCC) of a large number of speakers of various types.
  • The present invention models it using GMM-UBM (Gaussian mixture model - universal background model).
  • The GMM-UBM can be expressed as a linear weighting of M D-dimensional Gaussian density functions, where M (the number of Gaussians) and D (the MFCC dimension) are set or known in advance: p(x_i) = Σ_{j=1}^{M} ω_j p_j(x_i).
  • x_i represents the i-th feature vector, and each p_j is a multidimensional normal distribution: p_j(x) = N(x; μ_j, Σ_j).
  • The model parameters are estimated with the EM (Expectation-Maximization) algorithm.
  • The MFCC feature vectors of the extracted speech are projected onto each Gaussian component of the GMM-UBM model and averaged over the time domain to obtain the corresponding Baum-Welch statistics.
  • The specific calculation is: N_c = Σ_t γ_t(c) and F_c = Σ_t γ_t(c)(x_t - m_c), where γ_t(c) is the posterior probability of component c for frame x_t and m_c is the mean of component c.
  • The dimension C of N equals the number of Gaussian mixture components M.
  • In the i-vector framework, the speaker-dependent GMM mean supervector M is modeled as M = m + Tx, where m is the UBM mean supervector; this T is a matrix, the parameter that the i-vector extractor needs to train.
  • Σ is a diagonal covariance matrix of dimension CD x CD, modeling the residual variability not captured by T.
  • The EM algorithm (GMM-UBM training also used a similar algorithm) estimates T to obtain the optimal T.
  • This x is the i-vector feature vector that needs to be extracted.
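  • For illustration, the following is a minimal sketch of this front end, not the patent's implementation: MFCC extraction, UBM training and the Baum-Welch statistics described above, with librosa and scikit-learn standing in for the unspecified tooling. The file names, the 512-component mixture and the 20-dimensional MFCCs are assumed values.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=20):
    """Load audio at 16 kHz and return frame-level MFCCs, shape (T, D)."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# 1) Train the UBM on pooled MFCCs from many background speakers
#    (hypothetical file list).
background = np.vstack([extract_mfcc(p) for p in ["spk1.wav", "spk2.wav"]])
ubm = GaussianMixture(n_components=512, covariance_type="diag").fit(background)

# 2) Zeroth- and first-order Baum-Welch statistics for one utterance:
#    N_c = sum_t gamma_t(c),  F_c = sum_t gamma_t(c) * (x_t - m_c).
def baum_welch_stats(ubm, feats):
    gamma = ubm.predict_proba(feats)                # (T, C) posteriors
    N = gamma.sum(axis=0)                           # (C,)
    F = gamma.T @ feats - N[:, None] * ubm.means_   # (C, D), mean-centered
    return N, F
```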
  • PLDA (probabilistic linear discriminant analysis) uses speaker-labeled data for training and strictly separates between-speaker differences from within-speaker differences.
  • The j-th i-vector of the i-th person in the training data is denoted η_ij.
  • The PLDA method considers that i-vector data can be generated from a latent variable in a low-dimensional space, expressed as: η_ij = μ + Φ β_i + ε_ij.
  • The term Φ β_i is described by the between-speaker difference subspace; its value depends only on the speaker's identity, i.e. it is the same for the same person.
  • ε_ij is a noise term. Its value is related not only to the speaker's identity but also to other factors that cause within-speaker differences, so it differs from utterance to utterance.
  • The EM algorithm is used to estimate the parameters, yielding the optimal values of [Φ, Σ]. After these parameters are obtained, β can be computed according to formula (10).
  • The processing sequence is: original speech -> MFCC -> i-vector -> β.
  • The score is then compared with a preset threshold to decide whether the two utterances come from the same speaker.
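  • As a worked restatement of the decision rule: under the PLDA model above, two i-vectors η1 and η2 are scored with the log-likelihood ratio between the same-speaker and different-speaker hypotheses. The patent does not spell out the exact form, so this is the usual textbook expression:

```latex
\eta_{ij} = \mu + \Phi\,\beta_i + \varepsilon_{ij}, \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \Sigma)

\mathrm{score}(\eta_1, \eta_2)
  = \log \frac{p(\eta_1, \eta_2 \mid \text{same speaker})}
              {p(\eta_1 \mid \text{diff})\, p(\eta_2 \mid \text{diff})}
  \;\gtrless\; \theta
```

If the score exceeds the threshold θ, the two utterances are judged to come from the same speaker.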
  • A 360-degree microphone array is used to collect voice data sensitively and accurately.
  • Environmental factors such as reverberation and background noise have a large impact, and most of the collected speech is noisy.
  • Since recognition is highly sensitive to the purity of the speech, an array composed of multiple microphones is used to process the channel signals from different directions in time and space, which improves the signal-to-noise ratio and yields clearer voice data.
  • The microphone array is used to enhance the signal-to-noise ratio, mainly by Wiener filtering and beamforming.
  • Wiener filtering removes noise by filtering the data collected by each microphone.
  • The invention adopts the Wiener filtering algorithm to denoise the signal collected by each microphone that is polluted by stationary noise.
  • Beamforming is the process of delay-aligning and superimposing the signals of the individual microphones.
  • Figure 11 is a schematic diagram of a conventional fixed beamforming system.
  • The conventional system comprises delay compensation followed by weighted summation, which can be described by equation (15): y(n) = Σ_{i=1}^{M} ω_i x_i(n - Δt_i), where
  • y(n) represents the signal after beamforming,
  • M is the number of microphones,
  • ω_i is the weight of the i-th microphone, and
  • Δt_i represents the time difference of arrival from the source between the i-th microphone element and the array reference element.
  • In the conventional fixed beamforming method, time compensation is first applied to the signals received by the individual microphones of the array to synchronize the voice signals of the channels; the channel signals are then weighted and averaged, where the weighting coefficient ω_i is a fixed constant, usually 1/M, which is why the traditional method is called fixed beamforming.
  • The time delay compensation Δt_i only changes the phase of the received signal: it cancels the arrival delays of the sound waves at microphones in different positions along the receiving direction and synchronizes the voice signals of the channels so that their contributions to the summed output are the same.
  • The invention builds on the conventional fixed beamforming method and optimizes it in three respects: (1) selection of the reference channel, (2) calculation of the N best delays for each channel, and (3) a dynamic channel-weight calculation method instead of a fixed 1/M.
  • The beamformed output is y[n] = Σ_{m=1}^{M} W_m[n] x_m[n - TDOA(m,ref)[n]], where W_m[n] is the relative weight of the m-th microphone at time n, and the weights sum to 1 at each time n.
  • x_m[n] is the signal received by the m-th channel at time n.
  • TDOA(m,ref)[n] is the delay of the m-th channel relative to the reference channel, used to align the signals at time n.
  • TDOA(m,ref)[n] is computed once every few frames by the cross-correlation method; the cross-correlation delay estimation method used here is GCC-PHAT (generalized cross-correlation with phase transform).
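  • A minimal sketch of this weighted delay-and-sum step follows, under assumed simplifications: integer-sample TDOAs, precomputed weights, and np.roll instead of proper zero-padded shifting:

```python
import numpy as np

def delay_and_sum(channels, tdoas, weights):
    """channels: (M, L) array of microphone signals;
    tdoas: per-channel integer sample delays relative to the reference;
    weights: per-channel weights W_m that should sum to 1."""
    M, L = channels.shape
    out = np.zeros(L)
    for m in range(M):
        aligned = np.roll(channels[m], -tdoas[m])  # cancel the arrival delay
        out += weights[m] * aligned                # weighted summation
    return out
```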
  • The optimized beamforming algorithm used in the present invention can automatically find the microphone channel that receives the best-quality signal from the sound source and use it as the reference channel.
  • M is the total number of channels of the microphone array.
  • K = 200 (the audio file is divided into 200 segments), and after each calculation the values are averaged over the K segments.
  • Xcorr[i,j;k] represents the cross-correlation peak of channel i and channel j on the k-th segment.
  • The channel with the largest average value is selected as the reference channel.
  • When computing the TDOA (time delay of arrival), the sizes of the analysis window and the analysis segment need to be balanced: an overly large analysis window or analysis segment reduces the accuracy of the TDOA, while an overly small analysis window reduces the robustness of the whole algorithm and increases the computational complexity of the system without improving the quality of the output signal.
  • The sizes of the analysis window and analysis segment are usually determined empirically; the algorithm performs well with a 500 ms analysis window and a 250 ms analysis segment.
  • The GCC-PHAT cross-correlation of signal i and the reference signal ref is R_{i,ref}(d) = F^{-1}[ X_i(f) X_ref(f)^* / |X_i(f) X_ref(f)^*| ], where X_i(f) and X_ref(f) are the Fourier transforms of the two signals, F^{-1} denotes the inverse Fourier transform, ^* denotes the complex conjugate, and |·| denotes the modulus; the values of this cross-correlation function range from 0 to 1.
  • In each pair of analysis windows, the N largest cross-correlation peaks are computed; N is set to 4 here (other values may be used), and the most appropriate delay is selected from the N best delays before the weighted summation is performed.
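  • The following is a hedged numpy sketch of GCC-PHAT delay estimation between one channel and the reference channel, returning the N best candidate delays as described above; the 16 kHz rate and 10 ms delay bound are assumed values:

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=0.01, n_best=4):
    """Return the n_best candidate delays (in seconds) of sig vs. ref."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n)
    Xref = np.fft.rfft(ref, n=n)
    cross = X * np.conj(Xref)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)   # PHAT weighting
    max_shift = int(fs * max_tau)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))  # center lag 0
    peaks = np.argsort(r)[-n_best:][::-1]                    # N best peaks
    return (peaks - max_shift) / fs
```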
  • The purpose of the endpoint detection in step bc) is to separate the parts with speech from the silent parts of the collected audio signal; the present invention adopts an endpoint detection method based on short-time energy. In a closed environment such as a student dormitory there is generally no other loud noise, so the signal-to-noise ratio of the collected signal is relatively high; an endpoint detection method based on short-time energy is simpler to implement while ensuring detection accuracy, and makes lower demands on the hardware.
  • Let s(l) be the sampled time-domain signal of a piece of audio, and let S_n(m) be the m-th sample of the n-th frame after windowing. With E(n) denoting the short-time energy of the n-th frame: E(n) = Σ_{m=0}^{N-1} S_n(m)^2, where n is the frame index and N is the number of samples in each frame.
  • After the short-time energy of each frame has been calculated, each frame is judged to be a silent frame or a speech frame by comparison with a threshold value set in advance: the silent parts of the signal have lower energy, and the parts with speech have higher energy.
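  • A minimal sketch of this short-time-energy endpoint detection follows; the frame length, hop and threshold are assumed values, not taken from the patent:

```python
import numpy as np

def detect_speech_frames(signal, frame_len=400, hop=160, threshold=1e-3):
    """Return a boolean array: True where frame energy E(n) > threshold."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window  # windowed S_n(m)
        energies[i] = np.sum(frame ** 2)                      # E(n)
    return energies > threshold
```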
  • The speaker segmentation-clustering of step bd) comprises (1) speaker segmentation and (2) speaker clustering.
  • FIG. 13 is a schematic diagram of the speaker segmentation-clustering workflow.
  • Speaker segmentation finds the turning points where the speaker changes, so that the input speech is cut into speech segments by speaker: segment 1, segment 2, segment 3, ..., segment N (segment 1 and segment 3, for example, may be the same person's voice, but because another person's voice lies between them, the speech is cut at the speaker turning points), each segment containing the voice data of a single speaker only.
  • The goal of speaker clustering is to aggregate the speech segments of the same speaker, so that each class contains only one speaker's data and each person's data falls as far as possible into one class (in the example above, segment 1 and segment 3 can then be put together).
  • The speaker clustering of the present invention is performed using the LSP (line spectrum pair) feature: LSP feature data are extracted from the original speech, and the subsequent computations operate on them.
  • the focus of speaker segmentation is to find the turning point of speaker switching, including the detection of a single turning point and the detection of multiple turning points:
  • the single turning point detection includes the following steps: voice feature segment extraction, distance-based sequential detection, cross-detection, and turning point confirmation.
  • Voice feature segment extraction is performed in the same way as described above; alternatively the previously extracted voice features may be used directly, and the details are not repeated here.
  • FIG. 15 shows a schematic diagram of distance-based sequential detection of a single turning point.
  • The detection method assumes that there is no turning point within the first short interval of the speech segment, so the first 1-3 seconds of the speech are taken as the template window of speaker 1, and the distance between the template and each sliding window segment is computed.
  • As the distance measure, the present invention adopts the "generalized likelihood ratio".
  • d(t) represents the distance value between the sliding window at time t and the template window of speaker 1.
  • FIG. 16 shows the distance curve after sequential detection. As can be seen from FIG. 16, when the sliding window is within the range of the first speaker, both the template segment and the moving window contain the first speaker's speech, so the distance value is small. When the moving window reaches the range of the second speaker, the sliding window contains the second speaker's voice, so the distance value gradually increases. It can therefore be assumed that where the distance value is largest, the probability that the second speaker's voice is nearby is greatest.
  • The template window of the second speaker is therefore determined by finding the maximum point of the distance curve.
  • A second distance curve can then be obtained by the same method as above; as shown in FIG. 18, the intersection of the two curves is the speaker turning point.
  • In the turning point confirmation formula, sign(·) is the sign function and d_cross is the distance value at the intersection of the two distance curves.
  • d(i) in formula (22) is the distance computed in the end region. If the final result is positive, this point is accepted as the speaker turning point; if negative, it is rejected.
  • Finding the multiple speaker turning points in the whole speech is done on the basis of single turning point detection, as follows:
  • Step 1) First set a relatively large time window (5-15 seconds long) and perform single turning point detection within the window.
  • Step 2) If no speaker turning point was found in the previous step, move the window to the right (by 1-3 seconds) and repeat step 1, until a speaker turning point is found or the speech segment ends.
  • Step 3) If a speaker turning point is found, record it, set the window's starting point to this turning point, and repeat steps 1)-2).
  • All the turning points of multiple speakers can thus be found, and the speech segmented accordingly into segment 1 through segment N.
  • The segmentation of the speakers is completed by the detection of the single turning point plus the detection of the multiple turning points; a sketch of this sliding-window search follows.
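  • The sketch below illustrates the sliding-window search of steps 1)-3); detect_single is a hypothetical stand-in for the single turning point detector (sequential detection, cross detection and confirmation) that returns a time in seconds inside the window, or None if no speaker change is found:

```python
def find_all_turning_points(audio_len_s, detect_single, win_s=10.0, step_s=2.0):
    """Scan the recording and return all speaker turning points (seconds)."""
    turning_points, start = [], 0.0
    while start < audio_len_s:
        end = min(start + win_s, audio_len_s)   # step 1): large window
        tp = detect_single(start, end)
        if tp is None:
            start += step_s                     # step 2): slide right
        else:
            turning_points.append(tp)           # step 3): record the point,
            start = max(tp, start + 1e-3)       # restart the window there
    return turning_points
```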
  • Speaker clustering is a specific application of clustering technology in speech signal processing. The goal is to classify the speech segments so that each class contains only one speaker's data and the same speaker's data are merged into the same class.
  • The present invention proposes an improved hierarchical clustering (IHC) method, which merges classes and determines the number of classes by minimizing the within-class sum of squared errors; the specific steps, shown in FIG. 21, are as follows:
  • The "generalized likelihood ratio" is used as the distance measure.
  • The error criterion is the minimum within-class sum of squared errors. In speaker clustering, the distances between data of the same speaker are relatively small and the distances between data of different speakers are relatively large, so this criterion achieves good results.
  • The first step of the IHC algorithm uses the distance measure as the similarity and the improved sum-of-squared-errors criterion as the criterion function, and gradually merges the closest classes to build a cluster tree.
  • The present invention then employs a category-determination method based on hypothesis testing: each merge operation on the cluster tree is tested for reasonableness using the principles of hypothesis testing, determining the final number of classes. Once an unreasonable merge is found, the number of classes before that merge is taken as the final number of speakers.
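  • As a rough illustration of the IHC idea, the sketch below performs bottom-up merging that minimizes the within-class sum of squared errors (Ward linkage is used here as a stand-in criterion) and then cuts the cluster tree; the per-segment feature vectors, e.g. mean LSP/MFCC vectors, and the distance threshold are assumed inputs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_segments(segment_features, n_speakers=None, dist_threshold=1.0):
    """Merge speech segments into speaker classes on a cluster tree."""
    tree = linkage(segment_features, method="ward")  # min within-class SSE
    if n_speakers is not None:                       # known number of classes
        return fcluster(tree, t=n_speakers, criterion="maxclust")
    # Otherwise stop merging where the merge cost jumps past a threshold,
    # mimicking the per-merge hypothesis-test check described above.
    return fcluster(tree, t=dist_threshold, criterion="distance")
```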

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice access control and quiet environment monitoring method and system based on endpoint detection, speaker segmentation-clustering and voiceprint recognition. Access control based on speech recognition replaces the traditional door lock and key, and the content to be recognized takes the form of a random character string, further enhancing security. The method and system facilitate the management of quiet environments such as student dormitories and can replace the time-consuming, labor-intensive traditional practice of teachers personally inspecting dormitories at night; they are convenient and reliable, and provide a dependable guarantee of students' rest quality.

Description

Voice access control and quiet environment monitoring method and system

TECHNICAL FIELD

The present invention relates to a voice access control and quiet environment monitoring method and system, used for voice-based identity verification for entry into a closed environment and for monitoring quietness within that closed environment, in particular a method and system for monitoring a student dormitory environment during sleep hours, when visual observation is not possible.

BACKGROUND

In recent years, as school boarding conditions have steadily improved, more and more parents see boarding as a way to relieve the pressure of childcare and a choice that helps their children study well. The number of boarding students has therefore grown steadily, while for schools, managing so many boarders in every respect is genuinely difficult; dormitory management is undoubtedly a serious challenge for every school. Especially in the period after lights-out at night, if students do not rest properly as required, their sleep and the next day's learning quality are seriously affected.

On the other hand, with the rapid development and gradual maturing of speech technology and people's pursuit of convenient human-computer interaction, voice has gradually become the most natural, most convenient and most effective tool for interacting with the outside world, and is also one of the main information carriers of people's daily lives. With the development of the mobile Internet and smart devices, human society is progressively entering the information age, and intelligent speech technology has gradually stood out from among the many pattern recognition technologies, playing an increasingly important role. Speech-related technologies are gradually being integrated into social platforms, e-commerce, smart wearables, smart homes and even the financial industry, where they play an important role. This makes it possible to use speech technology to relieve the pressure of dormitory management.
CN102708867A (published October 3, 2012) discloses a voiceprint- and speech-based identity recognition method and system that resists recording-based impersonation, usable in the field of identity authentication. Specifically, a fixed text carrying a user ID is generated and concatenated with a random text into a prompt text; the user's speech reading the prompt text is collected; the user's voiceprint model and speech model are built; and the fixed text with the user ID, the voiceprint model and the speech model are stored. For example, the fixed text with the user ID is 4-7 Chinese characters.

CN204791241U (published November 18, 2015) discloses a voice-interactive access control system mounted on a door, comprising an access controller and an electronic lock. The access controller includes a microphone, a wireless network module, a camera and the like, and runs an Android or Windows operating system. The access controller periodically reads an ultrasonic sensor and the door's magnetic-contact state; when the sensor detects someone lingering in front of the door, the system automatically lights up the touch display and plays a greeting through the loudspeaker, while the microphone waits to receive the user's speech and sends it to the voiceprint recognition module.

CN102760434A (published October 31, 2012) discloses a method and terminal for updating a voiceprint feature model: an original audio stream containing at least one speaker is acquired; according to a preset speaker segmentation and clustering algorithm, the separate audio stream of each speaker in the original audio stream is obtained; each speaker's audio stream is matched against the original voiceprint feature model; and the successfully matched audio streams are obtained.

CN104376619A (published February 25, 2015) discloses a monitoring method applied to a first device, which is mounted on or outside a door and has a first acquisition unit. The first device first collects image and sound information outside the door; the first acquisition unit may be an image or sound acquisition device. When a visitor enters a certain area in front of the security door and the first acquisition unit captures the visitor's arrival, it records audio and shoots in real time, and transmits the image and sound information to an information processing apparatus installed in the first device, which judges the visitor's identity.
Analysis of the prior art shows that it contains no integrated access control and quiet environment monitoring system: prior-art access control systems serve only as door guards; voiceprint models must be obtained through dedicated advance training; there is room for improvement in the segmentation, clustering and extraction of speech when several people talk in a quiet environment; and in particular there is no dedicated method or system for voiceprint recognition of a known, fixed group of people.

Compared with technologies such as fingerprint recognition, iris recognition and face recognition currently applied to access control and attendance, voice has the following advantages:

1. Voice is a naturally produced signal rather than a part of the human body, and generally poses no harm or additional threat to the user.

2. With the increasing intelligence of smart devices and embedded systems and the development of the mobile Internet, the ease of signal acquisition often determines a product's cost, usability and the user's immediate experience. Given the ubiquity of microphones, voice is comparatively the easiest signal to acquire and transmit, the acquisition process is very simple, and in practice the cost of a sound-card microphone is extremely low.

On the other hand, endpoint detection of active speech signals is already widely used, and speaker segmentation-clustering and speaker recognition, as the most effective speech analysis technologies, enable labor-saving, highly reliable automatic monitoring of a quiet dormitory environment.

In addition, any closed area that requires identity verification for entry also needs its quietness monitored, particularly under non-visual conditions. For example, in nighttime care of hospital in-patients in the dark, when other means are inconvenient, a patient's direct call is most effective; through voice recognition and monitoring, the patient can be identified from the calling voice alone, giving medical staff quick guidance.
SUMMARY OF THE INVENTION

The present invention is mainly applied to monitoring the quiet environment (rest periods such as sleep time) of school boarding students' dormitories, but its application scenarios are not limited to this: the method and system of the present invention are applicable to any closed environment that requires identity verification for entry and whose quietness needs to be monitored.

Because users (students) read a different prompt text each time they pass through the access control system's speech recognition process, the method and system of the present invention collect the users' voiceprint information and gradually build each user's voiceprint model without dedicated voiceprint model training, improving efficiency and saving labor costs. The present invention also improves the segmentation-clustering method, raising clustering efficiency and accuracy; makes improvements in other related respects; and improves recognition efficiency and accuracy by managing the information of the people in a fixed space. The technical solution of the present invention is as follows:
The present invention provides an intelligent voice access control and quiet environment monitoring method for student dormitories based on speech recognition and voiceprint recognition, comprising the following steps:

- a voice access recognition step for performing voice verification at the door, performing speech recognition and then voiceprint recognition on the collected audio of the person to be verified;

- a quiet environment monitoring step for voice monitoring in the quiet environment, comprising in sequence endpoint detection, speaker segmentation-clustering and voiceprint recognition; and

- a central processing step for processing the data of the voice access recognition step and the quiet environment monitoring step.

The voice access recognition step further comprises:

aa) the person to be verified triggers voiceprint verification;

ab) a verification string pops up;

ac) the person to be verified reads the verification string aloud;

ad) the read-aloud audio is recorded; speech recognition first determines whether the correct string was spoken, and voiceprint verification then determines whether the speaker is a valid registrant, thereby deciding whether to open the door.

The quiet environment monitoring step further comprises:

ba) monitoring is switched on during a specified time period;

bb) endpoint detection is started to judge whether the environment is quiet;

bc) if the environment is judged not quiet, the corresponding audio segment is captured through endpoint detection;

bd) the detected audio segment is subjected to speaker segmentation-clustering analysis, after which the respective audio data of the different speakers are separated and obtained;

be) based on the stored voiceprint models, voiceprint recognition is performed on each audio item in the audio data to obtain the identity of the person who produced it;

bf) the identity information, together with the audio data and the time it was produced, is sent and displayed to the administrator.
In step bd):

the speaker segmentation-clustering analysis comprises a speaker segmentation step, a speaker clustering step and a voiceprint recognition step;

the speaker segmentation step is used to find the turning points where the speaker changes, including detection of a single turning point and detection of multiple turning points;

single turning point detection comprises distance-based sequential detection, cross detection and turning point confirmation;

multiple turning point detection is used to find the multiple speaker turning points in the whole speech, performed on the basis of single turning point detection, with the following steps:

step 1): first set a relatively large time window, 5-15 seconds long, and perform single turning point detection within the window;

step 2): if no speaker turning point was found in the previous step, move the window 1-3 seconds to the right and repeat step 1, until a speaker turning point is found or the speech segment ends;

step 3): if a speaker turning point is found, record it, set the window's starting point to this turning point, and repeat steps 1)-2).

The turning point is confirmed by the formula:

Figure PCTCN2017077792-appb-000001

sign(·) is the sign function, and d_cross is the distance value at the intersection of the two distance curves;

using the region from the start of the speaker's distance curve to the crossing point, d(i) in the formula is the distance computed within this end region; if the final result is positive, the point is accepted as a speaker turning point, and if negative it is rejected.
In the voice access recognition step, the pop-up verification string is a randomly generated multi-character string, so the information to be read aloud is different each time.

The endpoint detection is implemented with a 360-degree ring microphone array to guarantee the sensitivity of audio acquisition and the quality of the collected audio.

On the basis of recording the read-aloud audio in step ad), the voice access recognition step further comprises step ae):

for each verified person, each read-aloud audio is saved as training audio for that person's voiceprint model, until the person's voiceprint model is successfully built.

The voiceprint model of step be) is trained on the audio data saved in step ae).

When the person to be verified triggers voiceprint verification, facial image acquisition starts at the same time and captures the person's facial image; once obtained, the facial image is compared in the central processing step to obtain the person's information, and the collected voice signal is associated with the registration information to form an association database.

After a person to be verified enters the closed environment, that person's information is activated; for those who have registered but have not entered the dormitory, the system does not activate their information but sends it to the administrator.

In step be), comparison is first made against this activated information;

if no matching person is found among the activated people, the comparison is extended to all registered people; if that comparison succeeds, a prompt indicating unauthorized entry or failure to check in properly is generated;

if no comparison succeeds, a warning of illegal intrusion is generated.

In each unit of the closed environment there are provided:

at least one ring microphone array;

an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and

a sound playback device for communicating with people in the monitored environment.

The central processing step sends and displays the identity information, the audio data produced and the time of production to the administrator, transmitting them to a monitoring device associated with the system backend or the central processing step, so that the monitor can manage matters intuitively and conveniently and readily take corresponding management measures.
A voice access control and quiet environment monitoring system comprises a voice access recognition module, a quiet environment monitoring module and a central processing module,

the voice access recognition module being used to perform voice verification at the door, performing speech recognition and then voiceprint recognition on the collected audio of the person to be verified;

the quiet environment monitoring module being used for voice monitoring in the quiet environment, comprising in sequence endpoint detection, speaker segmentation-clustering and voiceprint recognition;

the voice access recognition module and the quiet environment monitoring module both being connected to the central processing module.

The quiet environment monitoring module further comprises a speaker segmentation module, a speaker clustering module and a voiceprint recognition module;

the speaker segmentation module is used to find the turning points where the speaker changes, including detection of a single turning point and detection of multiple turning points;

single turning point detection comprises distance-based sequential detection, cross detection and turning point confirmation;

multiple turning point detection is used to find the multiple speaker turning points in the whole speech, performed on the basis of single turning point detection, with the following steps:

step 1): first set a relatively large time window, 5-15 seconds long, and perform single turning point detection within the window;

step 2): if no speaker turning point was found in the previous step, move the window 1-3 seconds to the right and repeat step 1, until a speaker turning point is found or the speech segment ends;

step 3): if a speaker turning point is found, record it, set the window's starting point to this turning point, and repeat steps 1)-2).

The turning point is confirmed by the formula:

Figure PCTCN2017077792-appb-000002

sign(·) is the sign function, and d_cross is the distance value at the intersection of the two distance curves;

using the region from the start of the speaker's distance curve to the crossing point, d(i) in the formula is the distance computed within this end region; if the final result is positive, the point is accepted as a speaker turning point, and if negative it is rejected.
The voice access recognition module is arranged outside the door of the closed environment and comprises a microphone for audio acquisition, a button for triggering access recognition, and a display device for showing the character string.

The voice access recognition module further comprises a voice playback device for interacting with the person to be verified;

an infrared detection unit may replace the button, so that system verification switches on automatically when a person to be verified approaches.

The voice access recognition module further comprises a facial image acquisition device for capturing the face of the person to be verified.

The voice access recognition module further comprises an interface for connecting a mobile terminal; after the mobile terminal is connected through the interface, the functions of the microphone, button, display device and facial image acquisition device are implemented by the mobile terminal's microphone, on-screen virtual buttons, display screen and camera.

The mobile terminal is installed with an APP or PC software client implementing the voice access recognition function.

The mobile terminal is connected to the door opening/closing system by wire or wirelessly, so as to decide whether to open or close the door according to the verification result.

Before entering, the person to be verified presses the access recognition button to start speech recognition; the facial image acquisition device switches on simultaneously and captures the person's facial image, which is sent to the central processing module for comparison to obtain the person's registration information; the collected voice signal is associated with the registration information to form an association database.

After a person to be verified enters the closed environment, the system activates that person's information; for those who have registered but have not entered the dormitory, the system does not activate their information but sends it to the system administrator.

When comparing, the system first compares against this activated information;

if no matching person is found among the activated people, the comparison is extended to all registered people; if that comparison succeeds, a prompt indicating unauthorized entry or failure to check in properly is generated;

if no comparison succeeds, a warning of illegal intrusion is generated, and the administrator can confirm the information through voice interaction.

The quiet environment monitoring module is arranged in each unit of the closed environment and comprises at least one ring microphone array;

an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and

a sound playback device for communicating with people in the monitored environment.

The central processing module may be arranged separately in the system backend, may be integrated with the voice access recognition module, or may be integrated with the quiet environment monitoring module, and is used to process and display the monitoring information obtained by the quiet environment monitoring module.

The central processing module sends and displays the identity information, the audio data produced and the time of production to the administrator, transmitting them to a monitoring device connected to the system backend or the central processing module, so that the monitor can manage matters intuitively and conveniently and readily take corresponding management measures.

Relying on advanced speech technology, the intelligent dormitory access control and automatic quiet environment monitoring system of the present invention makes the collection of access and monitoring data safe, convenient and simple, makes the monitored indicators intuitive and effective, and helps make school dormitory management simple and convenient yet reliable and effective.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the system architecture according to the present invention;
FIG. 2 is a schematic diagram of the voice access recognition step according to the present invention;
FIG. 3 is a schematic diagram of the quiet environment monitoring step according to the present invention;
FIG. 4 is a schematic diagram of another voice access recognition step according to the present invention;
FIG. 5 is a schematic diagram of speech model training according to the present invention;
FIG. 6 is a schematic diagram of speech model construction according to the present invention;
FIG. 7 is a schematic diagram of speech model association according to the present invention;
FIG. 8 is a schematic diagram of voice verification according to the present invention;
FIG. 9 is a schematic diagram of the voiceprint model training steps according to the present invention;
FIG. 10 is a schematic diagram of i-vector training according to the present invention;
FIG. 11 is a schematic diagram of a conventional fixed beamforming system in the prior art;
FIG. 12 is a schematic diagram of the time intervals used to compute each channel's optimal delay in the beamforming method of the present invention;
FIG. 13 is a schematic diagram of the speaker segmentation-clustering workflow according to the present invention;
FIG. 14 is a flowchart of single turning point detection according to the present invention;
FIG. 15 is a schematic diagram of distance-based sequential detection according to the present invention;
FIG. 16 is a sequential-detection distance curve according to the present invention;
FIG. 17 is a schematic diagram of finding the second speaker's speech template according to the present invention;
FIG. 18 is a schematic diagram of cross-detecting the speaker turning point according to the present invention;
FIG. 19 is a schematic diagram of erroneous turning point detection according to the present invention;
FIG. 20 is a schematic diagram of turning point confirmation according to the present invention; and
FIG. 21 is a block diagram of the IHC algorithm according to the present invention.
DETAILED DESCRIPTION
Specific embodiments of the invention are described in further detail below with reference to the drawings.
As shown in Figure 1, the voice access control and quiet environment monitoring system of the invention comprises a voice access control recognition module, a quiet environment monitoring module and a central processing module, the first two both being connected to the central processing module. The central processing module can control the two modules; they may be connected by wire or wirelessly, over a wired or wireless network.
The voice access control recognition module is arranged outside the door of the closed environment and comprises a microphone for capturing audio, a button for triggering access control recognition, a display device for displaying the string, a facial image capture device, and the like. Preferably, it may also comprise a voice playback device for interacting with the person to be verified.
The microphone may be a monophonic microphone, usually arranged on the outside of the door for convenient capture of the access control speech; it may also be the microphone of another mobile device such as a mobile phone.
The button may be a touch button, or may be replaced by an infrared detection unit so that verification starts automatically when a person to be verified approaches.
The display device may be any common display or screen, or the screen of a mobile phone or other mobile device, for showing the user the string and other prompts.
The facial image capture device may be a webcam or camera; it may be installed separately or may be the camera of a mobile phone or other mobile device.
The voice playback device may be a separately installed loudspeaker, or the sound playback device of a mobile phone or other mobile device.
Preferably, no dedicated recognition and verification hardware need be installed for the door access system: a networked mobile terminal such as a smartphone can control it.
Preferably, a mobile device such as a smartphone installed with a voice door access APP serves as the recognition and verification device; the smartphone's microphone, camera, screen and buttons are invoked to play the corresponding roles, and the smartphone connects to the central processing module over a network, for example a wireless network.
The mobile terminal, for example a phone, connects to the door opening/closing system by wire or wirelessly, for example over Bluetooth, to open or close the door according to the verification result.
Using a mobile terminal is particularly suitable for temporarily closed environments, for example temporary dormitories, or for emergencies after the door access system has been damaged.
Preferably, an interface for connecting a mobile terminal such as a smartphone is reserved even alongside a normal door access system.
Before entering, the person to be verified presses the button triggering access control recognition to start speech recognition; the facial image capture device is switched on simultaneously to capture the person's facial image, which is sent to the central processing module for comparison to obtain the person's registration information, and the captured speech signal is associated with the registration information to form an association database.
Once a person to be verified has entered the closed environment, for example the dormitory, the system activates that person's information; for people who have registered but not entered the dormitory, the system does not activate their information but sends it to the system administrator.
The entrants' information is activated so that speech can be recognized and compared more conveniently during the monitoring phase; when comparing, the system first compares against the activated information.
Throughout the verification and recognition process, the loudspeaker can give the user prompts and explanations.
Optionally, commonly used identity cards, such as passes or employee cards, can be configured for identity recognition, replacing or assisting the facial recognition device.
The quiet environment monitoring module is arranged in each unit of the closed environment, for example in each student dormitory, and comprises at least one circular microphone array. It may further comprise an ambient brightness recognition unit for detecting the brightness of the dormitory environment and switching monitoring on or off automatically, and further a sound playback device for communicating with the people in the monitored environment.
The circular microphone array may be a 360-degree circular microphone array, placed at the center of the ceiling or another suitable indoor position so that the monitored speech signals are captured conveniently, sensitively and accurately.
The quiet environment is a dormitory or other closed environment; monitoring is switched on mainly in non-visible or dimly lit conditions, though it can of course also be used during fixed, well-lit daytime periods.
The central processing module may be arranged separately at the system back end, integrated with the voice access control recognition module, or integrated with the quiet environment monitoring module, and can process and display the monitoring information obtained by the quiet environment monitoring module.
Based on the provenance of the captured speech data, for example a particular unit of the closed area such as a particular dormitory, the central processing module retrieves the speech models of the people registered to and activated in that dormitory for fast comparison, maximizing recognition speed and accuracy. If no matching person is found among the activated people, the comparison is widened to all registrants; if that comparison succeeds, a prompt of illegal entry or failure to check in properly is produced; if it does not succeed, an illegal intrusion alert is produced, and the administrator can confirm the information by voice interaction.
Optionally, the system stores abnormal sound models for handling non-speech sounds, for example broadcast football or basketball matches, music, or calls such as cries for help, shouting or fire alarms, so that protection can also be provided in emergencies.
The central processing module sends and displays the identity information, the corresponding audio data, the emission time and the like to the administrator, for example transmitting the noisy periods, the noise level and the identities of the noise-makers to a monitoring device connected to the system back end or the central processing module, so that the supervisor can manage intuitively and conveniently and take appropriate measures.
The administrator can receive this information through an APP client or PC software client, or have it shown on a configured display or monitoring screen.
In the system of the invention, the voice access control recognition module, the quiet environment monitoring module and the central processing module are integrated in an ARM-based embedded Linux system.
As shown in Figures 2-4, the voice access control and quiet environment monitoring method of the invention comprises the following steps:
— a voice access control recognition step for voice verification at the door, applying speech recognition and then voiceprint recognition to the captured audio of the person to be verified;
— a quiet environment monitoring step for voice monitoring in a quiet environment, comprising in sequence endpoint detection, speaker segmentation and clustering, and voiceprint recognition.
The voice access control recognition step further comprises:
aa) the person to be verified triggers voiceprint verification, for example by pressing the button that triggers access control recognition, by automatic infrared sensing, or by swiping a pass card;
ab) a verification string is presented; the verification string is a randomly generated multi-digit string, so the information verified differs on every attempt;
ac) the person to be verified reads the verification string aloud;
ad) the read-aloud audio is recorded; speech recognition first checks whether the correct string was spoken, and voiceprint verification then checks whether the speaker is a valid registrant, on which basis the system decides whether to open the door.
Optionally, the registrants' (verifiers') voiceprint models can be trained in advance; a valid registrant is then simply one of the previously registered people.
Usually, however, collecting speech from a large number of students in one batch for voiceprint registration is time-consuming and laborious, may be inaccurate, and requires repeated operation, making it highly inefficient. The invention therefore preferably builds each person's speech model gradually, by collecting and saving the audio of every reading of the verification string: for each registrant, each read-aloud audio recording is saved as training audio for that registrant's voiceprint model, until the voiceprint model has been built successfully. A minimal sketch of the resulting two-stage check is given below.
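As an illustration only, the following Python sketch wires together the two checks of step ad). Here asr_transcribe and verify_speaker are hypothetical callables standing in for whatever speech recognizer and voiceprint scorer a deployment uses, and the threshold is an assumed placeholder, not a value from the text:

```python
import random
import string

def generate_challenge(n_digits: int = 6) -> str:
    """A fresh random digit string per attempt, so a replayed recording
    of an earlier challenge will fail the speech recognition check."""
    return "".join(random.choices(string.digits, k=n_digits))

def door_access_check(audio, challenge, asr_transcribe, verify_speaker,
                      threshold: float = 0.0) -> bool:
    """Two-stage check of step ad): first the spoken text must equal the
    challenge exactly, then the voiceprint score must clear a threshold."""
    if asr_transcribe(audio) != challenge:
        return False                       # wrong or replayed text: reject
    score, _speaker_id = verify_speaker(audio)   # voiceprint comparison
    return score > threshold
```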
The quiet environment monitoring step further comprises:
ba) monitoring is switched on during a prescribed time period; for a student dormitory, for example, the quiet environment monitoring module starts automatically at lights-out or during any other rest period, entering monitoring mode;
optionally, an indoor brightness detection unit can be provided to switch the monitoring module automatically according to the indoor brightness;
bb) endpoint detection is started to judge whether the environment is quiet, for example monitoring by voice endpoint detection whether anyone in the dormitory is talking noisily; the endpoint detection is implemented with a 360-degree circular microphone array, to guarantee the sensitivity of audio capture and the quality of the captured audio;
bc) if the environment is judged non-quiet, the offending audio segment is located by endpoint detection;
bd) speaker segmentation and clustering analysis is performed on the detected audio segment, separating it into per-speaker audio data;
be) using the stored voiceprint models, voiceprint recognition is performed on each per-speaker audio item to obtain the identity of its speaker;
the voiceprint models are trained on the audio data saved in step ae);
bf) the identity information, the corresponding audio data, the emission time and the like are sent to and displayed for the administrator.
Specifically, the identity information, the corresponding audio data and the emission time are sent to and displayed for the administrator, for example transmitting the noisy periods, the noise level and the identities of the noise-makers to a monitoring device connected to the system back end or the central processing module, so that the supervisor can manage intuitively and conveniently and take appropriate measures.
Optionally, the monitoring method and system of the invention can also serve other related services, especially speech services in non-visible environments, for example calls for help during a student dormitory emergency: by capturing and analyzing the caller's audio, an alarm or warning service can be provided to the system administrator.
The monitoring device can transmit text messages, voice mail or picture messages through a transmitting device, for example by SMS, MMS or WeChat.
In the method of the invention, recognizing a random digit string in step ad) of the voice access control recognition, rather than a fixed text, prevents an impostor from passing the door verification with a recording.
Figure 4 shows the speech recognition process for the captured read-aloud audio. The data used to train the model are captured with the same microphone as the voice access control recognition, or directly by that microphone; using the same microphone reduces the influence of channel differences on the recognition result.
In the method of the invention, the voiceprint recognition technique used in step be) of the quiet environment monitoring is the same as that used in step ad) of the voice access control recognition, and comprises:
(I) a model training step;
(II) a personal model registration step; and
(III) a verification step.
The specific execution of each step is described below:
(I) Model training step
As shown in Figure 5, the model training step uses a large amount of labeled speaker data to train, in advance, the global models of a text-independent speaker verification system. This step is completed offline, before the registration and verification steps.
The speaker data can be obtained by collecting each valid reading of the verification string. The invention preferably collects the training data from each valid read-aloud recording, which greatly reduces audio collection time, saves labor and material, and improves the user experience.
Furthermore, this collection scheme lets the system refine and improve the training models gradually and continuously, steadily raising the accuracy of speech recognition.
In addition, from a management point of view, the gradual ramp-up of the system also gives managers and the managed a buffer period in which to accept this kind of monitoring.
(II) Personal model registration step
As shown in Figures 6 and 7, this step uses the trained voiceprint models to add a newly arrived target voiceprint registrant to the model database.
(III) Verification step
As shown in Figure 8, this step puts the speech data of the person to be verified through the same processing as the registration step and compares it against the models of the students of the relevant dormitory to judge whether the speaker is one of them, then decides whether verification passes. Preferably, the information of students who pass verification is activated for convenient use during monitoring, improving recognition speed and accuracy.
For the model training step (I), the invention adopts i-vector/PLDA text-independent speaker verification.
As shown in Figure 9, voiceprint model training comprises: (1) MFCC feature extraction, (2) GMM-UBM modeling, (3) i-vector extractor training, and (4) PLDA training.
The parameters shown in Figure 9, namely (θ), (T) and (Φ, Σ), are what this training step produces; together they constitute the voiceprint model.
(1) MFCC feature vector extraction
All raw speech data must be converted, by digital signal processing, into feature vectors that represent the relevant characteristics of the original speech and can be computed on; the invention uses MFCC (Mel-Frequency Cepstral Coefficient) feature vectors as the speech feature parameters.
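A minimal sketch of this feature extraction, assuming the librosa toolkit, a 16 kHz mono recording and the file name "utterance.wav" (the text does not prescribe a toolkit, sample rate or file name):

```python
import librosa

# Load one door-access utterance and extract 13-dimensional MFCCs.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
features = mfcc.T                                    # one D-dim vector per frame
```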
(2) GMM-UBM modeling
The UBM is a universal background model trained from the speech feature parameters (MFCCs) of a large number of speakers of all kinds. The invention models with a GMM-UBM (Gaussian mixture model - universal background model).
As shown in formula (1), the GMM-UBM can be expressed as a linear weighting of M D-dimensional Gaussian density functions, where M (the number of Gaussians) and D (the MFCC dimension) are set or known in advance:
$p(x\mid\theta)=\sum_{j=1}^{M}\alpha_j\,p_j(x)$ ……(1)
where $x_i$ denotes the i-th component of a feature vector and j indexes the j-th Gaussian, i=0,1,…,D; j=1,…,M.
In formula (1), $p_j$ is a multivariate normal distribution:
$p_j(x)=\dfrac{1}{(2\pi)^{D/2}\lvert\Sigma_j\rvert^{1/2}}\exp\Big(-\tfrac{1}{2}(x-\mu_j)^{\mathsf T}\Sigma_j^{-1}(x-\mu_j)\Big)$ ……(2)
Training the GMM-UBM thus means finding the optimal parameters $\theta=\{\alpha_j,\mu_j,\Sigma_j\}$, estimated with the Expectation-Maximization (EM) algorithm.
A model is simply a set of parameters, here $\alpha_j,\mu_j,\Sigma_j$ (j=1 to M), written uniformly as θ for convenience; modeling means finding the optimal θ, the method being the EM algorithm, and once θ is found, modeling is complete: this θ is the model.
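A minimal sketch of the EM-based UBM training, assuming scikit-learn's GaussianMixture and a hypothetical file of pooled MFCC frames; 512 components and diagonal covariances are illustrative choices, not values fixed by the text:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# ubm_features: frames pooled from many speakers, shape (n_frames, D).
ubm_features = np.load("pooled_mfcc.npy")            # assumed file name

# GaussianMixture runs EM internally, estimating theta = {alpha, mu, Sigma}.
ubm = GaussianMixture(n_components=512,              # M Gaussians
                      covariance_type="diag",
                      max_iter=100)
ubm.fit(ubm_features)
# ubm.weights_, ubm.means_, ubm.covariances_ correspond to alpha, mu, Sigma.
```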
(3) i-vector extractor training:
Before training, the MFCC feature vectors of each utterance are projected onto every Gaussian component of the GMM-UBM and averaged over time, yielding the corresponding Baum-Welch statistics, computed as follows:
Given the trained GMM-UBM parameters $\theta=\{\alpha_j,\mu_j,\Sigma_j\}$ and an MFCC feature sequence $\{y_1,y_2,\cdots,y_L\}$ (of dimension D, as in the GMM-UBM training step), the zero-order statistics $N=[N_1,N_2,\dots,N_C]$ are computed by formula (3):
$N_c=\sum_{t=1}^{L}P(c\mid y_t,\theta)$ ……(3)
The dimension C of N equals the number of Gaussian mixtures M. The first-order statistics $F=[F_1^{\mathsf T}\,F_2^{\mathsf T}\cdots F_C^{\mathsf T}]^{\mathsf T}$ are obtained by formula (4):
$F_c=\sum_{t=1}^{L}P(c\mid y_t,\theta)\,y_t$ ……(4)
Because the values of N do not strictly follow a probability density function, the first-order statistics are normalized by the zero-order statistics, formula (5):
$\tilde F_c=\dfrac{F_c}{N_c}-\mu_c$ ……(5)
$\tilde F_c$ represents the average difference, over time, between the speech feature sequence and the mean of one Gaussian component of the GMM-UBM. Concatenation finally gives the mean-centered vector, formula (6):
$\tilde F=[\tilde F_1^{\mathsf T},\tilde F_2^{\mathsf T},\dots,\tilde F_C^{\mathsf T}]^{\mathsf T}$ ……(6)
($\tilde F$ and N are used in the formulas that follow.)
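The statistics of formulas (3)-(6) follow directly from the UBM posteriors; a numpy sketch, assuming the scikit-learn UBM from the previous sketch (the small constant guarding the division is an implementation detail, not from the text):

```python
import numpy as np

def baum_welch_stats(ubm, Y):
    """Zero- and first-order statistics of an utterance Y (L frames, D dims)
    against a trained GaussianMixture ubm, per formulas (3)-(6)."""
    P = ubm.predict_proba(Y)                 # posteriors P(c | y_t), (L, C)
    N = P.sum(axis=0)                        # zero-order stats, (C,)
    F = P.T @ Y                              # first-order stats, (C, D)
    # normalize by N and center on the UBM means -> mean-centered vectors
    F_tilde = F / (N[:, None] + 1e-10) - ubm.means_
    return N, F_tilde.reshape(-1)            # supervector of dimension C*D
```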
Next, $\tilde F$ must be projected into a low-rank total variability space, formula (7):
$\tilde F = T\,x$ ……(7)
This T is a matrix, the parameter the i-vector extractor must learn.
The estimation (training) algorithm for T:
For a given speech segment j, the prior and conditional distributions of the hidden variable follow the multivariate Gaussians of formula (8):
$x_j\sim N(0,I),\qquad \tilde F_j\mid x_j\sim N(T x_j,\,\Sigma)$ ……(8)
where Σ is a diagonal covariance matrix of dimension CD×CD;
T is estimated with the EM algorithm (the GMM-UBM used a similar algorithm), yielding the optimal T.
(4) PLDA training
Extracting the i-vector feature vectors:
Before PLDA training, the i-vector feature vectors must be extracted, since PLDA is trained on i-vectors. Extraction is as follows:
once T has been trained via formula (7), $\tilde F$ can be projected onto T to obtain the hidden variable x, formula (9):
$x=(I+T^{\mathsf T}\Sigma^{-1}NT)^{-1}\,T^{\mathsf T}\Sigma^{-1}N\tilde F$ ……(9)
where N here denotes the CD×CD diagonal matrix formed by repeating each zero-order statistic D times. This x is the i-vector feature vector to be extracted.
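A numpy sketch of formula (9) as reconstructed above, assuming a diagonal Σ stored as a vector and the centered statistics of the previous sketch; the layout convention (each zero-order statistic repeated D times) is an assumption:

```python
import numpy as np

def extract_ivector(T, Sigma_diag, N, F_tilde):
    """Posterior mean of the hidden variable x, formula (9).
    T: (C*D, R) total variability matrix; Sigma_diag: (C*D,) diagonal of Sigma;
    N: (C,) zero-order statistics; F_tilde: (C*D,) mean-centered supervector."""
    C = N.shape[0]
    D = Sigma_diag.shape[0] // C
    N_sv = np.repeat(N, D)                   # zero-order stats per supervector dim
    # L = I + T' Sigma^-1 N T  (diagonal matrices commute, so order is free)
    L = np.eye(T.shape[1]) + T.T @ (N_sv[:, None] * T / Sigma_diag[:, None])
    rhs = T.T @ (N_sv * F_tilde / Sigma_diag)   # T' Sigma^-1 N F_tilde
    return np.linalg.solve(L, rhs)
```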
PLDA training:
PLDA stands for Probabilistic Linear Discriminant Analysis. It is trained on speaker-labeled data and strictly separates between-speaker variability from within-speaker variability.
Let $\eta_{ij}$ denote the j-th i-vector of the i-th speaker in the training data. PLDA assumes that i-vector data are generated from a hidden variable in a low-dimensional space:
$\eta_{ij}=\Phi\beta_i+\varepsilon_{ij}$ ……(10)
$\Phi\beta_i$ is described by the between-speaker variability subspace; its value depends only on the speaker's identity and is therefore identical for one and the same person. $\varepsilon_{ij}$ is a noise term; its value depends not only on the speaker's identity but also on other factors affecting within-speaker variability, and therefore differs from utterance to utterance.
If the i-th speaker has $M_i$ i-vectors, the speaker's sufficient statistics can be computed, formulas (11)-(12):
$f_i=\sum_{j=1}^{M_i}\eta_{ij}$ ……(11) $\qquad S_i=\sum_{j=1}^{M_i}\eta_{ij}\eta_{ij}^{\mathsf T}$ ……(12)
For the i-th speaker, the prior and conditional distributions of the hidden variable β both follow multivariate Gaussians:
$\beta_i\sim N(0,I),\qquad \eta_{ij}\mid\beta_i\sim N(\Phi\beta_i,\,\Sigma)$ ……(13)
As shown in Figure 10, the parameters are estimated with the EM algorithm, as in i-vector training, yielding the optimal values of (Φ, Σ). With these parameters, β can be obtained from formula (10).
For the personal model registration step (II):
once the i-vector/PLDA text-independent speaker verification system has been trained, a registrant's personal model is simply the $\beta_i$ of formula (10), obtained by following the i-vector/PLDA pipeline.
Order of steps: raw speech -> MFCC -> i-vector -> β.
For the verification step (III):
the speech data of the person to be verified goes through the same steps as registration to obtain that person's β. We now have the candidate's β and the β₁-β₄ of a dormitory's four residents (assuming four people per dormitory); the candidate's β (denoted $\beta_j$ below) is scored against each of the four, illustrated here for one resident's $\beta_i$:
Using hypothesis testing from Bayesian inference, the likelihood that the two i-vectors were generated by the same hidden variable β is computed as the final score, as follows:
H1 is the hypothesis that the two i-vectors come from the same speaker, i.e. $\beta_j=\beta_i$; H0 is the hypothesis that they were produced by different speakers, i.e. $\beta_j\neq\beta_i$;
the final score is the log-likelihood ratio of formula (*):
$\text{score}=\log\dfrac{p(\eta_1,\eta_2\mid H_1)}{p(\eta_1\mid H_0)\,p(\eta_2\mid H_0)}$ ……(*)
Finally the score is compared with a preset threshold to judge whether the two are the same speaker.
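A sketch of this hypothesis-test scoring for the model of formula (10); the closed-form Gaussian marginals used here are the standard ones for a two-covariance PLDA model, not formulas quoted from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(eta1, eta2, Phi, Sigma):
    """log p(eta1, eta2 | H1) - log p(eta1, eta2 | H0) for the model
    eta = Phi @ beta + eps, with beta ~ N(0, I) and eps ~ N(0, Sigma)."""
    B = Phi @ Phi.T                          # between-speaker covariance
    W = B + Sigma                            # marginal covariance of one i-vector
    d = len(eta1)
    pair = np.concatenate([eta1, eta2])
    cov_h1 = np.block([[W, B], [B, W]])      # a shared beta couples the pair
    cov_h0 = np.block([[W, np.zeros((d, d))],
                       [np.zeros((d, d)), W]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(pair, mean=zero, cov=cov_h1)
            - multivariate_normal.logpdf(pair, mean=zero, cov=cov_h0))
```

Scoring the candidate's i-vector against each roommate's and comparing the best score with the threshold reproduces the decision described above.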
In the system of the invention, a 360-degree microphone array captures speech data accurately and sensitively. During capture, environmental factors such as reverberation and background noise interfere considerably, so most captured speech is noisy.
The system places high demands on the purity of the speech and the sensitivity of signal capture; an array of multiple microphones processes the channel signals arriving from different directions in time and space, raising the signal-to-noise ratio and yielding cleaner, clearer speech data.
Speech enhancement with the microphone array to raise the signal-to-noise ratio relies chiefly on Wiener filtering and beamforming.
Wiener filtering can denoise the data captured by each individual microphone; the invention applies a Wiener filtering algorithm to each microphone's signal contaminated by stationary noise.
Beamforming delays and sums the signals of the individual microphones. Figure 11 shows a traditional fixed beamforming system, consisting of two parts, delay compensation and weighted summation, described by formula (15):
$y(n)=\sum_{i=1}^{M}\alpha_i\,x_i\big(n-\Delta t_i\big)$ ……(15)
Here y(n) is the beamformed signal, M the number of microphones, $\alpha_i$ the weight of the i-th microphone, and $\Delta t_i$ the time difference between the sound source's arrival at the i-th array element and at the reference element.
In the traditional fixed beamforming method, the signals received at the individual microphones are first time-compensated so that the channels' speech signals are synchronized; the channel signals are then weighted and averaged, the weighting coefficient $\alpha_i$ being a fixed constant, typically 1/M, which is why the traditional method is called fixed beamforming. The delay compensation $\Delta t_i$ only changes the phase of the received signal, cancelling the propagation delays at microphones in different positions so that the channels are synchronized and contribute equally to the summed output.
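A minimal numpy sketch of formula (15); integer-sample delays and circular shifting via np.roll are simplifying assumptions of this sketch:

```python
import numpy as np

def delay_and_sum(channels, delays, weights=None):
    """Fixed beamforming of formula (15): shift each channel by its delay,
    then weighted-average. channels: (M, n) array; delays: per-channel
    sample offsets; weights default to the fixed 1/M."""
    M, n = channels.shape
    if weights is None:
        weights = np.full(M, 1.0 / M)
    y = np.zeros(n)
    for m in range(M):
        # np.roll wraps at the edges, which is acceptable for a sketch
        y += weights[m] * np.roll(channels[m], -delays[m])
    return y
```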
The invention optimizes the traditional fixed beamforming method in three respects: (1) selection of the reference channel, (2) computation of N best delays per channel, and (3) a dynamic channel weighting method instead of the fixed 1/M.
In the optimized beamforming method of the invention, the output signal y[n] is described by formula (16):
$y[n]=\sum_{m=1}^{M}W_m[n]\;x_m\big[n+\mathrm{TDOA}(m,\mathrm{ref})[n]\big]$ ……(16)
where
$W_m[n]$ is the relative weight of the m-th microphone at time n, all weights summing to 1 at time n;
$x_m[n]$ is the signal received by the m-th channel at time n;
TDOA(m,ref)[n] is the delay of the m-th channel relative to the reference channel, used to align the signals at time n. In practice, TDOA(m,ref)[n] is recomputed every few frames by cross-correlation, using the GCC-PHAT (Generalized Cross Correlation with Phase Transform) delay estimator.
(1) Selection of the reference channel:
The optimized beamforming algorithm automatically finds the microphone channel that is most central to the sound source and of the best quality, and takes it as the reference channel.
To find it, the invention uses as a criterion a parameter based on the time-averaged cross-correlation between each channel i and all other channels j=1…M, j≠i. If the input audio has s frames, the s frames are divided into 200 segments of s/200 frames; each computation covers a 1 s length, and the next computation shifts right by s/200 frames. Formula (17):
$\overline{\mathrm{xcorr}}_i=\dfrac{1}{K}\sum_{k=1}^{K}\dfrac{1}{M-1}\sum_{j=1,\,j\neq i}^{M}\mathrm{xcorr}[i,j;k]$ ……(17)
where M is the total number of channels in the array, K=200 (the audio file is divided into 200 segments), and the result is averaged over K after the computations.
xcorr[i,j;k] is the cross-correlation peak of channels i and j in segment k. The channel with the largest $\overline{\mathrm{xcorr}}_i$ is chosen as the reference channel.
(2) Computation of N best delays per channel:
When the TDOA (Time Delay of Arrival) of each channel relative to the reference channel is computed, as shown in Figure 12, 500 ms of data are taken each time, and the next computation shifts by 250 ms before taking another 500 ms. This spacing lets the algorithm steer the beam quickly when the speaker changes. Here the 500 ms of data is called the analysis window and the 250 ms the analysis segment, so one analysis window covers the current analysis segment and the next.
In practice the sizes of the analysis window and analysis segment must be balanced. On the one hand, a large analysis window or segment lowers TDOA accuracy; on the other, a small analysis window lowers the robustness of the whole algorithm. Too small a window raises the computational complexity of the system without improving the quality of the output signal. The sizes are usually decided by experience; with a 500 ms window and a 250 ms segment the algorithm performs well.
Given two signals $x_i(n)$ (captured by the i-th microphone) and $x_{ref}(n)$ (captured by the reference microphone), their GCC-PHAT is computed by formula (18):
$\hat R_{i,\mathrm{ref}}(d)=F^{-1}\!\left[\dfrac{X_i(f)\,X_{\mathrm{ref}}^{*}(f)}{\lvert X_i(f)\,X_{\mathrm{ref}}^{*}(f)\rvert}\right]$ ……(18)
where
$X_i(f)$ and $X_{\mathrm{ref}}(f)$ are the Fourier transforms of the two signals, $F^{-1}$ denotes the inverse Fourier transform, []* the complex conjugate, and |·| the modulus.
$\hat R_{i,\mathrm{ref}}(d)$ is the cross-correlation function of signal i and signal ref; owing to the amplitude normalization, its values range from 0 to 1.
The delay between microphone signals i and ref can then be expressed by formula (19):
$d_1(i,\mathrm{ref})=\arg\max_{d}\,\hat R_{i,\mathrm{ref}}(d)$ ……(19)
where the subscript 1 marks the first best delay, to distinguish it from the N best delays this beamforming algorithm computes. Taking a single best delay means maximizing (19); N=4 means choosing the $d_1$ to $d_4$ giving the four largest values of $\hat R_{i,\mathrm{ref}}(d)$ in formula (19).
Although the maximum of $\hat R_{i,\mathrm{ref}}(d)$ over a given analysis window is computed, the delay corresponding to that value does not always point at the correct speaker. In this beamforming system, every analysis window of every signal pair therefore yields the N relatively largest values of $\hat R_{i,\mathrm{ref}}(d)$, here with N=4 (other values are possible), and the most suitable delay is chosen from these N best delays before the weighted summation.
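A sketch of formulas (18)-(19) extended to the N best delays, assuming the two analysis windows have equal length; the small constant added before division guards against zero spectra and is an implementation detail, not from the text:

```python
import numpy as np

def gcc_phat_topn(x, ref, n_best=4):
    """GCC-PHAT: whiten the cross-spectrum, inverse-FFT, and return the
    N delays whose correlation peaks are largest (formulas (18)-(19))."""
    nfft = 2 * len(x)                        # assume equal-length windows
    X = np.fft.rfft(x, n=nfft)
    Xref = np.fft.rfft(ref, n=nfft)
    cross = X * np.conj(Xref)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=nfft)
    r = np.concatenate([r[-len(x):], r[:len(x)]])   # center zero delay
    lags = np.arange(-len(x), len(x))
    order = np.argsort(r)[::-1][:n_best]            # N best peaks
    return lags[order], r[order]
```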
(3) Dynamic channel weighting method:
In practice every microphone in the array behaves differently, so the power spectral densities of the additive recording noise differ. Moreover, if two microphones are far apart, the impulse response of the recording room makes their noise characteristics and noise amplitudes differ as well. This problem is solved by adaptive channel weights. The weight of the m-th channel in the c-th analysis window (see the optimization above for the analysis window concept) is given by formula (20):
$W_m[c]=(1-\alpha)\,W_m[c-1]+\alpha\;\overline{\mathrm{xcorr}}_m[c]$ ……(20)
where α is the adaptation coefficient, set empirically to α=0.05, and $\overline{\mathrm{xcorr}}_m[c]$ is the average cross-correlation of channel m with the other channels after best-delay alignment.
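A one-line sketch of the update in formula (20), plus the renormalization implied by the requirement that the weights sum to 1 at each instant:

```python
def update_weight(prev_weight, avg_xcorr, alpha=0.05):
    """Formula (20): exponentially smooth each channel's average
    cross-correlation with the other, delay-aligned channels."""
    return (1.0 - alpha) * prev_weight + alpha * avg_xcorr

def normalize(weights):
    """Rescale so that the channel weights sum to 1 at this instant."""
    total = sum(weights)
    return [w / total for w in weights]
```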
With the front-end Wiener filtering and beamforming of the microphone array, a clean and clear speech signal is thus obtained, which underwrites the accuracy of the subsequent processing.
In the method of the invention, the endpoint detection of step bc) aims to separate the speech portions from the silent portions of the captured audio signal; the invention uses endpoint detection based on short-time energy. In a closed environment such as a student dormitory there is generally no other clamorous noise and the captured signal has a high signal-to-noise ratio, so short-time-energy endpoint detection preserves detection accuracy while being simpler to implement and less demanding of hardware.
Short-time energy:
Let the time-domain samples of an audio signal be s(l), and let $S_n(m)$ be the m-th sample of the n-th frame after windowing; with E(n) denoting the short-time energy of frame n, formula (21):
$E(n)=\sum_{m=0}^{N-1}S_n(m)^2$ ……(21)
where n indexes the frame and N is the number of samples per frame.
After each frame's short-time energy is computed, it is compared with a preset threshold to classify the frame as silence or speech. As a rule, the silent portions of a signal have low energy and the portions where someone is speaking have high energy.
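A sketch of formula (21) and the threshold test; the 25 ms / 10 ms framing at 16 kHz is an assumed convention, and the threshold is left to the caller:

```python
import numpy as np

def short_time_energy(signal, frame_len=400, hop=160):
    """Formula (21): per-frame sum of squared samples
    (400/160 samples = 25 ms / 10 ms frames at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2.0)
                     for i in range(n_frames)])

def detect_speech(signal, threshold):
    """True for frames whose energy exceeds the preset threshold."""
    return short_time_energy(signal) > threshold
```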
After the endpoint detection above, only the speech portions are retained and the silent portions dropped; the retained speech then undergoes speaker segmentation and clustering and voiceprint recognition. In the method of the invention, the speaker segmentation and clustering of step bd) comprises: (1) speaker segmentation and (2) speaker clustering.
Figure 13 is a schematic flowchart of speaker segmentation and clustering.
The aim of speaker segmentation is to find the turning points at which the speaker changes, so that the input speech is cut by speaker into segments 1, 2, 3, …, N (segments 1 and 3, for example, may be the same person's speech, but because another person's speech lies between them, the audio is cut at the speaker turning points), each segment containing speech from a single speaker only. The aim of speaker clustering is to gather the segments of the same speaker, so that each class contains only one speaker's data and each speaker's data falls, as far as possible, into a single class (in the example above, segments 1 and 3 would be merged).
Speaker clustering in the invention works on LSP features: LSP (Line Spectrum Pair) feature data are extracted from the raw speech for the subsequent computation.
(1) Speaker segmentation
The crux of speaker segmentation is finding the turning points at which the speaker changes, which involves single turning point detection and multiple turning point detection:
(i) Single turning point detection:
As shown in Figure 14, single turning point detection comprises the following steps: speech feature extraction, distance-based sequential detection, cross detection, and turning point confirmation. The speech feature extraction is the same as described above (the previously extracted speech features can be used directly) and is not repeated here.
1) Distance-based sequential detection:
Figure 15 illustrates distance-based sequential detection of a single turning point. The method assumes that no turning point lies within the initial short interval of the speech segment. The opening portion of the speech (1-3 seconds) is first taken as the template window; this template is then compared with every sliding segment (of the same length as the template) by a distance computation, the invention using the generalized likelihood ratio as the distance measure, yielding a distance curve in which d(t) is the distance between the sliding window at time t and speaker 1's template window.
Figure 16 shows the distance curve after sequential detection. While the sliding window stays within the first speaker's range, both the template and the sliding window contain the first speaker's speech, so the distance is small. Once the sliding window reaches the second speaker's range it contains the second speaker's speech, so the distance grows. It can therefore be assumed that the second speaker's speech is most likely to lie near the maximum of the distance curve.
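For Gaussian segment models the generalized likelihood ratio reduces to a closed form in the segment covariances; a numpy sketch under that common assumption (the text does not spell out the model family):

```python
import numpy as np

def gaussian_loglik(X):
    """Maximized Gaussian log-likelihood of a feature segment X (n, D)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)   # regularized
    _, logdet = np.linalg.slogdet(cov)
    # at the ML estimate the quadratic term contributes the constant n*d
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def glr_distance(X, Y):
    """GLR between template X and sliding window Y: large when the two
    segments are unlikely to share a single Gaussian model."""
    Z = np.vstack([X, Y])
    return gaussian_loglik(X) + gaussian_loglik(Y) - gaussian_loglik(Z)
```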
2) Cross detection:
As shown in Figure 17, after sequential detection the second speaker's template window is determined by finding the maximum point of the distance curve.
With the second speaker's template found, the same method as before yields a second distance curve. As Figure 18 shows, the crossing of the two curves is the speaker turning point.
3) Turning point confirmation:
As shown in Figure 19, cross detection can produce a false alarm if the first speaker's speech is mistakenly taken as the second speaker's speech template. To reduce false alarms, every turning point must be further confirmed. The confirmation is formula (22):
$\sum_{i}\operatorname{sign}\big(d(i)-d_{cross}\big)$ ……(22)
where sign(·) is the sign function and $d_{cross}$ is the distance value at the crossing of the two distance curves.
Using the region from the start of speaker 2's distance curve to the crossing point (the boxed part of Figure 20), d(i) in formula (22) is the distance computed within that region. If the final result is positive, the point is accepted as a speaker turning point; if negative, it is rejected.
(ii) Multiple turning point detection:
Finding all speaker turning points in a whole speech segment builds on single turning point detection, as follows (a sketch of this loop is given below):
Step 1): set a relatively large time window (5-15 seconds long) and perform single turning point detection within it.
Step 2): if no speaker turning point was found in the previous step, shift the window right (by 1-3 seconds) and repeat Step 1) until a turning point is found or the speech segment ends.
Step 3): if a speaker turning point is found, record it, set the window start to that turning point, and repeat Steps 1)-2).
Through these steps, all turning points of the several speakers are found, and the speech is divided accordingly into segments 1 to N.
Single turning point detection and multiple turning point detection together thus complete the speaker segmentation.
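A sketch of the sliding-window loop of Steps 1)-3); single_point_detector stands in for the single turning point detection described above and is assumed to return a frame offset or None:

```python
def detect_all_turning_points(features, frame_rate=100,
                              win_s=10.0, shift_s=2.0,
                              single_point_detector=None):
    """Run single-turning-point detection inside a 5-15 s window; shift
    1-3 s right when nothing is found; restart the window at each
    confirmed turning point. Window/shift defaults are illustrative."""
    win, shift = int(win_s * frame_rate), int(shift_s * frame_rate)
    start, points = 0, []
    while start + win <= len(features):
        hit = single_point_detector(features[start:start + win])
        if hit is None:
            start += shift                   # Step 2): slide the window on
        else:
            points.append(start + hit)       # Step 3): record the point...
            start += max(hit, 1)             # ...and restart the window there
    return points
```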
(2) Speaker clustering
After speaker segmentation, speaker clustering groups the resulting segments so that segments from the same speaker are merged. Speaker clustering is a concrete application of clustering techniques to speech signal processing; its aim is to classify the speech segments so that each class contains only one speaker's data and the same speaker's data are all merged into the same class.
For this segment clustering, the invention proposes an Improved Hierarchical Clustering (IHC) method, which performs merging and determines the number of classes by minimizing the within-class sum of squared errors; the specific steps are shown in Figure 21:
Consider a set of speech segments $X=\{x_1,x_2,\dots,x_N\}$, where each $x_n$ is the feature sequence corresponding to one speech segment. Speaker clustering means finding a partition $C=\{c_1,c_2,\dots,c_K\}$ of X such that each $c_k$ contains only one speaker's speech data and all speech segments from that speaker are assigned to $c_k$ alone.
(1) Distance computation:
As in determining the speaker turning points, the generalized likelihood ratio is used as the distance measure.
(2) Improved sum-of-squared-error criterion:
The sum-of-squared-error criterion minimizes the within-class sum of squared errors. In speaker clustering, distances within one speaker's data are small while distances between different speakers' data are large, so this criterion works well.
In summary, the first stage of the IHC algorithm takes the distance measure as the similarity and the improved sum-of-squared-error criterion as the criterion function, merging clusters pairwise step by step until a cluster tree is formed.
(3) Determining the number of classes:
An important part of speaker clustering is automatically determining the number of classes objectively present in the data, i.e. how many speakers there are. The invention uses a class-number determination method based on hypothesis testing: every merge on the cluster tree is tested for its reasonableness, thereby determining the final number of classes. As soon as an unreasonable merge is found, the number of classes before that merge is taken as the final number of speakers.
Steps (1) and (2) adopt a different distance computation and a different clustering criterion, improving the correctness and quality of the clustering; the hypothesis-testing approach of (3) means the number of classes need not be specified manually: since how many people are speaking usually cannot be known in advance, this method lets the data be clustered into the appropriate number of classes according to the actual situation.
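The IHC algorithm itself is not fully specified here; as a hedged stand-in, Ward-linkage agglomerative clustering also merges by the smallest increase in within-class squared error, with the hypothesis-test stopping rule approximated by a distance cutoff:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_segments(segment_features, cutoff):
    """segment_features: list of (n_frames_i, D) arrays, one per segment.
    Each segment is summarized by its mean vector, then merged bottom-up."""
    X = np.stack([seg.mean(axis=0) for seg in segment_features])
    tree = linkage(X, method="ward")         # merge by smallest SSE increase
    return fcluster(tree, t=cutoff, criterion="distance")
```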
The preferred embodiments of the invention have been described above to make its spirit clearer and easier to understand, not to limit it; all modifications, substitutions and improvements made within the spirit and principles of the invention shall fall within the scope of protection defined by the appended claims.

Claims (30)

  1. A voice access control and quiet environment monitoring method, comprising the following steps:
    — a voice access control recognition step for voice verification at the door, applying speech recognition and then voiceprint recognition to the captured audio of the person to be verified;
    — a quiet environment monitoring step for voice monitoring in a quiet environment, comprising in sequence endpoint detection, speaker segmentation and clustering, and voiceprint recognition;
    — a central processing step for processing the data of the voice access control recognition step and the quiet environment monitoring step.
  2. The method of claim 1, wherein the voice access control recognition step further comprises:
    aa) the person to be verified triggers voiceprint verification;
    ab) a verification string is presented;
    ac) the person to be verified reads the verification string aloud;
    ad) the read-aloud audio is recorded; speech recognition first checks whether the correct string was spoken, and voiceprint verification then checks whether the speaker is a valid registrant, on which basis the system decides whether to open the door.
  3. The method of claim 2, wherein the quiet environment monitoring step further comprises:
    ba) monitoring is switched on during a prescribed time period;
    bb) endpoint detection is started to judge whether the environment is quiet;
    bc) if the environment is judged non-quiet, the offending audio segment is located by endpoint detection;
    bd) speaker segmentation and clustering analysis is performed on the detected audio segment, separating it into per-speaker audio data;
    be) using the stored voiceprint models, voiceprint recognition is performed on each per-speaker audio item to obtain the identity of its speaker;
    bf) the identity information, the corresponding audio data and the emission time are sent to and displayed for the administrator.
  4. The method of claim 3, wherein in step bd),
    the speaker segmentation and clustering analysis comprises a speaker segmentation step, a speaker clustering step and a voiceprint recognition step;
    the speaker segmentation step is used to find the turning points at which the speaker changes, and comprises single turning point detection and multiple turning point detection;
    the single turning point detection comprises distance-based sequential detection, cross detection and turning point confirmation;
    the multiple turning point detection is used to find all speaker turning points in a whole speech segment and is built on the single turning point detection, as follows:
    Step 1): set a relatively large time window, 5-15 seconds long, and perform single turning point detection within the window;
    Step 2): if no speaker turning point is found in the previous step, shift the window 1-3 seconds to the right and repeat Step 1) until a turning point is found or the speech segment ends;
    Step 3): if a speaker turning point is found, record it, set the window start to that turning point, and repeat Steps 1)-2).
  5. The method of claim 4, wherein the turning point confirmation formula is:
    $\sum_{i}\operatorname{sign}\big(d(i)-d_{cross}\big)$
    where sign(·) is the sign function and $d_{cross}$ is the distance value at the crossing of the two distance curves;
    using the region from the start of the speaker's distance curve to the crossing point, d(i) is the distance computed within that region; if the final result is positive, the point is accepted as a speaker turning point; if negative, it is rejected.
  6. The method of any one of claims 2-5, wherein in the voice access control recognition step the presented verification string is a randomly generated multi-digit string, so that the information to be read aloud differs on every attempt.
  7. The method of any one of claims 1-5, wherein the endpoint detection is implemented with a 360-degree circular microphone array, to guarantee the sensitivity of audio capture and the quality of the captured audio.
  8. The method of any one of claims 2-5, wherein, on the basis of recording the read-aloud audio in step ad), the voice access control recognition step further comprises a step ae):
    namely, for each registrant, each read-aloud audio recording is saved as training audio for that registrant's voiceprint model, until the voiceprint model has been built successfully.
  9. The method of claim 8, wherein the voiceprint model of step be) is trained on the audio data saved in step ae).
  10. The method of claim 9, wherein,
    when the person to be verified triggers voiceprint verification, facial image capture is started simultaneously to capture the person's facial image; after the facial image is obtained, it is compared in the central processing step to obtain the person's information, and the captured speech signal is associated with the registration information to form an association database.
  11. The method of claim 10, wherein,
    once a person to be verified has entered the closed environment, that person's information is activated; for people who have registered but not entered the dormitory, the system does not activate their information but sends it to the administrator.
  12. The method of claim 11, wherein,
    in step be), comparison is first made against the activated information;
    if no matching person is found among the activated people, the comparison is widened to all registrants, and if that comparison succeeds a prompt of illegal entry or failure to check in properly is produced;
    if it does not succeed, an illegal intrusion alert is produced.
  13. The method of any one of claims 1-5, wherein in each unit of the closed environment there are provided:
    at least one circular microphone array;
    an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and
    a sound playback device for communicating with the people in the monitored environment.
  14. The method of any one of claims 1-5, wherein
    the central processing step sends and displays the identity information, the corresponding audio data and the emission time to the administrator, transmitting them to a monitoring device associated with the system back end or the central processing step, so that the supervisor can manage intuitively and conveniently and take appropriate measures.
  15. A voice access control and quiet environment monitoring system, comprising a voice access control recognition module, a quiet environment monitoring module and a central processing module, wherein:
    the voice access control recognition module is used to perform voice verification at the door, applying speech recognition and then voiceprint recognition to the captured audio of the person to be verified;
    the quiet environment monitoring module is used to perform voice monitoring in a quiet environment and comprises in sequence endpoint detection, speaker segmentation and clustering, and voiceprint recognition;
    the voice access control recognition module and the quiet environment monitoring module are both connected to the central processing module.
  16. The system of claim 15, wherein:
    the quiet environment monitoring module further comprises a speaker segmentation module, a speaker clustering module and a voiceprint recognition module;
    the speaker segmentation module is used to find the turning points at which the speaker changes, and comprises single turning point detection and multiple turning point detection;
    the single turning point detection comprises distance-based sequential detection, cross detection and turning point confirmation;
    the multiple turning point detection is used to find all speaker turning points in a whole speech segment and is built on the single turning point detection, as follows:
    Step 1): set a relatively large time window, 5-15 seconds long, and perform single turning point detection within the window;
    Step 2): if no speaker turning point is found in the previous step, shift the window 1-3 seconds to the right and repeat Step 1) until a turning point is found or the speech segment ends;
    Step 3): if a speaker turning point is found, record it, set the window start to that turning point, and repeat Steps 1)-2).
  17. The system of claim 16, wherein
    the turning point confirmation formula is:
    $\sum_{i}\operatorname{sign}\big(d(i)-d_{cross}\big)$
    where sign(·) is the sign function and $d_{cross}$ is the distance value at the crossing of the two distance curves;
    using the region from the start of the speaker's distance curve to the crossing point, d(i) is the distance computed within that region; if the final result is positive, the point is accepted as a speaker turning point; if negative, it is rejected.
  18. The system of any one of claims 15-17, wherein:
    the voice access control recognition module is arranged outside the door of the closed environment and comprises a microphone for capturing audio, a button for triggering access control recognition, and a display device for displaying the string.
  19. The system of claim 18, wherein:
    the voice access control recognition module further comprises a voice playback device for interacting with the person to be verified;
    an infrared detection unit replaces the button, so that verification starts automatically when a person to be verified approaches.
  20. The system of claim 18, wherein:
    the voice access control recognition module further comprises a facial image capture device for capturing the face of the person to be verified.
  21. The system of claim 20, wherein:
    the voice access control recognition module further comprises an interface for connecting a mobile terminal; once the mobile terminal is connected through the interface, the functions of the microphone, button, display device and facial image capture device are performed by the mobile terminal's microphone, on-screen virtual button, display screen and camera.
  22. The system of claim 21, wherein:
    the mobile terminal is installed with an APP or PC software client implementing the voice access control recognition function.
  23. The system of claim 22, wherein:
    the mobile terminal is connected to the door opening/closing system by wire or wirelessly, so as to open or close the door according to the verification result.
  24. The system of any one of claims 15-17, wherein:
    before entering, the person to be verified presses the button triggering access control recognition to start speech recognition; the facial image capture device is switched on simultaneously to capture the person's facial image, which is sent to the central processing module for comparison to obtain the person's registration information, and the captured speech signal is associated with the registration information to form an association database.
  25. The system of claim 24, wherein:
    once a person to be verified has entered the closed environment, the system activates that person's information; for people who have registered but not entered the dormitory, the system does not activate their information but sends it to the system administrator.
  26. The system of claim 25, wherein:
    when comparing, the system first compares against the activated information;
    if no matching person is found among the activated people, the comparison is widened to all registrants, and if that comparison succeeds a prompt of illegal entry or failure to check in properly is produced;
    if it does not succeed, an illegal intrusion alert is produced, and the administrator can confirm the information by voice interaction.
  27. The system of claim 24, wherein:
    the quiet environment monitoring module is arranged in each unit of the closed environment and comprises at least one circular microphone array.
  28. The system of any one of claims 15-17, further comprising:
    an ambient brightness recognition unit for detecting the brightness of the dormitory environment and automatically switching monitoring on or off; and
    a sound playback device for communicating with the people in the monitored environment.
  29. The system of claim 28, wherein:
    the central processing module is arranged separately at the system back end, and may be integrated with the voice access control recognition module or with the quiet environment monitoring module, for processing and displaying the monitoring information obtained by the quiet environment monitoring module.
  30. The system of claim 28, wherein:
    the central processing module sends and displays the identity information, the corresponding audio data and the emission time to the administrator, transmitting them to a monitoring device connected to the system back end or the central processing module, so that the supervisor can manage intuitively and conveniently and take appropriate measures.