CN115116447A - Method and device for acquiring audio registration information and electronic equipment - Google Patents

Method and device for acquiring audio registration information and electronic equipment

Info

Publication number
CN115116447A
CN115116447A (application CN202210783733.1A)
Authority
CN
China
Prior art keywords
audio
voice
user
segments
conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210783733.1A
Other languages
Chinese (zh)
Inventor
王斌
姚佳立
李想
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202210783733.1A
Publication of CN115116447A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a method and a device for acquiring audio registration information, and an electronic device, and in particular to the technical field of speech recognition. The method comprises the following steps: acquiring a plurality of first audio segments of the same user from conference audio; performing human voice detection on the plurality of first audio segments, and acquiring a human voice audio segment from the plurality of first audio segments; and acquiring a first digital mark according to the human voice audio segment, and registering the first digital mark as the audio registration information of the first user. The embodiments of the disclosure address the problems in the prior art that audio registration is inconvenient for the user and that the accuracy of the audio registration information is low.

Description

Method and device for acquiring audio registration information and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for acquiring audio registration information, and an electronic device.
Background
Speaker recognition, also called voiceprint recognition, exploits the fact that every speaker's voice has unique characteristics, so the voices of different speakers can be effectively recognized and distinguished by those characteristics. Because a voiceprint cannot be lost or forgotten and requires nothing to be memorized, it is convenient to use and is widely applied to security verification, access control and other fields. In the prior art, a text for recognition is provided when a user registers: the user actively reads the text aloud to obtain the user's registration audio, and voiceprint information is then extracted from the registration audio and stored in a voiceprint library as the user's registered voiceprint for speaker recognition. However, this approach requires the user to spend extra time reading the fixed text, and the pronunciation habits and speech rate when reading a fixed text may differ from those in normal communication, so the extracted voiceprint information is inaccurate and the accuracy of speaker recognition is affected.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problems, the present disclosure provides a method, an apparatus and an electronic device for obtaining audio registration information, which address the prior-art problems that audio registration is inconvenient for the user and that the accuracy of the audio registration information is low.
In order to achieve the above object, the embodiments of the present disclosure provide the following technical solutions:
in a first aspect, a method for obtaining audio registration information is provided, where the method includes:
acquiring a plurality of first audio clips of the same user from conference audio;
carrying out voice detection on the plurality of first audio clips, and acquiring voice audio clips from the plurality of first audio clips;
and acquiring a first digital mark according to the voice audio clip, and registering the first digital mark as the audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, performing voice detection on a plurality of first audio segments, and acquiring a voice audio segment from the plurality of first audio segments includes:
splicing a plurality of first audio segments into a combined audio segment, and carrying out voice detection on the combined audio segment to obtain a voice audio segment;
or, alternatively,
performing voice detection on each first audio segment in the plurality of first audio segments to obtain a plurality of human voice segments, and splicing the plurality of human voice segments into the human voice audio segment.
As an optional implementation manner of the embodiment of the present disclosure, acquiring the first digital mark according to the voice audio clip includes:
dividing the voice audio clip into a plurality of voice audio sub-clips;
extracting the audio feature information of each human voice audio sub-segment in the plurality of human voice audio sub-segments, and performing audio clustering according to the audio feature information of each human voice audio sub-segment to obtain a plurality of clustered human voice segments;
determining a target clustered human voice segment with the longest duration from the plurality of clustered human voice segments;
and acquiring a first digital mark of the target clustered human voice segment.
As an optional implementation manner of the embodiment of the present disclosure, obtaining a first digital mark of a target clustered human voice segment includes:
acquiring a target audio segment with a preset duration from the target clustered human voice segment;
a first digitized mark of a target audio segment is obtained.
As an optional implementation manner of the embodiment of the present disclosure, acquiring the first digital mark according to the voice audio clip includes:
extracting a second digital mark of the human voice audio segment;
acquiring at least one historical audio clip;
extracting the digital mark from at least one historical audio clip to obtain at least one third digital mark, wherein the at least one historical audio clip is the audio clip of the first user respectively obtained from at least one historical conference audio;
the first digitized mark is determined based on the second digitized mark and the third digitized mark.
As an optional implementation manner of the embodiment of the present disclosure, acquiring the first digital mark according to the voice audio clip includes:
extracting a second digital mark of the human voice audio segment;
and determining the first digital mark according to the second digital mark and the stored third digital mark, wherein the stored third digital mark is historical audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, acquiring multiple first audio clips of the same user from conference audio includes:
acquiring a conference record corresponding to the conference audio, and displaying the conference record; the conference record comprises: a plurality of user identities displayed in association with conference subtitles generated based on a plurality of audio segments of the conference audio;
and responding to the selection operation aiming at the identification of the target user, and acquiring a plurality of first audio segments associated with the identification of the target user from a plurality of audio segments of the conference audio as a plurality of first audio segments of the same user, wherein the identification of the target user is any one user identification in a plurality of user identifications.
In a second aspect, an apparatus for acquiring audio registration information is provided, the apparatus comprising:
the acquisition module is used for acquiring a plurality of first audio clips of the same user from conference audio;
the detection module is used for carrying out voice detection on the plurality of first audio clips and acquiring voice audio clips from the plurality of first audio clips;
and the registration module is used for acquiring the first digital mark according to the voice audio clip and registering the first digital mark as the audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, the detection module is specifically configured to splice a plurality of first audio segments into a combined audio segment, and perform voice detection on the combined audio segment to obtain a voice audio segment;
or, alternatively,
perform voice detection on each first audio segment in the plurality of first audio segments to obtain a plurality of human voice segments, and splice the plurality of human voice segments into the human voice audio segment.
As an optional implementation manner of the embodiment of the present disclosure, the registration module is specifically configured to divide a voice audio segment into a plurality of voice audio sub-segments;
extract the audio feature information of each human voice audio sub-segment in the plurality of human voice audio sub-segments, and perform audio clustering according to the audio feature information of each human voice audio sub-segment to obtain a plurality of clustered human voice segments;
determine a target clustered human voice segment with the longest duration from the plurality of clustered human voice segments;
and acquire a first digital mark of the target clustered human voice segment.
As an optional implementation manner of the embodiment of the present disclosure, the registration module is specifically configured to obtain a target audio segment with a preset duration from the target clustered human voice segments;
a first digitized mark of a target audio segment is obtained.
As an optional implementation manner of the embodiment of the present disclosure, the registration module is specifically configured to extract a second digital mark of the human voice audio clip;
obtaining at least one historical audio clip;
extracting the digital mark from the at least one historical audio clip to obtain at least one third digital mark, wherein the at least one historical audio clip is the audio clip of the first user respectively obtained from at least one historical conference audio;
the first digitized mark is determined based on the second digitized mark and the third digitized mark.
As an optional implementation manner of the embodiment of the present disclosure, the registration module is specifically configured to extract a second digital mark of the human voice audio clip;
and determine the first digital mark according to the second digital mark and the stored third digital mark, wherein the stored third digital mark is historical audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining module is specifically configured to obtain a conference record corresponding to the conference audio, and display the conference record; the conference record comprises: a plurality of user identities displayed in association with conference subtitles generated based on a plurality of audio segments of the conference audio;
and responding to the selection operation aiming at the identification of the target user, acquiring a plurality of first audio segments associated with the identification of the target user from a plurality of audio segments of the conference audio as a plurality of first audio segments of the same user, wherein the identification of the target user is any one of the user identifications of the plurality of user identifications.
In a third aspect, an electronic device is provided, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing a method of obtaining audio registration information as described in the first aspect or any one of its alternative embodiments.
In a fourth aspect, a computer-readable storage medium is provided, comprising: the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements a method of obtaining audio registration information as described in the first aspect or any one of its alternative embodiments.
In a fifth aspect, a computer program product is provided, comprising: the computer program product, when run on a computer, causes the computer to implement a method of obtaining audio registration information as in the first aspect or any one of its alternative embodiments.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the method and the device, the audio frequency segment of the user is obtained from the conference audio, then the voice audio frequency segment of the user is obtained through voice detection, then the digital mark is obtained from the voice audio frequency segment, and the voice registration information of the user is registered, so that the user does not need to repeatedly read the specified text and register according to a fixed flow in the scene that the user normally speaks to open a meeting, on one hand, the convenience of the user in voice registration is improved, on the other hand, the difference between the voice characteristic information during registration and the voice characteristic information under the normal speaking condition of the user is avoided, and the accuracy of voiceprint recognition and the adaptability of an application scene are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an implementation scenario of a method for acquiring audio registration information according to an embodiment of the present disclosure;
fig. 2 is a first flowchart illustrating a method for acquiring audio registration information according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a page for displaying a meeting record according to an embodiment of the present disclosure;
fig. 4A is a schematic flowchart illustrating a second method for acquiring audio registration information according to an embodiment of the disclosure;
fig. 4B is a schematic flowchart illustrating a third method for acquiring audio registration information according to an embodiment of the disclosure;
FIG. 5 is a block diagram of an apparatus for obtaining audio registration information according to an embodiment of the disclosure;
fig. 6 is a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Currently, audio registration is performed according to a fixed flow. For example, when audio registration is performed for a near-field device, it is set up in an application program on the mobile phone, the user operates step by step according to the prompts on the phone, and repeatedly reads aloud the specified text provided on the mobile terminal during the registration process. When a far-field intelligent device performs audio registration, the user issues a command, the intelligent device starts the registration mode after receiving the command, and the user likewise has to repeatedly read the specified text according to the prompts to complete the registration. Therefore, existing audio registration requires the user to spend extra time reading the specified text and has a high interaction cost, and the pronunciation habits and speech rate when the user reads the specified text may differ from those in normal communication, so the voiceprint information extracted during audio registration is inaccurate and the accuracy of speaker recognition is affected.
To solve the above problem, an embodiment of the present disclosure provides a method for acquiring audio registration information. An audio segment of the user is acquired in a scene where speakers communicate normally, such as a conference scene, and human voice detection is then performed to obtain the user's human voice audio segment, which reduces the influence of noise. Furthermore, the user's digitized mark is obtained from the human voice segment and registered as the audio registration information, which removes the tedious audio registration process: the user does not need to spend extra time reading a specified text, and the user experience is improved. Because the registration information used for voiceprint recognition is extracted from audio obtained in a normal communication scene, the accuracy of voiceprint recognition and the adaptability to the application scene are also improved.
It is understood that before or during the application of the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed of the type, usage scope, usage scenario, etc. of the personal information (e.g., user information, information such as digitized marks of audio) related to the present disclosure in a proper manner according to relevant laws and regulations and obtain the authorization of the user.
For example, the embodiment of the present disclosure involves acquiring a plurality of first audio segments and acquiring a digitized mark. In practical applications, before the plurality of first audio segments and the digitized mark are acquired, the user's authorization may be requested so as to allow the speaker recognition function to be turned on and the digitized mark corresponding to the user's audio to be acquired.
For another example, in the embodiment of the present disclosure, before user information is obtained, the user's authorization to obtain the user information may be requested, and the user information is obtained only after the user allows it.
For another example, when an active request from the user to perform a certain operation is received, a prompt message may be sent to the user to explicitly inform the user that the requested operation will require the acquisition and use of the user's personal information. The user can then autonomously decide, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation manner, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in a pop-up window, and the prompt information may be presented as text in the pop-up window. In addition, the pop-up window may carry a selection control with which the user chooses "agree" or "disagree" to decide whether to provide personal information to the electronic device.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
As shown in fig. 1, fig. 1 is a schematic view of an implementation scenario of a method for acquiring audio registration information according to an embodiment of the present disclosure. The scenario involves a server 101 and three terminal devices, namely a terminal device 102, a terminal device 103 and a terminal device 104, and three users, namely a user A, a user B and a user C. The user A uses the terminal device 102, the user B uses the terminal device 103, and the user C uses the terminal device 104 to hold an online conference. The terminal device 102 used by the user A records the conference audio and sends it to the server 101, and the server 101 acquires the audio registration information of the user A from the conference audio by the method of the embodiment of the present disclosure and sends it back to the terminal device 102.
In the scenario shown in fig. 1, the terminal device 102 sends the recorded conference audio to the server 101. The server 101 first divides the conference audio into a plurality of audio segments according to the different audio signals in the conference audio, so as to distinguish the audio segments of the user A, the user B and the user C. It should be noted that each user is distinctive, and correspondingly the audio signal of each user in the conference audio is distinctive, so the server 101 separates the conference audio according to the different audio signals corresponding to the users. Optionally, the server 101 may associate the separated audio signals with the corresponding users according to the correspondence between audio signals and users stored in an audio signal library.
Optionally, the method for acquiring audio registration information provided in the embodiment of the present disclosure may be implemented by an electronic device. In a specific application, the electronic device may be a server, or the electronic device may be a terminal device. When the electronic device is a server, the execution subject of the method may be a server-side program running in the server and corresponding to an information interaction terminal with a voice interaction function. When the electronic device is a terminal device, the execution subject of the method may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices and information interaction terminals with a voice interaction function. In a specific application, the information interaction terminal may be an intelligent interaction device with a voice interaction function, such as an intelligent robot or an intelligent household appliance; or the information interaction terminal may be a client with a voice interaction function, for example a video client or an educational learning client. In addition, it can be understood that the client may reasonably be a web-page client or an application (app) client.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the technical terms used in the description of the embodiments or the prior art will be briefly introduced below.
Speaker Recognition (SR) is a biometric technology in which a computer performs personal identification using physiological or behavioral characteristics inherent to the human body. It is also known as Voiceprint Recognition (VPR), and is the process of automatically determining whether a speaker belongs to an established speaker set, and who the speaker is, by analyzing the received speaker voice signal and extracting features from it.
Clustering is the division of a data set into different classes or clusters according to a certain criterion, so that the similarity of data objects in the same cluster is as large as possible, and the difference of data objects not in the same cluster is also as large as possible.
As shown in fig. 2, fig. 2 is a first schematic flowchart of a method for acquiring audio registration information according to an embodiment of the present disclosure, where the method includes steps S201 to S203:
s201, acquiring a plurality of first audio clips of the same user from conference audio.
The conference audio is audio recorded while at least one user speaks normally in a conference scene, and comprises an audio segment of the at least one user. The audio segments include human voice audio segments and noise segments, where the noise segments include, but are not limited to, silent segments, ambient noise and white noise, for example blank audio and electronic noise generated during recording by the device.
In some embodiments, the audio segments of each user are distinguished based on the different voiceprint features of each user in the conference audio. Voiceprint features are acoustic spectrum features carrying speech information; they are specific and relatively stable, that is, the voiceprint features of different users are distinct, while the voiceprint features of the same user are relatively stable. Therefore, the voiceprint features can be extracted and used to distinguish users and determine the audio segments corresponding to the same user. It should be noted that the user's authorization is required for obtaining the user's voiceprint features.
In some embodiments, after user differentiation is performed according to voiceprint characteristics of different users, an audio clip corresponding to the same user is determined, where the same user is any user of the at least one user. Therefore, the audio clips corresponding to the users participating in the conference scene are determined, and the corresponding audio registration information can be obtained for the users.
In some embodiments, in the process of acquiring the plurality of first audio segments of the same user from the conference audio, a conference record corresponding to the conference audio is first acquired and displayed. The conference record includes a plurality of user identities displayed in association with conference subtitles generated based on a plurality of audio segments of the conference audio. Before some conferences begin, a conference schedule is usually prepared in advance to notify the participants, and the conference schedule usually includes information about the participants, that is, information about a plurality of users. The relevant information of a user may be a user identity, such as a user name. The embodiment of the present disclosure provides an implementation in which the audio segments and/or the conference subtitles are labeled according to the user identities in the conference schedule to generate and display the conference record corresponding to the conference audio. The above processing steps can be executed for the at least one user participating in the conference, so that the audio segments and/or conference subtitles of each participating user are labeled with the corresponding user identity, and the resulting conference record displays the associated user identity together with the user's audio segments and/or conference subtitles. When the user consults the conference record after the conference, the user can tell which user each recorded conference subtitle belongs to, so the labeling of the conference record is clearer and the human-computer interaction is better.
Exemplarily, as shown in fig. 3, fig. 3 is a schematic diagram of a page for displaying a conference record provided by an embodiment of the present disclosure. The page displays a video picture of the conference and the text record (i.e. text content) corresponding to the conference, and for each user, the avatar and user name of the user are marked next to the corresponding conference subtitles. As can be seen from fig. 3, the text record shows the conference subtitles, avatar and user name of two users, including: the user with the user name "small A", the conference subtitles spoken by "small A" in the conference, and the avatar of "small A"; and the user with the user name "small B", the conference subtitles spoken by "small B" in the conference, and the avatar of "small B".
Further, in response to a selection operation for a target user identity, acquiring a plurality of first audio segments associated with the target user identity from the plurality of audio segments of the conference audio as a plurality of first audio segments of the same user, wherein the target user identity is any one of the plurality of user identities.
Illustratively, following the above example, in response to a user selection operation for the user identity "small A", "small A" is taken as the target user identity, and a plurality of first audio segments of "small A" are obtained from the plurality of audio segments of the conference audio as the plurality of first audio segments of the same user.
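The association between user identities and audio segments in the conference record, and the selection of a target user's segments, can be illustrated with a small data-structure sketch. The sketch below is not part of the patent; the RecordEntry fields and the function name are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass

@dataclass
class RecordEntry:
    user_id: str          # displayed user identity, e.g. "small A"
    subtitle: str         # conference subtitle generated from this audio segment
    audio: list[float]    # samples of the underlying audio segment

def first_audio_segments(record: list[RecordEntry], target_user_id: str) -> list[list[float]]:
    """Respond to the selection of a target user identity by collecting every
    audio segment in the conference record associated with that identity."""
    return [entry.audio for entry in record if entry.user_id == target_user_id]

# e.g. first_audio_segments(record, "small A") returns the first audio segments of "small A"
```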
It can be understood that when an editor binds one or more first audio segments with the identity of "small A", the execution subject of the embodiment of the present disclosure, or an electronic device associated with it, may send a notification message to "small A" to inform "small A" that a binding operation was performed between his or her identity and certain audio segments. "Small A" can then check the conference content through the notification message to determine whether the editor's binding operation is wrong. If the binding is wrong, "small A" can re-edit it in a certain manner or notify the editor to modify the binding result, so as to avoid the adverse effects of misoperation.
In the process of determining the audio segments corresponding to the same user, the user's audio is discontinuous because the user pauses while speaking or several users speak alternately. In order to improve the accuracy of speaker recognition, a plurality of audio segments of the same user therefore need to be acquired from the conference audio and used as the complete audio of the user for extracting the human voice audio segment.
S202, carrying out voice detection on the plurality of first audio clips, and obtaining voice audio clips from the plurality of first audio clips.
Here, voice detection (voice activity detection, VAD) is used to identify and eliminate long silent segments from the first audio segments, and the voice detection methods include, but are not limited to: performing voice detection by using automatic gain control (AGC) with a speech sensing (SpeechSense) algorithm, and performing voice detection by using a voice endpoint detector.
Wherein the voice endpoint detector is configured to detect whether voice data of a human voice is present in a noisy environment. The following will describe the process of detecting the voices of the plurality of first audio segments in the embodiment of the present disclosure by taking voice detection by using the voice endpoint detector as an example:
and aiming at each frame of voice signal input into the voice endpoint detector, the voice endpoint detector scores according to the probability that the voice signal in the first audio segment is a voice frame or a noise frame, and when the scoring value of the voice frame is greater than a preset judgment threshold, the voice frame is judged, otherwise, the voice frame is a noise frame. The voice endpoint detector distinguishes the voice frame and the noise frame according to the judgment result so as to remove the noise frame in the first audio segment. The decision threshold of the embodiment is a default decision threshold in a Web Real-Time Communication (Webrtc) source code, and the decision threshold is obtained by analyzing a large amount of data during Webrtc technology development, so that the distinguishing effect and accuracy are improved, and the model training workload of the voice endpoint detector is reduced.
In some embodiments, when detecting the voices of the plurality of first audio segments, the plurality of first audio segments may be first spliced into a combined audio segment, and then the combined audio segment is subjected to voice detection to obtain the voice audio segment of the first user from the combined audio segment.
Illustratively, the plurality of first audio segments of the first user acquired from the conference audio are an audio segment A1, an audio segment A2 and an audio segment A3; these audio segments are spliced into a combined audio segment A, and human voice detection is performed on the combined audio segment A to obtain a human voice audio segment B.
In other embodiments, when performing voice detection on a plurality of first audio segments, voice detection may be performed on each of the plurality of first audio segments to obtain a plurality of voice segments, and then the plurality of voice segments are spliced into a voice audio segment.
Illustratively, the plurality of first audio segments of the first user acquired from the conference audio are an audio segment A1, an audio segment A2 and an audio segment A3; human voice detection is performed on each of the audio segments to obtain the human voice segments B1, B2 and B3 in each audio segment, and the human voice segments B1, B2 and B3 are then spliced to determine the human voice audio segment B.
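The two orderings of splicing and detection can be sketched as follows; the energy-based vad helper is a simplified stand-in for the voice endpoint detector and its parameters are assumptions, not values taken from the patent.

```python
import numpy as np

def vad(segment: np.ndarray, frame_len: int = 480, threshold: float = -8.0) -> np.ndarray:
    """Crude energy-based voice detection standing in for a real detector."""
    n = len(segment) // frame_len
    frames = segment[: n * frame_len].reshape(n, frame_len)
    return frames[np.log(np.mean(frames ** 2, axis=1) + 1e-10) > threshold].reshape(-1)

def splice_then_detect(segments: list[np.ndarray]) -> np.ndarray:
    """Splice A1, A2, A3 into the combined segment A, then detect once to get B."""
    return vad(np.concatenate(segments))

def detect_then_splice(segments: list[np.ndarray]) -> np.ndarray:
    """Detect on each segment to get B1, B2, B3, then splice them into B."""
    return np.concatenate([vad(s) for s in segments])
```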
In some embodiments, before human voice detection is performed on the plurality of first audio segments, noise reduction processing may be performed on the first audio segments. The noise reduction methods include, but are not limited to: an adaptive least mean square (LMS) filter, an adaptive notch filter, basic spectral subtraction, a Wiener filter, and the like; the present disclosure does not limit the noise reduction method.
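For reference, a minimal sketch of basic spectral subtraction, one of the noise reduction methods listed above, is shown below. It assumes the leading frames are noise only and omits windowing and overlap-add, so it is illustrative rather than production-ready.

```python
import numpy as np

def spectral_subtraction(signal: np.ndarray, frame_len: int = 512,
                         noise_frames: int = 10) -> np.ndarray:
    """Estimate the noise magnitude spectrum from the leading frames (assumed to be
    noise only) and subtract it from the magnitude spectrum of every frame."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spectra)), n=frame_len, axis=1)
    return clean.reshape(-1)
```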
In the above embodiment, by splicing first and then detecting, or detecting first and then splicing, the silent segments and the background noise of the environment in which the user is located in the conference scene are removed, the influence of the noise data in the audio segments on the voiceprint recognition effect is reduced, and the human voice audio segment of the first user is obtained. This improves the accuracy of human voice detection and facilitates the subsequent acquisition of the first user's digitized mark, thereby improving the success rate of voiceprint recognition.
S203, acquiring a first digital mark according to the voice audio clip, and registering the first digital mark as audio registration information of the first user.
The first digital mark is a sound wave frequency spectrum carrying speech information, is audio registration information of the first user for audio registration, is used for distinguishing and identifying user sounds, and can be a voiceprint feature.
In some embodiments, after the human voice audio segment is obtained, human voice audio of users other than the first user may still be mixed into it. Therefore, on the basis of the user differentiation already performed on the conference audio, and in order to improve the effectiveness, pertinence and timeliness of acquiring the audio registration information and to ensure the accuracy of the extraction of the user's voice features, the obtained human voice audio segment is further divided according to the different voice features of the at least one user, and a plurality of human voice audio sub-segments are obtained after the division. The audio feature information of each human voice audio sub-segment then needs to be extracted from the plurality of human voice audio sub-segments, and this audio feature information is used to distinguish the human voice audio segments of the individual users.
The audio feature information may be short-time spectral features such as Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) features or filter bank (FBank) features, or features extracted based on a time-delay neural network (TDNN), such as an identity vector (i-vector).
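A sketch of per-sub-segment feature extraction using MFCCs is given below; it assumes the librosa library is available, the sub-segment length is hypothetical, and averaging the MFCC frames into one vector per sub-segment is a simplification for illustration rather than the patent's prescribed procedure.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def subsegment_features(voice_audio: np.ndarray, sr: int = 16000,
                        subseg_seconds: float = 1.5) -> list[np.ndarray]:
    """Split the human voice audio segment into fixed-length sub-segments and
    extract one MFCC-based feature vector per sub-segment (mean over frames)."""
    hop = int(subseg_seconds * sr)
    features = []
    for start in range(0, len(voice_audio) - hop + 1, hop):
        sub = voice_audio[start:start + hop]
        mfcc = librosa.feature.mfcc(y=sub, sr=sr, n_mfcc=20)  # shape (20, n_frames)
        features.append(mfcc.mean(axis=1))
    return features
```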
After the audio feature information of each human voice audio sub-segment is extracted from the plurality of human voice audio sub-segments, audio clustering is performed according to the audio feature information of the human voice audio sub-segments. In the audio clustering process, the similarity between the pieces of feature information is used as the clustering basis, and various clustering algorithms can be used. The clustering algorithm may include a distance-based clustering algorithm, a density-based clustering algorithm or the like, which is not specifically limited in this disclosure. For example, spectral clustering, K-means clustering, mean-shift clustering, expectation-maximization (EM) clustering using a Gaussian mixture model (GMM), agglomerative hierarchical clustering, graph community detection and the like may be employed.
In some embodiments, any way of calculating the similarity between the pieces of feature information can be applied to audio clustering. For example, the feature sequences of the audio feature information may be compared, and the similarity between the feature sequences is used as the similarity between the pieces of audio feature information; alternatively, the audio feature information may be vectorized, the distance between the vectorized pieces of audio feature information is calculated, and the reciprocal of the distance is used as the similarity of the features. Of course, the method is not limited thereto, and the disclosure places no limitation on it.
In some embodiments, a preset number of pieces of audio feature information are taken from the audio feature information of the human voice audio sub-segments, and the similarity between every two of the preset number of pieces of audio feature information is calculated. If the similarity between two pieces of audio feature information is greater than a maximum similarity threshold, the two pieces are grouped into one cluster, until, for each of the preset number of pieces of audio feature information, its similarity to every piece of audio feature information in the cluster it belongs to is greater than the maximum similarity threshold.
Illustratively, the minimum similarity threshold is set to 0.50, the maximum similarity threshold to 0.85 and the preset number to 10. After the audio feature information of the human voice sub-segments is extracted, clustering according to the audio feature information is triggered: 10 pieces of feature information are selected from the extracted audio feature information of the human voice sub-segments, the similarity between every two of them is calculated, and if the similarity is greater than 0.85 the two pieces of audio feature information are grouped into one cluster, until the similarity between each of the selected 10 pieces of audio feature information and every piece of audio feature information in the cluster it belongs to is greater than 0.85.
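The thresholded grouping in the example above can be sketched as a greedy clustering over pairwise similarities; cosine similarity and the greedy assignment order are assumptions, since the patent leaves the concrete similarity measure and clustering algorithm open.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def cluster_by_similarity(features: list[np.ndarray], max_sim: float = 0.85) -> list[list[int]]:
    """Greedy clustering: a sub-segment joins an existing cluster only if its
    similarity to every member of that cluster exceeds the maximum similarity
    threshold; otherwise it starts a new cluster."""
    clusters: list[list[int]] = []
    for i, feat in enumerate(features):
        for cluster in clusters:
            if all(cosine_similarity(feat, features[j]) > max_sim for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# e.g. cluster_by_similarity(features[:10], max_sim=0.85) for the 10 selected pieces
```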
After audio clustering, a plurality of clustered human voice segments are obtained, and each clustered human voice segment is the human voice segment of one user. Because the audio information of the same user is to be registered, and the audio segments corresponding to that user account for the largest share of the human voice audio segment, the clustered human voice segment with the longest duration among the clustered human voice segments can be determined. A target clustered human voice segment with the longest duration is determined from the plurality of clustered human voice segments, and a first digitized mark of the target clustered human voice segment is then acquired.
In some embodiments, after the target clustered human voice segment with the longest duration is determined as the clustered human voice segment of the same user, audio registration information of a relatively short duration is normally already sufficient to ensure the accuracy of subsequent audio recognition. The user can therefore preset the duration of the clustered human voice segment used for audio registration, so that a target audio segment with the preset duration is obtained from the target clustered human voice segment, and the first digitized mark is then obtained from the target audio segment.
For example, the user may preset the audio duration for audio registration to 60 s. After the clustered human voice segment of the user is determined, 60 s of audio is extracted from it, and the audio registration information of the user is then obtained from the 60 s of audio.
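Selecting the longest clustered human voice segment and trimming it to the preset duration could look like the sketch below; the sample rate is an assumption, and the 60 s default simply follows the example above.

```python
import numpy as np

def target_registration_audio(clusters: list[list[int]],
                              subsegments: list[np.ndarray],
                              sr: int = 16000,
                              preset_seconds: float = 60.0) -> np.ndarray:
    """Pick the cluster with the longest total duration (the target clustered human
    voice segment) and keep only the first preset_seconds of audio from it."""
    durations = [sum(len(subsegments[i]) for i in cluster) for cluster in clusters]
    target = clusters[int(np.argmax(durations))]
    audio = np.concatenate([subsegments[i] for i in target])
    return audio[: int(preset_seconds * sr)]
```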
Further, the first digitized mark of the first user is stored in a voiceprint library.
As shown in fig. 4A, fig. 4A is a schematic flowchart illustrating a second method for acquiring audio registration information according to an embodiment of the disclosure. This embodiment further expands and optimizes the above embodiment, and one possible implementation of S203 consists of the following steps S203a to S203d:
and S203a, extracting a second digital mark of the human voice audio fragment.
In some embodiments, after the human voice audio segments are clustered, the digitized mark of the first user is obtained as the second digitized mark from the clustered human voice segments of the plurality of users obtained by the clustering process.
S203b, obtaining at least one historical audio clip.
And the at least one historical audio clip is the audio clip of the first user acquired from the at least one historical conference audio respectively.
S203c, extracting the digitized mark from the at least one historical audio segment to obtain at least one third digitized mark.
In some embodiments, after the second digitized mark of the human voice audio segment is extracted, at least one historical conference audio is first obtained, a plurality of historical audio segments of the first user are obtained from the historical conference audio, human voice detection is then performed on the plurality of historical audio segments to obtain a historical human voice audio segment, and a third digitized mark is then obtained from the historical human voice audio segment. It should be noted that a plurality of third digitized marks can be acquired in this way.
S203d, determining the first digitized mark according to the second digitized mark and the third digitized mark, and using the first digitized mark as the audio registration information of the first user.
After the second digitized mark of the human voice audio segment and the third digitized mark of the historical audio segment are obtained, the second digitized mark and the third digitized mark are vectorized to obtain a second digitized mark vector and a third digitized mark vector, and these vectors are averaged to determine the first digitized mark as the audio registration information of the user.
Illustratively, 5 historical conference audio recordings are obtained and the digitized marks of the first user are extracted from them. The digitized mark obtained by the current processing and the 5 historical digitized marks are then vectorized to obtain 6 digitized mark vectors, the 6 vectors are averaged to obtain the average digitized mark, and the average is used as the first digitized mark of the first user. Updating the user's digitized mark in this way, in combination with the historical digitized marks, improves the accuracy of the user's first digitized mark.
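Treating the digitized marks as embedding vectors, the averaging in this example reduces to the short sketch below; equal-weight averaging is what the example describes, while the vector dimensions and function name are assumptions.

```python
import numpy as np

def first_digitized_mark(second_mark: np.ndarray,
                         third_marks: list[np.ndarray]) -> np.ndarray:
    """Average the current digitized mark with the digitized marks extracted from
    historical conference audio, e.g. 1 current + 5 historical = 6 vectors."""
    return np.stack([second_mark, *third_marks]).mean(axis=0)

# e.g. first_digitized_mark(current_mark, historical_marks) gives the registered mark
```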
Further, the first digitized mark, determined according to the second digitized mark and the third digitized mark, is registered as the audio registration information of the first user.
As shown in fig. 4B, fig. 4B is a third schematic flowchart of a method for acquiring audio registration information according to an embodiment of the disclosure, in which steps S203b to S203d may be replaced with step S2031.
S2031, determining the first digital mark according to the second digital mark and the saved third digital mark.
Wherein the saved third digitized label is historical audio registration information of the first user.
In some embodiments, in order to ensure the accuracy of the user's audio registration information so that voiceprint recognition can subsequently be performed for the user according to the audio registration information, the digitized mark of the user's corresponding historical audio segment in the voiceprint library can be obtained; the historical audio registration information of the user saved in the voiceprint library is used directly, the historical conference audio does not need to be processed again, and the tedious operation flow is reduced. The saved third digitized mark is acquired from the voiceprint library, and the first digitized mark is then calculated from the second digitized mark acquired from the human voice audio segment and the saved third digitized mark; in the calculation process, the second digitized mark and the saved third digitized mark are vectorized and averaged, the first digitized mark is finally obtained and determined as the audio registration information of the user, and it is used for subsequent voiceprint recognition of the user. It should be noted that the present disclosure does not limit the number of saved third digitized marks; for example, the first digitized mark is calculated according to the second digitized mark and 5 saved third digitized marks and is determined as the latest audio registration information of the first user.
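A minimal sketch of this variant, which reuses marks already saved in the voiceprint library, is shown below; the in-memory dictionary standing in for the voiceprint library and the function name are hypothetical.

```python
import numpy as np

# Hypothetical in-memory voiceprint library: user identity -> saved digitized marks.
voiceprint_library: dict[str, list[np.ndarray]] = {}

def register_from_saved_marks(user_id: str, second_mark: np.ndarray) -> np.ndarray:
    """Combine the newly extracted second digitized mark with the saved third
    digitized marks of this user and return the resulting first digitized mark."""
    saved = voiceprint_library.get(user_id, [])
    first_mark = np.stack([second_mark, *saved]).mean(axis=0) if saved else second_mark
    voiceprint_library[user_id] = saved + [first_mark]  # store the latest registration
    return first_mark
```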
According to the above embodiments, the digitized mark is determined from the historical conference audio, or the first user's digitized mark is calculated according to the saved digitized marks, so that the audio registration information of the user is determined from a combination of multiple pieces of data, which improves the matching degree of the audio registration information and facilitates the user's voiceprint registration. In summary, by acquiring the audio segments of the user from the conference audio, then performing human voice detection to obtain the user's human voice audio segment, and then acquiring a digitized mark from the human voice audio segment as the audio registration information of the user, the user does not need to repeatedly read a specified text or register according to a fixed flow in a scene where the user simply speaks normally in a meeting. On one hand, this improves the convenience of audio registration for the user; on the other hand, it avoids the difference between the voiceprint features at registration time and the voiceprint features when the user speaks normally, thereby improving the accuracy of voiceprint recognition and the adaptability to the application scene.
As shown in fig. 5, fig. 5 is a block diagram of an apparatus for acquiring audio registration information according to an embodiment of the present disclosure, where the apparatus includes:
an obtaining module 501, configured to obtain multiple first audio segments of the same user from conference audio;
a detecting module 502, configured to perform voice detection on the multiple first audio segments, and obtain a voice audio segment from the multiple first audio segments;
the registering module 503 is configured to obtain the first digital mark according to the voice audio clip, and register the first digital mark as the audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, the detecting module 502 is specifically configured to splice a plurality of first audio segments into a combined audio segment;
and perform voice detection on the combined audio segment to obtain a human voice audio segment;
or, alternatively,
perform voice detection on each first audio segment in the plurality of first audio segments to obtain a plurality of human voice segments, and splice the plurality of human voice segments into the human voice audio segment.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 503 is specifically configured to divide the voice audio segment into a plurality of voice audio sub-segments;
extract the audio feature information of each human voice audio sub-segment in the plurality of human voice audio sub-segments, and perform audio clustering according to the audio feature information of each human voice audio sub-segment to obtain a plurality of clustered human voice segments;
determine a target clustered human voice segment with the longest duration from the plurality of clustered human voice segments;
and acquire a first digital mark of the target clustered human voice segment.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 503 is specifically configured to obtain a target audio segment with a preset duration from the target clustered human voice segments;
a first digitized mark of a target audio segment is obtained.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 503 is specifically configured to extract a second digital mark of the human voice audio segment;
acquiring at least one historical audio clip;
extracting the digital mark from the at least one historical audio clip to obtain at least one third digital mark, wherein the at least one historical audio clip is the audio clip of the first user respectively obtained from at least one historical conference audio;
the first digitized mark is determined based on the second digitized mark and the third digitized mark.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 503 is specifically configured to extract a second digital mark of the human voice audio segment;
and determining the first digital mark according to the second digital mark and the stored third digital mark, wherein the stored third digital mark is historical audio registration information of the first user.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining module 501 is specifically configured to obtain a conference record corresponding to the conference audio, and display the conference record; the conference record comprises: a plurality of user identities displayed in association with conference subtitles generated based on a plurality of audio segments of the conference audio;
and responding to the selection operation aiming at the identification of the target user, and acquiring a plurality of first audio segments associated with the identification of the target user from a plurality of audio segments of the conference audio as a plurality of first audio segments of the same user, wherein the identification of the target user is any one user identification in a plurality of user identifications.
In summary, the apparatus for acquiring audio registration information first acquires the audio segments of the user from the conference audio, then performs human voice detection to obtain the user's human voice audio segment, and then acquires a digitized mark from the human voice audio segment as the audio registration information of the user. In a scene where the user simply speaks normally in a meeting, the user therefore does not need to repeatedly read a specified text or register according to a fixed flow. On one hand, this improves the convenience of audio registration for the user; on the other hand, it avoids the difference between the voiceprint features at registration time and the voiceprint features when the user speaks normally, thereby improving the accuracy of voiceprint recognition and the adaptability to the application scene.
Fig. 6 is a structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device includes: a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements the respective processes of the method of acquiring audio registration information in the above method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the method for acquiring audio registration information in the foregoing method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
An embodiment of the present disclosure provides a computer program product storing a computer program. When executed by a processor, the computer program implements each process of the method for acquiring audio registration information in the foregoing method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the media.
In the present disclosure, the processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In the present disclosure, the memory may include a volatile memory in a computer-readable medium, a Random Access Memory (RAM), and/or a nonvolatile memory, such as a Read-Only Memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
In the present disclosure, computer-readable media include permanent and non-permanent, removable and non-removable storage media. A storage medium may implement information storage by any method or technology, and the information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Likewise, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes merely exemplary embodiments of the present disclosure and is provided to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for acquiring audio registration information, comprising:
acquiring a plurality of first audio segments of a same user from conference audio;
performing human voice detection on the plurality of first audio segments, and acquiring a human voice audio segment from the plurality of first audio segments;
and acquiring a first digitized mark according to the human voice audio segment, and registering the first digitized mark as audio registration information of the first user.
2. The method of claim 1, wherein the performing human voice detection on the plurality of first audio segments and acquiring the human voice audio segment from the plurality of first audio segments comprises:
splicing the plurality of first audio segments into a combined audio segment, and performing human voice detection on the combined audio segment to obtain the human voice audio segment;
or,
performing human voice detection on each first audio segment in the plurality of first audio segments to obtain a plurality of human voice segments, and splicing the plurality of human voice segments into the human voice audio segment.
3. The method of claim 1, wherein the acquiring the first digitized mark according to the human voice audio segment comprises:
dividing the human voice audio segment into a plurality of human voice audio sub-segments;
extracting audio feature information of each human voice audio sub-segment in the plurality of human voice audio sub-segments, and performing audio clustering according to the audio feature information of each human voice audio sub-segment to obtain a plurality of clustered human voice segments;
determining a target clustered human voice segment with the longest duration from the plurality of clustered human voice segments;
and acquiring the first digitized mark of the target clustered human voice segment.
4. The method of claim 3, wherein the acquiring the first digitized mark of the target clustered human voice segment comprises:
acquiring a target audio segment with a preset duration from the target clustered human voice segment;
and acquiring the first digitized mark of the target audio segment.
5. The method of claim 1, wherein the acquiring the first digitized mark according to the human voice audio segment comprises:
extracting a second digitized mark of the human voice audio segment;
acquiring at least one historical audio segment;
extracting a digitized mark from the at least one historical audio segment to obtain at least one third digitized mark, wherein the at least one historical audio segment is an audio segment of the first user respectively acquired from at least one historical conference audio;
and determining the first digitized mark according to the second digitized mark and the at least one third digitized mark.
6. The method of claim 1, wherein the acquiring the first digitized mark according to the human voice audio segment comprises:
extracting a second digitized mark of the human voice audio segment;
and determining the first digitized mark according to the second digitized mark and a stored third digitized mark, wherein the stored third digitized mark is historical audio registration information of the first user.
7. The method of claim 1, wherein the acquiring the plurality of first audio segments of the same user from the conference audio comprises:
acquiring a conference record corresponding to the conference audio, and displaying the conference record, wherein the conference record comprises a plurality of displayed user identities associated with conference subtitles generated based on a plurality of audio segments of the conference audio;
and, in response to a selection operation on a target user identity, acquiring a plurality of first audio segments associated with the target user identity from the plurality of audio segments of the conference audio as the plurality of first audio segments of the same user, wherein the target user identity is any one of the plurality of user identities.
8. An apparatus for acquiring audio registration information, comprising:
an acquisition module, configured to acquire a plurality of first audio segments of a same user from conference audio;
a detection module, configured to perform human voice detection on the plurality of first audio segments and acquire a human voice audio segment from the plurality of first audio segments;
and a registration module, configured to acquire a first digitized mark according to the human voice audio segment and register the first digitized mark as audio registration information of the first user.
9. An electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method for acquiring audio registration information according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method for acquiring audio registration information according to any one of claims 1 to 7.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination