CN116504250A - Real-time speaker log generation method and device based on speaker registration information

Real-time speaker log generation method and device based on speaker registration information

Info

Publication number
CN116504250A
CN116504250A (application number CN202310320352.4A)
Authority
CN
China
Prior art keywords
speaker
real-time
voiceprint feature
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310320352.4A
Other languages
Chinese (zh)
Inventor
洪国强
肖龙源
李海洲
李稀敏
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202310320352.4A
Publication of CN116504250A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L17/16 Hidden Markov models [HMM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a real-time speaker log generation method and device based on speaker registration information, wherein the speaker registration information comprises a registered speaker state set and a corresponding pre-registered voiceprint feature vector sequence. Real-time voice data are acquired and framed, and a voiceprint feature vector is extracted from each frame of real-time voice by a voiceprint model to obtain a real-time voiceprint feature vector sequence. The real-time voiceprint feature vector sequence is input into an improved hidden Markov model, the state set is decoded and updated in real time, and the real-time speaker state set is determined from the registered speaker state set. All possible paths are generated from the real-time speaker state set, the probability of each speaker path is calculated from the pre-registered voiceprint feature vector sequence, and the current maximum-probability speaker path is output as the optimal speaker path. Using the speaker registration information and taking the context into account improves accuracy.

Description

Real-time speaker log generation method and device based on speaker registration information
Technical Field
The invention relates to the field of natural language processing, in particular to a real-time speaker log generation method and device based on speaker registration information.
Background
Speaker logging (speaker diarization) is a technique that marks, within a piece of speech, the start and end times of each person's speaking. A real-time speaker log must continuously determine which speaker the current speech segment belongs to as the speech is being collected, whereas an offline speaker log processes a segment of speech that has already been collected.
Current real-time speaker log techniques require the voiceprints of the speakers who may appear in the speech to be registered in advance. The collected speech is then framed, voiceprint features are extracted for each frame, each frame's voiceprint features are compared with the registered voiceprints, and the frame is assigned to the most similar speaker. The drawback of this approach is that the assignment mechanism is simple and does not fully consider the context, so the effect is mediocre and the accuracy is low.
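For contrast with the method described below, a minimal sketch of such a frame-wise assignment baseline (the cosine-similarity measure and all names here are assumptions for illustration, not taken from any particular prior-art system):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two voiceprint feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def baseline_assign(frame_vectors, registered):
    """Frame-wise baseline: assign each frame to the most similar registered speaker.

    frame_vectors: per-frame voiceprint feature vectors x_1..x_t
    registered:    dict {speaker_id: pre-registered voiceprint r_n}
    """
    labels = []
    for x in frame_vectors:
        best = max(registered, key=lambda spk: cosine_similarity(x, registered[spk]))
        labels.append(best)  # each frame is decided independently; no context is used
    return labels
```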
Disclosure of Invention
In order to solve the technical problems mentioned above, an objective of the embodiments of the present application is to provide a real-time speaker log generation method and device based on speaker registration information.
In a first aspect, the present invention provides a method for generating a real-time speaker log based on speaker registration information, including the steps of:
s1, acquiring speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence;
s2, acquiring real-time voice data, framing the real-time voice data, and extracting voiceprint feature vectors from each frame of real-time voice through a voiceprint model to obtain a real-time voiceprint feature vector sequence;
s3, inputting the real-time voiceprint feature vector sequence into an improved hidden Markov model, setting an initial real-time speaker state set as a null set, decoding and updating the state set in real time, and determining the real-time speaker state set according to the registered speaker state set;
s4, generating all possible paths according to the real-time speaker state collection, calculating the probability of each speaker path according to the preregistered voiceprint feature vector sequence, and outputting the current maximum probability speaker path as the optimal speaker path.
Preferably, calculating the probability of each speaker path according to the pre-registered voiceprint feature vector sequence specifically includes:
the real-time speaker state set is Y = {y_1, y_2, y_3, ..., y_t}, y_t ∈ Q, where Q is the registered speaker state set, Q = {1, 2, 3, ..., n}, and n denotes the n-th registered speaker;
assuming the probability of the current path is P, the probability after the path jumps to the next state is P_new = P × transition probability × generation probability;
the transition probability p(y_t | y_{t-1}) is determined according to whether the real-time speaker state y_t at time t is equal to the real-time speaker state y_{t-1} at time t-1;
the generation probability p(x_t | s_m) is calculated according to how many of the voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker; the real-time voiceprint feature vector sequence is X = {x_1, x_2, x_3, ..., x_t}, where x_t is the voiceprint feature vector newly generated at time t and s_m is the voiceprint feature vector of the m-th real-time speaker.
Preferably, the transition probability p(y_t | y_{t-1}) is calculated as follows:
when y_t = y_{t-1}, p(y_t | y_{t-1}) = loopProb; when y_t ≠ y_{t-1}, p(y_t | y_{t-1}) = 1 - loopProb, where loopProb is the self-loop probability with value range (0, 1).
Preferably, the generation probability p(x_t | s_m) is calculated as follows:
at time t, when none of the real-time voiceprint feature vectors of the first t-1 frames belongs to the voiceprint feature vector set of the m-th real-time speaker (s_m is the zero vector), the generation probability is computed from x_t and the pre-registered voiceprint feature vector r_m of the m-th registered speaker, m ∈ [1, n], where F_a and F_c are hyperparameters with value range (0, +∞);
at time t, when real-time voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, the generation probability is computed from x_t, the real-time voiceprint feature vector s_m and the pre-registered voiceprint feature vector r_m, m ∈ [1, n], where F_b and F_c are hyperparameters.
Preferably, the decoding in step S3 employs the Viterbi algorithm.
Preferably, the pre-registered voiceprint feature vector sequence R = {r_1, r_2, ..., r_n} is the set of pre-registered voiceprint feature vectors generated from the voice data of the n registered speakers collected in advance; S_m = {s_1, s_2, ..., s_m}, m ∈ [1, n], denotes the set of voiceprint feature vectors of the m real-time speakers generated and updated during the conversation.
Preferably, at time t, the voiceprint feature vector of the m-th real-time speaker is calculated from the first t-1 frames, where α is a hyperparameter.
In a second aspect, the present invention provides a real-time speaker log generating apparatus based on speaker registration information, including:
the registration information acquisition module is configured to acquire speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence;
the real-time voiceprint feature acquisition module is configured to acquire real-time voice data, frame the real-time voice data, extract voiceprint feature vectors from each frame of real-time voice through a voiceprint model, and obtain a real-time voiceprint feature vector sequence;
the real-time speaker state set acquisition module is configured to input the real-time voiceprint feature vector sequence into the improved hidden Markov model, set the initial real-time speaker state set as an empty set, decode and update the state set in real time, and determine the real-time speaker state set according to the registered speaker state set;
and the path output module is configured to generate all possible paths according to the real-time speaker state collection, calculate the probability of each speaker path according to the pre-registered voiceprint feature vector sequence, and output the current maximum probability speaker path as the optimal speaker path.
In a third aspect, the present invention provides an electronic device comprising one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) In the real-time speaker log generation method based on speaker registration information provided by the invention, different ways of calculating the generation probability are chosen according to whether the voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, and the speaker registration information is taken into account during assignment, so the calculation is more accurate.
(2) The voiceprint feature vector sets of the m real-time speakers are generated and updated in real time during the dialogue; when voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, both the real-time speaker's voiceprint feature vector and the registered speaker's pre-registered voiceprint feature vector are considered in the generation probability formula, so the assignment result can be optimized with context information.
(3) The method uses the existing speaker registration information and takes the context into account, which improves assignment accuracy, prevents speakers other than the registered speakers from appearing in the optimal speaker path, and achieves a better result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary device architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of a real-time speaker log generation method based on speaker registration information according to an embodiment of the present application;
FIG. 3 is a decoding flow diagram of a real-time speaker log generation method based on speaker registration information according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a real-time speaker log generation device based on speaker registration information according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device suitable for use in implementing the embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 illustrates an exemplary device architecture 100 to which the real-time speaker log generation method based on speaker registration information or the real-time speaker log generation device based on speaker registration information of the embodiments of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications, such as a data processing class application, a file processing class application, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, such as a background data processing server processing files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired file or data to generate a processing result.
It should be noted that, the method for generating a real-time speaker log based on speaker registration information provided in the embodiment of the present application may be executed by the server 105, or may be executed by the terminal devices 101, 102, 103, and accordingly, the apparatus for generating a real-time speaker log based on speaker registration information may be set in the server 105, or may be set in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above-described apparatus architecture may not include a network, but only a server or terminal device.
Fig. 2 shows a real-time speaker log generation method based on speaker registration information according to an embodiment of the present application, including the following steps:
s1, acquiring speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence.
In a specific embodiment, the method is built on an improved (modified) Hidden Markov Model (HMM). The pre-registered voiceprint feature vector sequence R = {r_1, r_2, ..., r_n} is the set of pre-registered voiceprint feature vectors generated from the voice data of the n registered speakers collected in advance; S_m = {s_1, s_2, ..., s_m}, m ∈ [1, n], denotes the set of voiceprint feature vectors of the m real-time speakers generated and updated during the conversation.
In a specific embodiment, at time t, the voiceprint feature vector of the m-th real-time speaker is calculated from the first t-1 frames, where α is a hyperparameter.
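The update formula itself is given as a display equation in the original filing and is not reproduced in this text; purely as an illustration of what a running, α-weighted update of s_m from the frames already assigned to speaker m could look like (an assumed form, not the patent's formula):

```python
import numpy as np

def update_speaker_vector(s_m_prev: np.ndarray, x_t: np.ndarray, alpha: float) -> np.ndarray:
    """Illustrative running update of the m-th real-time speaker's voiceprint vector.

    Assumed exponential-moving-average form weighted by the hyperparameter alpha;
    the exact formula in the original filing is not reproduced here.
    """
    return alpha * s_m_prev + (1.0 - alpha) * x_t
```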
Specifically, if there are n known registered speakers, the registered speaker state set is Q = {1, 2, 3, ..., n}, where n denotes the n-th registered speaker; the pre-registered voiceprint feature vector of the n-th registered speaker is r_n, and the resulting pre-registered voiceprint feature vector sequence is R = {r_1, r_2, ..., r_n}.
S2, acquiring real-time voice data, framing the real-time voice data, and extracting voiceprint feature vectors from each frame of real-time voice through a voiceprint model to obtain a real-time voiceprint feature vector sequence.
Specifically, the input sequence (observation sequence) of the improved hidden Markov model is X = {x_1, x_2, x_3, ..., x_t}, where t is the current time. X is obtained by framing a segment of real-time voice data and extracting a voiceprint feature vector from each frame of real-time voice with a voiceprint model, forming the real-time voiceprint feature vector sequence. The window length and window shift of the framing are set as required and can typically be set to 1.5 s and 0.25 s respectively. The voiceprint model can be a d-vector, x-vector, ResNet or similar model.
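A minimal sketch of this step, assuming 16 kHz mono audio, the 1.5 s window and 0.25 s shift mentioned above, and a hypothetical `voiceprint_model` callable standing in for a d-vector, x-vector or ResNet extractor:

```python
import numpy as np

def frame_audio(samples: np.ndarray, sample_rate: int = 16000,
                window_s: float = 1.5, shift_s: float = 0.25):
    """Split real-time audio samples into overlapping frames (window 1.5 s, shift 0.25 s)."""
    win, hop = int(window_s * sample_rate), int(shift_s * sample_rate)
    return [samples[i:i + win] for i in range(0, max(len(samples) - win, 0) + 1, hop)]

def extract_sequence(samples: np.ndarray, voiceprint_model):
    """Build the real-time voiceprint feature vector sequence X = {x_1, ..., x_t}.

    `voiceprint_model` is a placeholder for any embedding extractor
    (d-vector, x-vector, ResNet, ...), assumed to map one frame to a 1-D vector.
    """
    return [np.asarray(voiceprint_model(frame)) for frame in frame_audio(samples)]
```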
S3, inputting the real-time voiceprint feature vector sequence into an improved hidden Markov model, setting an initial real-time speaker state set as a null set, decoding and updating the state set in real time, and determining the real-time speaker state set according to the registered speaker state set.
In a specific embodiment, the decoding in step S3 employs the Viterbi algorithm.
Specifically, the initial real-time speaker state set is empty. As the input sequence is processed, new speaker labels found in the registered speaker state set are added to the real-time speaker state set, and the voiceprint feature vector set S_m of the real-time speakers is updated accordingly, i.e. S_m is generated and updated during the dialogue. The initial state probability is π = {1}; this value is 1 because the initial real-time speaker state set is empty. The real-time speaker state set is determined from the registered speaker state set, which restricts the values that can appear in the real-time speaker state set; since the registered speaker information is known, the context information is taken into account and the calculation result can be more accurate.
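A small sketch of how candidate speaker paths grow from the initially empty state set while staying restricted to the registered speaker state set Q (names are illustrative; the probability terms used to rank these paths are covered by the later sketches):

```python
def candidate_paths(paths, registered_states):
    """Extend every current speaker path by one frame.

    `paths` is a list of label sequences; the first call uses [[]] (the empty
    path) because the initial real-time speaker state set is empty. Each path
    can only continue with a label drawn from the registered speaker state set Q,
    so no speaker outside the registration information can ever be output.
    """
    return [path + [state] for path in paths for state in registered_states]

# Two registered speakers: 2 candidate paths after frame 1, 4 after frame 2, ...
print(candidate_paths([[]], [1, 2]))        # [[1], [2]]
print(candidate_paths([[1], [2]], [1, 2]))  # [[1, 1], [1, 2], [2, 1], [2, 2]]
```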
S4, generating all possible paths according to the real-time speaker state collection, calculating the probability of each speaker path according to the preregistered voiceprint feature vector sequence, and outputting the current maximum probability speaker path as the optimal speaker path.
In a specific embodiment, calculating the probability of each speaker path according to the pre-registered voiceprint feature vector sequence specifically includes:
the real-time speaker state set is Y = {y_1, y_2, y_3, ..., y_t}, y_t ∈ Q, where Q is the registered speaker state set, Q = {1, 2, 3, ..., n}, and n denotes the n-th registered speaker;
assuming the probability of the current path is P, the probability after the path jumps to the next state (new speaker) is P_new = P × transition probability × generation probability;
the transition probability p(y_t | y_{t-1}) is determined according to whether the real-time speaker state y_t at time t is equal to the real-time speaker state y_{t-1} at time t-1;
the generation probability p(x_t | s_m) is calculated according to how many of the voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker; the real-time voiceprint feature vector sequence is X = {x_1, x_2, x_3, ..., x_t}, where x_t is the voiceprint feature vector newly generated at time t and s_m is the voiceprint feature vector of the m-th real-time speaker.
In a particular embodiment, the transition probability p(y_t | y_{t-1}) is calculated as follows:
when y_t = y_{t-1}, p(y_t | y_{t-1}) = loopProb; when y_t ≠ y_{t-1}, p(y_t | y_{t-1}) = 1 - loopProb, where loopProb is the self-loop probability with value range (0, 1).
Specifically, if the real-time speaker state at time t is equal to the real-time speaker state at time t-1, then p(y_t | y_{t-1}) = loopProb; if they are not equal, p(y_t | y_{t-1}) = 1 - loopProb, with loopProb between 0 and 1.
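This rule translates directly into code (the `None` case covers the very first frame, for which the initial state probability is 1 as noted above):

```python
def transition_prob(y_t, y_prev, loop_prob: float) -> float:
    """p(y_t | y_{t-1}): loopProb when the speaker stays the same, 1 - loopProb otherwise.

    loop_prob is the self-loop probability, expected to lie in (0, 1).
    """
    if y_prev is None:   # first frame: initial state probability is 1
        return 1.0
    return loop_prob if y_t == y_prev else 1.0 - loop_prob
```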
In a particular embodiment, the generation probability p(x_t | s_m) is calculated as follows:
at time t, when none of the real-time voiceprint feature vectors of the first t-1 frames belongs to the voiceprint feature vector set of the m-th real-time speaker (s_m is the zero vector), the generation probability is computed from x_t and the pre-registered voiceprint feature vector r_m of the m-th registered speaker, m ∈ [1, n], where F_a and F_c are hyperparameters with value range (0, +∞);
at time t, when real-time voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, the generation probability is computed from x_t, the real-time voiceprint feature vector s_m and the pre-registered voiceprint feature vector r_m, m ∈ [1, n], where F_b and F_c are hyperparameters.
Specifically, in calculating the generation probability it is necessary to determine how many of the real-time voiceprint feature vectors of the first t-1 frames of real-time speech belong to the voiceprint feature vector set S_m of the m-th real-time speaker. When no frame's real-time voiceprint feature vector belongs to S_m, s_m is the zero vector and the generation probability is computed from x_t and the pre-registered voiceprint feature vector r_m, m ∈ [1, n]; when real-time voiceprint feature vectors of framed real-time speech belong to S_m, the generation probability is computed from x_t, s_m and r_m, m ∈ [1, n]. The speaker registration information is thus considered when calculating the assignment, making the calculation more accurate. The number of all possible paths generated from the real-time speaker state set is m^t; at frame t, the path with the highest probability is taken as the decoding result.
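The two closed-form expressions for p(x_t | s_m) are given as display equations in the original filing and are not reproduced in this text. The sketch below therefore keeps only the case split described above and substitutes a squashed cosine-similarity score as a stand-in scoring function; the functional form, like the roles of the parameters, is an assumption rather than the patent's formula:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def generation_prob(x_t, s_m, r_m, f_a, f_b, f_c):
    """Stand-in for p(x_t | s_m), following only the case split described in the text.

    If no earlier frame has been assigned to speaker m, s_m is the zero vector and
    only the pre-registered voiceprint r_m is used (with F_a, F_c); otherwise both
    s_m and r_m are used (with F_b, F_c). The sigmoid-of-similarity form below is
    an assumed placeholder, not the formula from the original filing.
    """
    if not np.any(s_m):  # no frame assigned to the m-th real-time speaker yet
        score = f_a * cosine(x_t, r_m) + f_c
    else:
        score = f_b * (cosine(x_t, s_m) + cosine(x_t, r_m)) / 2.0 + f_c
    return 1.0 / (1.0 + np.exp(-score))  # squash to (0, 1)
```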
The hyperparameters loopProb, F_a, F_b, F_c and α are optimized by grid search: a set of labelled test data is prepared and the optimal combination of values is sought.
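A minimal grid-search sketch for these hyperparameters, assuming a labelled test set and a `diarization_error` scoring function are available (both are hypothetical placeholders):

```python
import itertools

def grid_search(labelled_data, diarization_error, grids):
    """Try every combination of candidate hyperparameter values and keep the best.

    `grids` maps each hyperparameter name (e.g. "loopProb", "F_a", "F_b", "F_c",
    "alpha") to a list of candidate values; `diarization_error(params, data)` is
    assumed to return a score on the marked test data where lower is better.
    """
    names = list(grids)
    best_params, best_err = None, float("inf")
    for combo in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, combo))
        err = diarization_error(params, labelled_data)
        if err < best_err:
            best_params, best_err = params, err
    return best_params
```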
DETAILED DESCRIPTION OF EMBODIMENT(S) OF INVENTION
Assume a dialogue consists of 3 frames of voice data with corresponding real-time voiceprint feature vector sequence {x_1, x_2, x_3}, where the true speaker order is {1, 2, 1} (this order is unknown to the system). The dialogue involves two people whose registered voices are collected in advance, generating the pre-registered voiceprint feature vector sequence R = {r_1, r_2}; in addition, the registered speaker state set is Q = {1, 2}.
The decoding flow chart is shown in FIG. 3, where each column corresponds to one frame of input and each circle represents the speaker that the corresponding frame jumps to. There are 2 speakers in total in the decoding, corresponding to {1, 2}.
1. Input x_1. The possible paths Y are {1} and {2}, 2 in total; the probability of each path is calculated and S_m is updated. Output the optimal speaker path: Y = {1}. At this point S_m = {s_1}.
2. Input x_2. The possible paths Y are {1,1}, {1,2}, {2,1}, {2,2}, 4 in total; the probability of each path is calculated and S_m is updated. Output the optimal speaker path: Y = {1,2}. A new speaker appears at this point, so S_m = {s_1, s_2}.
3. Input x_3. The possible paths Y are {1,1,1}, {1,1,2}, {1,2,1}, {1,2,2}, {2,1,1}, {2,1,2}, {2,2,1}, {2,2,2}, 8 in total; the probability of each path is calculated and S_m is updated. Output the optimal speaker path: Y = {1,2,1}. At this point S_m = {s_1, s_2}.
Specifically, every time a new frame of real-time voice data is input, decoding generates all possible paths, the probability of each path is calculated, and the path with the current maximum probability is output as the optimal speaker path. The actual computation follows the HMM: S_m is computed according to the assignments made so far and therefore influences the assignment of the current real-time voiceprint feature vector x, and S_m changes as real-time voiceprint feature vectors of new real-time speech are input, which is how context information is incorporated into the real-time speaker log generation.
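Tying the sketches above together, an exhaustive (m^t-path) decoding loop for a toy version of the three-frame example; it reuses `transition_prob`, `generation_prob` and `update_speaker_vector` from the earlier sketches, so the probability values, like those helpers, are assumed stand-ins rather than the patent's exact computation:

```python
import numpy as np
from itertools import product

def decode(X, R, loop_prob, f_a, f_b, f_c, alpha):
    """Score all m^t possible speaker paths and return the maximum-probability one.

    X: real-time voiceprint feature vectors [x_1, ..., x_t]
    R: {speaker_label: pre-registered voiceprint r_n}
    A practical real-time system would extend and prune paths frame by frame
    (Viterbi-style) instead of enumerating them all at the end, as done here.
    """
    labels = sorted(R)
    best_path, best_p = None, -1.0
    for path in product(labels, repeat=len(X)):           # all possible speaker paths
        p, prev = 1.0, None
        S = {m: np.zeros_like(R[m]) for m in labels}       # per-path real-time vectors s_m
        for x_t, y_t in zip(X, path):
            p *= transition_prob(y_t, prev, loop_prob)
            p *= generation_prob(x_t, S[y_t], R[y_t], f_a, f_b, f_c)
            # first assignment initialises s_m; later ones apply the running update
            S[y_t] = x_t if not np.any(S[y_t]) else update_speaker_vector(S[y_t], x_t, alpha)
            prev = y_t
        if p > best_p:
            best_path, best_p = list(path), p
    return best_path

# Toy two-speaker, three-frame run (vectors and hyperparameter values are arbitrary):
r1, r2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
X = [np.array([0.9, 0.1]), np.array([0.1, 0.9]), np.array([0.8, 0.2])]
print(decode(X, {1: r1, 2: r2}, loop_prob=0.8, f_a=10.0, f_b=10.0, f_c=-5.0, alpha=0.5))
# prints [1, 2, 1] for these toy vectors and settings
```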
With further reference to fig. 4, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a real-time speaker log generating apparatus based on speaker registration information, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
The embodiment of the application provides a real-time speaker log generation device based on speaker registration information, which comprises the following components:
the registration information acquisition module 1 is configured to acquire speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence;
the real-time voiceprint feature acquisition module 2 is configured to acquire real-time voice data, frame the real-time voice data, extract voiceprint feature vectors from each frame of real-time voice through a voiceprint model, and obtain a real-time voiceprint feature vector sequence;
a real-time speaker state set acquisition module 3 configured to input a real-time voiceprint feature vector sequence into the improved hidden markov model, set an initial real-time speaker state set as an empty set, decode an updated state set in real time, and determine a real-time speaker state set from the registered speaker state set;
the path output module 4 is configured to generate all possible paths according to the real-time speaker state collection, calculate the probability of each speaker path according to the pre-registered voiceprint feature vector sequence, and output the current maximum probability speaker path as the optimal speaker path.
Referring now to fig. 5, there is illustrated a schematic diagram of a computer apparatus 500 suitable for use in implementing an electronic device (e.g., a server or terminal device as illustrated in fig. 1) of an embodiment of the present application. The electronic device shown in fig. 5 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 5, the computer apparatus 500 includes a Central Processing Unit (CPU) 501 and a Graphics Processor (GPU) 502, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 503 or a program loaded from a storage section 509 into a Random Access Memory (RAM) 504. In the RAM 504, various programs and data required for the operation of the apparatus 500 are also stored. The CPU 501, GPU502, ROM 503, and RAM 504 are connected to each other through a bus 505. An input/output (I/O) interface 506 is also connected to bus 505.
The following components are connected to the I/O interface 506: an input section 507 including a keyboard, a mouse, and the like; an output section 508 including a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 509 including a hard disk or the like; and a communication section 510 including a network interface card such as a LAN card, a modem, or the like. The communication section 510 performs communication processing via a network such as the Internet. A drive 511 may also be connected to the I/O interface 506 as needed. A removable medium 512, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 511 as necessary, so that a computer program read therefrom is installed into the storage section 509 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 510, and/or installed from the removable media 512. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 501 and a Graphics Processor (GPU) 502.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented by software, or may be implemented by hardware. The described modules may also be provided in a processor.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence; acquiring real-time voice data, framing the real-time voice data, and extracting voiceprint feature vectors from each frame of real-time voice through a voiceprint model to obtain a real-time voiceprint feature vector sequence; inputting the real-time voiceprint feature vector sequence into an improved hidden Markov model, setting an initial real-time speaker state set as a null set, decoding and updating the state set in real time, and determining the real-time speaker state set according to the registered speaker state set; all possible paths are generated according to the real-time speaker state collection, the probability of each speaker path is calculated according to the pre-registered voiceprint feature vector sequence, and the current maximum probability speaker path is used as the optimal speaker path to be output.
The foregoing description is only of the preferred embodiments of the present application and is presented as a description of the principles of the technology being utilized. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, but is intended to cover other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example embodiments in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (10)

1. A real-time speaker log generation method based on speaker registration information is characterized by comprising the following steps:
s1, acquiring speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence;
s2, acquiring real-time voice data, framing the real-time voice data, and extracting voiceprint feature vectors from each frame of real-time voice through a voiceprint model to obtain a real-time voiceprint feature vector sequence;
s3, inputting the real-time voiceprint feature vector sequence into an improved hidden Markov model, setting an initial real-time speaker state set as an empty set, decoding and updating the state set in real time, and determining the real-time speaker state set according to the registered speaker state set;
s4, generating all possible paths according to the real-time speaker state collection, calculating the probability of each speaker path according to the pre-registered voiceprint feature vector sequence, and outputting the current maximum probability speaker path as an optimal speaker path.
2. The real-time speaker log generation method based on speaker registration information according to claim 1, wherein calculating the probability of each speaker path according to the pre-registered voiceprint feature vector sequence specifically comprises:
the real-time speaker state set is Y = {y_1, y_2, y_3, ..., y_t}, y_t ∈ Q, where Q is the registered speaker state set, Q = {1, 2, 3, ..., n}, and n denotes the n-th registered speaker;
assuming the probability of the current path is P, the probability after the path jumps to the next state is P_new = P × transition probability × generation probability;
the transition probability p(y_t | y_{t-1}) is determined according to whether the real-time speaker state y_t at time t is equal to the real-time speaker state y_{t-1} at time t-1;
the generation probability p(x_t | s_m) is calculated according to how many of the voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, the real-time voiceprint feature vector sequence being X = {x_1, x_2, x_3, ..., x_t}, where x_t is the voiceprint feature vector newly generated at time t and s_m is the voiceprint feature vector of the m-th real-time speaker.
3. The real-time speaker log generation method based on speaker registration information according to claim 2, wherein the transition probability p(y_t | y_{t-1}) is calculated as follows:
when y_t = y_{t-1}, p(y_t | y_{t-1}) = loopProb; when y_t ≠ y_{t-1}, p(y_t | y_{t-1}) = 1 - loopProb, where loopProb is the self-loop probability with value range (0, 1).
4. The real-time speaker log generation method based on speaker registration information according to claim 2, wherein the generation probability p(x_t | s_m) is calculated as follows:
at time t, when none of the real-time voiceprint feature vectors of the first t-1 frames belongs to the voiceprint feature vector set of the m-th real-time speaker (s_m is the zero vector), the generation probability is computed from x_t and the pre-registered voiceprint feature vector r_m of the m-th registered speaker, m ∈ [1, n], where F_a and F_c are hyperparameters with value range (0, +∞);
at time t, when real-time voiceprint feature vectors of the first t-1 frames belong to the voiceprint feature vector set of the m-th real-time speaker, the generation probability is computed from x_t, the real-time voiceprint feature vector s_m and the pre-registered voiceprint feature vector r_m, m ∈ [1, n], where F_b and F_c are hyperparameters.
5. The real-time speaker log generation method based on speaker registration information according to claim 1, wherein the decoding in step S3 uses the Viterbi algorithm.
6. The real-time speaker log generation method based on speaker registration information according to claim 2, wherein the pre-registered voiceprint feature vector sequence R = {r_1, r_2, ..., r_n} is the set of pre-registered voiceprint feature vectors generated from the voice data of the n registered speakers collected in advance, and S_m = {s_1, s_2, ..., s_m}, m ∈ [1, n], denotes the set of voiceprint feature vectors of the m real-time speakers generated and updated during the conversation.
7. The real-time speaker log generation method based on speaker registration information according to claim 2, wherein, at time t, the voiceprint feature vector of the m-th real-time speaker is calculated from the first t-1 frames, with α being a hyperparameter.
8. A real-time speaker log generation apparatus based on speaker registration information, comprising:
the registration information acquisition module is configured to acquire speaker registration information, wherein the speaker registration information comprises a registered speaker state collection and a corresponding pre-registered voiceprint feature vector sequence;
the real-time voiceprint feature acquisition module is configured to acquire real-time voice data, frame the real-time voice data, extract voiceprint feature vectors from each frame of real-time voice through a voiceprint model, and obtain a real-time voiceprint feature vector sequence;
a real-time speaker state set acquisition module configured to input the real-time voiceprint feature vector sequence into an improved hidden markov model, set an initial real-time speaker state set as an empty set, decode and update the state set in real time, and determine the real-time speaker state set from a registered speaker state set;
and the path output module is configured to generate all possible paths according to the real-time speaker state collection, calculate the probability of each speaker path according to the pre-registered voiceprint feature vector sequence, and output the current maximum probability speaker path as the optimal speaker path.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
CN202310320352.4A 2023-03-29 2023-03-29 Real-time speaker log generation method and device based on speaker registration information Pending CN116504250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310320352.4A CN116504250A (en) 2023-03-29 2023-03-29 Real-time speaker log generation method and device based on speaker registration information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310320352.4A CN116504250A (en) 2023-03-29 2023-03-29 Real-time speaker log generation method and device based on speaker registration information

Publications (1)

Publication Number Publication Date
CN116504250A true CN116504250A (en) 2023-07-28

Family

ID=87325697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310320352.4A Pending CN116504250A (en) 2023-03-29 2023-03-29 Real-time speaker log generation method and device based on speaker registration information

Country Status (1)

Country Link
CN (1) CN116504250A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination