WO2020147256A1 - Method and apparatus for distinguishing conference content, computer device and storage medium - Google Patents

Method and apparatus for distinguishing conference content, computer device and storage medium

Info

Publication number
WO2020147256A1
WO2020147256A1 (PCT/CN2019/091098, CN2019091098W)
Authority
WO
WIPO (PCT)
Prior art keywords
conference
speaker
voice
segment
meeting
Prior art date
Application number
PCT/CN2019/091098
Other languages
English (en)
French (fr)
Inventor
胡燕
徐媛
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020147256A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155: Bayesian classification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device, computer equipment, and storage medium for distinguishing conference content.
  • the embodiments of the present application provide a method, device, computer equipment, and storage medium for distinguishing meeting content, so as to solve the problem that meeting content is difficult to distinguish efficiently.
  • the embodiments of the present application provide a method for distinguishing meeting content, including:
  • acquiring a target meeting voice segment, where the target meeting voice segment includes meeting voice segments of at least two different speakers;
  • acquiring speaker transition points of the target meeting voice segment, and cutting the target meeting voice segment at the speaker transition points to obtain at least three meeting voice segments, where one speaker corresponds to one or more of the meeting voice segments;
  • extracting segment voice features of the meeting voice segments, clustering the meeting voice segments according to the segment voice features, and determining the meeting voice segments of the same speaker;
  • determining the speaker identity of the meeting voice segments according to the meeting voice segments of the same speaker;
  • distinguishing the content of the meeting according to the speaker identities and the meeting voice segments of the same speaker.
  • an embodiment of the present application provides a device for distinguishing conference content, including:
  • the target segment acquisition module is configured to acquire a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers;
  • the meeting speech segment acquisition module is configured to acquire the speaker transition points of the target meeting speech segment, and cut the target meeting speech segment at the speaker transition points to obtain at least three meeting speech segments, where one speaker corresponds to one or more of the meeting speech segments;
  • the same-speaker voice segment determination module is configured to extract segment voice features of the meeting voice segments, cluster the meeting voice segments according to the segment voice features, and determine the meeting voice segments of the same speaker;
  • the speaker identity determination module is configured to determine the speaker identity of the meeting voice segments according to the meeting voice segments of the same speaker;
  • the distinguishing module is configured to distinguish the content of the meeting according to the speaker identities and the meeting voice segments of the same speaker.
  • in a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the foregoing method for distinguishing meeting content when executing the computer-readable instructions.
  • in a fourth aspect, the embodiments of the present application provide a computer non-volatile readable storage medium, including computer-executable instructions which, when executed, perform the method for distinguishing meeting content according to any one of the first aspect.
  • in the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
  • FIG. 1 is a flowchart of a method for distinguishing meeting content in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a device for distinguishing meeting content in an embodiment of the present application;
  • Fig. 3 is a schematic diagram of a computer device in an embodiment of the present application.
  • although the terms first, second, third, etc. may be used in the embodiments of the present application to describe preset ranges and the like, these preset ranges should not be limited by these terms; these terms are only used to distinguish the preset ranges from each other.
  • the first preset range may also be referred to as the second preset range, and similarly, the second preset range may also be referred to as the first preset range.
  • the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting".
  • the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
  • Fig. 1 shows a flowchart of the method for distinguishing meeting content in this embodiment.
  • the conference content distinguishing method can be applied in a conference content distinguishing system.
  • the system can be used to efficiently distinguish conference content in both on-site meetings and online meetings.
  • the conference content distinguishing system can specifically be applied on computer equipment.
  • the computer device is a device that can perform human-computer interaction with the user, including but not limited to devices such as computers, smart phones, and tablets.
  • the method for distinguishing conference content includes the following steps:
  • S10 Acquire a target meeting voice segment, where the target meeting voice segment includes meeting voice segments of at least two different speakers.
  • the target conference voice segment includes meeting voice segments of at least two different speakers. It should be noted that a segment including the meeting voice segments of at least two different speakers yields at least three meeting voice segments when cut; otherwise there is no need to distinguish the content of the meeting. This means that the target meeting voice segment is composed of meeting voice segments uttered by at least two different speakers.
  • the target meeting voice segment is a mixed voice segment.
  • one purpose of this solution is to separate the mixed meeting voice segments of different speakers and determine the meeting voice segments corresponding to each speaker in the target meeting voice segment.
  • step S10 obtaining a target conference voice segment specifically includes:
  • S11 Acquire an original conference voice segment.
  • the original conference voice segment refers to the voice information recorded at the conference using a recording device.
  • the conference can be an on-site meeting that participants attend in person, an online meeting that participants attend over the Internet, or an online meeting established over the network that participants also attend on-site.
  • the format of the meeting is not limited here.
  • in the conference content distinguishing system, when the conference is held in the form of an on-site meeting, the speeches of different speakers at the conference are collected through a recording device connected to the computer equipment or an embedded recording device. Collection is continuous in time, so the silent periods during the meeting (periods when no one speaks) are also collected.
  • the voice information recorded during the conference is the original conference voice fragment.
  • the original conference voice segment includes conference voice segments uttered by different speakers at different times, and also includes silent segments in which no one speaks.
  • when the conference is held online, for example in a WeChat group meeting, the recording module of a mobile device is used to collect the voice information and obtain the original meeting voice segment. Participants often forget part of the meeting content after an online meeting and thus cannot achieve a good meeting outcome; therefore, the original meeting voice segment can be processed to distinguish the meeting content, so that participants can review the required meeting content at any time.
  • S12 Use silence detection to remove the silent segment in the original conference voice segment to obtain the target conference voice segment.
  • silence detection refers to the detection of the silent segments in the original conference voice segment in which no one speaks.
  • silent segments are voice segments in which no speaker is speaking.
  • the technology of Voice Activity Detection (VAD) may be used, including approaches based on frame amplitude, frame energy, short-term zero-crossing rate, and deep neural networks. The silent speech segments in the original segment are thereby accurately removed and the meeting voice segments in which a speaker is speaking are retained, which eliminates the interference of the silent segments in the original conference voice segment and provides an important technical basis for improving the efficiency and accuracy of distinguishing conference content.
  • in particular, when the conference is held online, the silent segments can be removed by setting a short-term energy threshold on the detected voice information.
  • the target meeting voice segment can then be obtained directly by judging whether the short-term energy value is greater than the preset threshold, as sketched below.
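  • As an illustration of the energy-based silence removal just described, the following is a minimal Python sketch, assuming 16 kHz mono PCM samples in a NumPy array; the frame length, hop length, and energy threshold are illustrative assumptions, not parameters fixed by this application.

```python
import numpy as np

def remove_silence(samples: np.ndarray, sample_rate: int = 16000,
                   frame_ms: int = 25, hop_ms: int = 10,
                   energy_threshold: float = 1e-4) -> np.ndarray:
    """Keep only frames whose short-term energy exceeds a preset threshold."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = float(np.mean(frame ** 2))   # short-term energy of this frame
        if energy > energy_threshold:         # judged as speech, not silence
            voiced.append(frame[:hop_len])    # keep the non-overlapping part
    return np.concatenate(voiced) if voiced else np.array([], dtype=samples.dtype)
```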
  • S20 Obtain a speaker transition point of the target conference voice segment, and cut the target conference voice segment according to the speaker transition point to obtain at least three conference voice segments, where one speaker corresponds to one or more conference voice segments.
  • the speaker transition point refers to the junction point between the conference voice segments of different speakers in the target conference voice segment.
  • for example, if the conference voice segment of speaker A and the conference voice segment of speaker B are adjacent in the target conference voice segment, the junction point between the two is a speaker transition point.
  • the speaker transition points of the target conference speech segment are acquired, specifically, by detecting them on the target conference speech segment based on the Bayesian information criterion and the likelihood ratio, where the Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, corrects the probability of occurrence using the Bayes formula, and finally makes the optimal decision using the expected value and the corrected probability.
  • Likelihood ratio (LR) is an indicator that reflects authenticity.
  • the feature points on the target conference speech segment can be compared, and the likelihood ratio between the feature points can be calculated based on the Bayesian information criterion, to judge whether a feature point is a speaker transition point; a sketch of such a change-point test follows.
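  • The following sketch illustrates one common form of BIC change-point scoring over a window of feature vectors (for example MFCC frames); a positive ΔBIC at the best split suggests a speaker transition. The penalty weight and the margin are tuning assumptions, and this is a generic formulation rather than the exact detector of this application.

```python
import numpy as np

def logdet_cov(x: np.ndarray) -> float:
    """Log-determinant of the (regularized) covariance of a frame matrix."""
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return float(np.linalg.slogdet(cov)[1])

def delta_bic(window: np.ndarray, split: int, lam: float = 1.0) -> float:
    """Delta-BIC for modeling the window as two Gaussians split at `split`
    versus one Gaussian; positive values favor a change point."""
    n, d = window.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * split * logdet_cov(window[:split])
            - 0.5 * (n - split) * logdet_cov(window[split:])
            - lam * penalty)

def find_change_point(window: np.ndarray, margin: int = 20):
    """Return the best split index if Delta-BIC is positive there, else None."""
    scores = {t: delta_bic(window, t) for t in range(margin, len(window) - margin)}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```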
  • the conference speech fragment obtained by cutting corresponds to a certain speech of a certain speaker, that is, a conference speech fragment obtained by cutting belongs to a certain speaker, and cannot belong to multiple speakers at the same time.
  • the conference voice segment mentioned in this embodiment is a voice segment cut according to a speaker transition point and corresponding to a certain speech of a certain speaker.
  • the target conference voice segment including the conference voice segments of at least two different speakers can be reasonably cut, so that each resulting conference voice segment comes from only one speaker.
  • S30 Extract the segment voice features of the conference voice segment, cluster the conference voice segments according to the segment voice features, and determine the conference voice segments of the same speaker.
  • the segment voice feature is extracted from the conference voice segment and represents the voice feature of the conference voice segment.
  • by extracting the segment voice features of the conference voice segments, it can be determined, according to the degree of similarity between the segment voice features, which speakers uttered the conference voice segments.
  • the conference voice segments are clustered according to the voice features of the segments, and the conference voice segments are classified into different categories according to the voice features of the segments, and each category actually corresponds to a speaker.
  • the conference voice fragments of the same speaker can be determined, and the conference voice fragments are classified by category, so that the conference voice fragments from the same speaker are classified into the same category.
  • step S30 it specifically includes:
  • S311 Extract the i-vector feature from the conference speech segment as the segment speech feature through the pre-trained general background model and the Gaussian mixture model;
  • the features extracted from the conference speech segments can be i-vector features.
  • the i-vector feature is a compact feature vector extracted, based on the Universal Background Model (UBM), from the mean supervector of the Gaussian Mixture Model (GMM).
  • besides the speaker's identity information, the i-vector feature also carries information about the vocal tract, microphone, speaking style, voice, etc., and can comprehensively reflect the voiceprint characteristics of the sound.
  • the result of clustering with i-vector features is more accurate, which can improve the accuracy of the clustering results; a simplified sketch of this front end follows.
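  • As a simplified sketch of the UBM-GMM front end: the code below trains a small diagonal-covariance UBM with scikit-learn, MAP-adapts its means to one segment, and flattens them into the mean supervector that an i-vector extractor would compress. Training a total-variability matrix is out of scope here, so a random projection stands in purely for illustration; all sizes and the relevance factor are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats: np.ndarray, n_components: int = 8) -> GaussianMixture:
    """Fit a small diagonal-covariance UBM on pooled background features."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    return ubm.fit(background_feats)

def mean_supervector(ubm: GaussianMixture, segment_feats: np.ndarray,
                     relevance: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means to one segment and stack them into a supervector."""
    resp = ubm.predict_proba(segment_feats)        # (frames, components)
    n_k = resp.sum(axis=0)                         # soft frame counts per component
    f_k = resp.T @ segment_feats                   # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]     # data-vs-prior adaptation weight
    adapted = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return adapted.ravel()

def pseudo_ivector(supervector: np.ndarray, dim: int = 32,
                   seed: int = 0) -> np.ndarray:
    """Stand-in for i-vector extraction: a random matrix replaces the trained
    total-variability matrix, for illustration only."""
    t_matrix = np.random.default_rng(seed).standard_normal((dim, supervector.size))
    return t_matrix @ supervector
```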
  • S312 Use the pre-trained double-covariance probability linear discriminant model to model the i-vector feature to obtain the feature expression model of the conference speech segment;
  • in segment speech feature recognition, the double-covariance probabilistic linear discriminant analysis model is used to extract speaker information from the i-vectors, allowing segment speech features to be compared and distinguished.
  • the double-covariance probabilistic linear discriminant analysis model assumes that the i-vector is generated from two other quantities: a speaker vector y and a residual vector ε.
  • the residual vector ε represents the terms that are not related to the speaker.
  • the pre-trained double-covariance probability linear discriminant analysis model is used to model the voice characteristics of each segment, which can achieve a more accurate clustering effect when determining the conference voice segments of the same speaker.
  • before modeling, suppose the total number of speakers in the conference is S, and denote the i-vectors extracted from the conference speech segments as Φ = {φ_1, ..., φ_M}. For each conference speech segment m = 1, ..., M, define an indicator vector i_m of dimension S×1, where the element i_ms = 1 if speaker s speaks in segment m and i_ms = 0 otherwise. Let I = {i_1, ..., i_M} be the given set of indicator vectors for the conference speech segments, and assign the event that speaker s speaks in a segment the prior probability π_s. For each speaker s, the sample y_s ~ N(y; μ, Λ^{-1}), i.e. it follows a normal distribution with mean μ and covariance Λ^{-1}; for each conference speech segment, the sample i_m follows the multinomial distribution Mult(Π) with Π = (π_1, ..., π_S).
  • with these preconditions, the feature expression model is φ_m = y_k + ε_m, where φ_m denotes the i-vector feature extracted from the m-th conference speech segment, y denotes the speaker-associated vector of the segment (k is the index for which i_mk = 1, to distinguish it from the s in y_s above), and i_m denotes the indicator vector of the segment. The speaker-independent vector ε_m of the m-th conference speech segment follows a Gaussian distribution with mean 0 and covariance L^{-1}, i.e. ε_m ~ N(ε; 0, L^{-1}).
  • the two covariances in the double-covariance probabilistic linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the feature representation of each conference speech segment in the model; by establishing the feature expression model of each conference speech segment, the model can be used to determine the conference speech segments of the same speaker. A small generative sketch follows.
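  • The following is a minimal sketch of this double-covariance generative model: a segment's i-vector φ_m is a speaker vector y_k plus a residual ε_m, with y ~ N(μ, Λ^{-1}) and ε ~ N(0, L^{-1}). The covariance values are illustrative assumptions (in practice μ, Λ and L are trained), and the log-likelihood ratio shows how the model compares two i-vectors under same-speaker versus different-speaker hypotheses.

```python
import numpy as np

DIM = 4
mu = np.zeros(DIM)
between_cov = 2.0 * np.eye(DIM)   # Lambda^{-1}: speaker (between-class) covariance
within_cov = 0.5 * np.eye(DIM)    # L^{-1}: residual (within-class) covariance
rng = np.random.default_rng(1)

def sample_segment_ivector(speaker_vec: np.ndarray) -> np.ndarray:
    """Generate phi_m = y_k + eps_m for one segment of a given speaker."""
    return speaker_vec + rng.multivariate_normal(np.zeros(DIM), within_cov)

def log_gauss(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def same_speaker_llr(phi1: np.ndarray, phi2: np.ndarray) -> float:
    """Log-likelihood ratio for 'same speaker' vs 'different speakers', written
    in the decorrelated coordinates d = phi1 - phi2 and s = (phi1 + phi2) / 2."""
    d, s = phi1 - phi2, 0.5 * (phi1 + phi2)
    h_same = (log_gauss(d, np.zeros(DIM), 2 * within_cov)
              + log_gauss(s, mu, between_cov + 0.5 * within_cov))
    h_diff = (log_gauss(d, np.zeros(DIM), 2 * (between_cov + within_cov))
              + log_gauss(s, mu, 0.5 * (between_cov + within_cov)))
    return h_same - h_diff
```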
  • S313 Use the feature expression model to cluster the conference speech fragments, and determine the conference speech fragments of the same speaker.
  • the feature expression model may be used to cluster the conference speech segments with the variational Bayes algorithm (Variational Bayes, VB), an approximate posterior method that is locally optimal but yields a deterministic solution.
  • the posterior probabilities of the conference speech segments and the posterior probabilities of the speakers are obtained from the feature expression model and the variational Bayes algorithm, and are updated to obtain the posterior probability that a speaker has spoken in a conference speech segment, so as to determine the conference speech segments of the same speaker; a skeleton of this update loop is sketched below.
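  • A hedged skeleton of that update loop is shown below: it alternates between the posterior over which speaker produced each segment and a Gaussian posterior over each speaker's vector. For brevity it assumes spherical covariances and a uniform speaker prior; a full implementation would use the double-covariance model above.

```python
import numpy as np

def vb_cluster(ivectors: np.ndarray, n_speakers: int, n_iter: int = 20,
               within_var: float = 0.5, between_var: float = 2.0,
               seed: int = 0) -> np.ndarray:
    """Return the posterior probability that segment m belongs to speaker s."""
    rng = np.random.default_rng(seed)
    m, d = ivectors.shape
    resp = rng.dirichlet(np.ones(n_speakers), size=m)   # q(i_m), random start
    for _ in range(n_iter):
        # Update q(y_s): Gaussian posterior precision and mean per speaker.
        counts = resp.sum(axis=0)                       # soft segment counts
        prec = 1.0 / between_var + counts / within_var  # per-dimension precision
        means = (resp.T @ ivectors) / within_var / prec[:, None]
        # Update q(i_m): responsibilities from expected log-likelihoods.
        sq = ((ivectors[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        loglik = -0.5 * (sq + d / prec[None, :]) / within_var
        loglik -= loglik.max(axis=1, keepdims=True)     # numerical stability
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp
```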
  • before step S30, the method further includes:
  • S321 Initialize the number of speakers in the posterior probabilities of the conference speech segments, and pair up the different speakers in the posterior probabilities of the conference speech segments.
  • in an embodiment, the number of speakers in the posterior probabilities of the conference speech segments may specifically be initialized to 3.
  • S322 Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • in the double-covariance probabilistic linear discriminant analysis model, cosine similarity and/or likelihood-ratio scores can be used as the distance measure.
  • S323 Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probabilities of the conference speech segments, pairing up the different speakers in the posterior probabilities, and calculating the distance between each pair of speakers to obtain the two farthest speakers; obtain the two speakers that are farthest apart across the preset number of repetitions, and use them as the starting point of the variational Bayes computation.
  • this step repeats steps S321-S322 a preset number of times (for example, 20 times), and then takes the two speakers that are farthest apart across all repetitions as the starting point of the variational Bayes computation.
  • steps S321-S323 optimize the variational Bayes algorithm: they make the results obtained when the algorithm iterates with the expectation-maximization algorithm more accurate, and finally yield an accurate posterior probability that a speaker has spoken in the conference speech, so that the conference speech segments are better distinguished by speaker; a sketch of the initialization heuristic follows.
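  • The sketch below illustrates the initialization heuristic of steps S321-S323: repeat a randomized speaker assignment several times, measure the cosine distance between candidate speaker centroids, and keep the farthest pair as the starting point for the variational Bayes run. The restart count and speaker count mirror the examples above; everything else is an assumption for illustration.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def farthest_pair_init(ivectors: np.ndarray, n_speakers: int = 3,
                       n_restarts: int = 20, seed: int = 0):
    """Return the two farthest speaker centroids found across all restarts."""
    rng = np.random.default_rng(seed)
    best_pair, best_dist = None, -1.0
    for _ in range(n_restarts):
        # Random speaker assignment, then a centroid per non-empty speaker.
        labels = rng.integers(0, n_speakers, size=len(ivectors))
        centroids = [ivectors[labels == s].mean(axis=0)
                     for s in range(n_speakers) if np.any(labels == s)]
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                dist = cosine_distance(centroids[i], centroids[j])
                if dist > best_dist:
                    best_dist, best_pair = dist, (centroids[i], centroids[j])
    return best_pair
```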
  • S40 Determine the speaker identity of the conference voice clip according to the conference voice clip of the same speaker.
  • in step S30 the conference speech segments of the same speaker have been distinguished, but the identity of each speaker cannot yet be determined.
  • the speaker identity of the conference voice segments can be determined through the segment voice features of the conference voice segments, so as to distinguish the conference voice segments at the level of speaker identity.
  • determining the speaker identity of the meeting voice segment according to the meeting voice segment of the same speaker includes:
  • S411 Obtain a preset number of conference voice clips from the conference voice clips of each same speaker, and display them.
  • in an embodiment, for the clustered conference speech segments, a preset number of conference speech segments can be randomly selected from the conference speech segments of the same speaker; it is only necessary to ensure that at least one segment of each same speaker is selected. For example, suppose there are 3 speakers A, B, and C, where speaker A has 5 conference voice clips, speaker B has 10, and speaker C has 20; two conference voice clips can then be extracted for each of A, B, and C. The number of conference voice clips to obtain can be set in advance, and at least one conference voice clip must be guaranteed for each speaker. After the preset number of meeting voice clips are obtained, they are displayed; specifically, they may be displayed to participants who know the identities of the speakers at the meeting. The display may take the form of audio playback, and participants can determine the speaker identity corresponding to each displayed meeting voice clip from the displayed clip.
  • S412 In response to the display, obtain the speaker identity confirmation instruction, confirm the speaker identity of the preset number of conference voice clips according to the speaker identity confirmation instruction, and obtain the first confirmation result.
  • the speaker identity confirmation instruction is an instruction to confirm the speaker's identity.
  • the conference content distinguishing system obtains the speaker identity confirmation instruction input by the user and confirms the speaker identities of the preset number of conference voice clips. Understandably, after displaying at least one conference voice segment of each same speaker, the system confirms the speaker identity corresponding to each displayed segment according to the speaker identity confirmation instruction of the user (a participant).
  • S413 Determine the speaker identity of the conference voice clip according to the first confirmation result and the conference voice clip of the same speaker.
  • the first confirmation result corresponds to the displayed meeting speech segments. Since the meeting speech segments of the same speaker have already been grouped into the same category by clustering, the speaker identity of the conference speech segments can be determined directly from the first confirmation result and the meeting speech segments of the same speaker, and the speaker identities of all conference speech segments can be determined quickly, as sketched below.
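  • A small sketch of this propagation step, under assumed data shapes: once a participant has confirmed who spoke in the sampled clips of each cluster, the confirmed identity is applied to every segment in that cluster.

```python
def label_all_segments(cluster_of_segment: list[int],
                       confirmed: dict[int, str]) -> list[str]:
    """confirmed maps a cluster id to the speaker identity confirmed for it."""
    return [confirmed.get(cluster, "unknown") for cluster in cluster_of_segment]

# Example: clusters 0/1/2 were confirmed as speakers A, B and C from the
# displayed clips; every segment inherits its cluster's identity.
labels = label_all_segments([0, 0, 1, 2, 1], {0: "A", 1: "B", 2: "C"})
assert labels == ["A", "A", "B", "C", "B"]
```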
  • step S40 determining the speaker identity of the meeting voice segment according to the meeting voice segment of the same speaker, further includes:
  • S421 Obtain a preset number of conference speech fragments from the conference speech fragments of each same speaker, and input them into a pre-trained voiceprint recognition model.
  • compared with step S411, the preset number of meeting speech segments obtained from each same speaker's meeting speech segments are not displayed; instead, they are recognized automatically.
  • the preset number of conference speech segments are input into the voiceprint recognition model so that the model can automatically recognize the speaker identity of each conference speech segment.
  • S422 Recognizing a preset number of meeting voice clips through the voiceprint recognition model, confirming the speaker identity of the preset number of meeting voice clips, and obtaining a second confirmation result.
  • the voiceprint recognition model is used to automatically recognize the speaker identities of the preset number of conference speech clips. Understandably, the voiceprint recognition model is pre-trained, and the speaker identities of the participants need to be enrolled in advance (for example, a pre-enrolled voiceprint feature is bound to the corresponding speaker identity).
  • with voiceprint recognition, the conference content distinguishing system can directly confirm the speaker identities of the preset number of conference voice clips, without information interaction with the user. Understandably, when the meeting is held online, for example when several people participate in a WeChat group meeting, a user who has entered a personal speaker identity once can have the speaker identities of the preset number of meeting voice clips confirmed automatically at every meeting, without confirming the identity through information interaction each time.
  • recognition with the voiceprint recognition model is better suited to online meetings with a small number of participants, and enables fully automatic speaker identification; a matching sketch follows.
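  • The following is a hedged sketch of this automatic path: a sampled clip's embedding is compared against pre-enrolled voiceprints by cosine similarity, and the best match above a threshold is accepted. The embedding representation and the threshold are assumptions; the application does not fix a particular voiceprint model.

```python
from typing import Optional
import numpy as np

def identify_speaker(clip_embedding: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.7) -> Optional[str]:
    """Return the enrolled identity whose voiceprint best matches the clip,
    or None when no score reaches the acceptance threshold."""
    best_name, best_score = None, -1.0
    for name, voiceprint in enrolled.items():
        score = float(clip_embedding @ voiceprint /
                      (np.linalg.norm(clip_embedding) * np.linalg.norm(voiceprint)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```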
  • S423 Determine the speaker identity of the conference voice clip according to the second confirmation result and the conference voice clip of the same speaker.
  • the second confirmation result corresponds to the sampled meeting speech segments. Since the meeting speech segments of the same speaker have already been grouped into the same category by clustering, the speaker identity of the conference speech segments can be determined directly from the second confirmation result and the meeting speech segments of the same speaker, and the speaker identities of all conference speech segments can be determined quickly.
  • S50 Distinguish the content of the meeting according to the identity of the speaker and the meeting voice clips of the same speaker.
  • the content of the meeting consists of the speeches of the different speakers at the meeting, which are represented by the conference voice segments of the different speakers. Therefore, knowing the identity of the speakers and the meeting voice segments of the same speaker, it can be determined which speaker said what in the meeting, which achieves the purpose of distinguishing the content of the meeting.
  • step S50 includes inputting the meeting speech segments of the same speaker, by speaker identity, into the speech-to-text model to obtain the meeting content of different speakers, thereby realizing the distinction of meeting content; a grouping sketch follows.
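  • A minimal sketch of this grouping, where `transcribe` is a placeholder for any speech-to-text model rather than an API named by this application:

```python
from collections import defaultdict

def distinguish_content(segments, identities, transcribe):
    """Transcribe each segment and group the text by its speaker identity."""
    content = defaultdict(list)
    for clip, speaker in zip(segments, identities):
        content[speaker].append(transcribe(clip))
    return dict(content)

# Usage with a dummy transcriber:
minutes = distinguish_content(["clip1", "clip2", "clip3"], ["A", "B", "A"],
                              lambda clip: f"text of {clip}")
# {'A': ['text of clip1', 'text of clip3'], 'B': ['text of clip2']}
```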
  • step S50 it further includes:
  • the pre-trained deep neural network model and neural speech model are used to analyze the content of the meeting and generate meeting minutes and/or execution lists.
  • the deep neural network model and neural speech model are trained on a large number of meeting minutes and/or execution lists; having learned the deep features of meeting minutes and/or execution lists, they can analyze the meeting content in depth.
  • meeting minutes and/or execution lists are generated from the input meeting content.
  • generating meeting minutes and/or execution lists in this way requires no manual organization and can improve the efficiency of organizing meeting content.
  • in the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
  • the embodiments of the present application further provide device embodiments that implement the steps and methods in the foregoing method embodiments.
  • Fig. 2 shows a principle block diagram of a conference content distinguishing device corresponding one-to-one to the conference content distinguishing method in the embodiment.
  • the apparatus for distinguishing meeting content includes a target segment acquisition module 10, a meeting speech segment acquisition module 20, a same speaker speech segment determination module 30, a speaker identity determination module 40 and a distinguishing module 50.
  • the target segment acquisition module 10, the meeting speech segment acquisition module 20, the same-speaker speech segment determination module 30, the speaker identity determination module 40, and the distinguishing module 50 implement functions corresponding one-to-one to the steps of the method for distinguishing meeting content in the embodiment; to avoid repetition, this embodiment does not describe them in detail one by one.
  • the target segment acquisition module 10 is configured to acquire a target conference speech segment, where the target conference speech segment includes meeting speech segments of at least two different speakers.
  • the meeting voice segment acquisition module 20 is used to acquire the speaker transition points of the target meeting voice segment, and cut the target meeting voice segment at the speaker transition points to obtain at least three meeting voice segments, where one speaker corresponds to one or more meeting voice segments.
  • the voice segment determination module 30 of the same speaker is used to extract segment voice features of the conference voice segment, cluster the conference voice segments according to the segment voice features, and determine the conference voice segments of the same speaker.
  • the speaker identity determining module 40 is used to determine the speaker identity of the conference voice segment according to the conference voice segment of the same speaker.
  • the distinguishing module 50 is used for distinguishing the content of the meeting according to the identity of the speaker and the meeting voice clips of the same speaker.
  • the same speaker speech segment determination module 30 includes a segment speech feature extraction unit, a feature expression model acquisition unit, and a same speaker speech segment acquisition unit.
  • the segment speech feature extraction unit is used to extract i-vector features from the conference speech segment as the segment speech feature through the pre-trained general background model and the Gaussian mixture model.
  • the feature expression model acquisition unit is used to use a pre-trained double covariance probability linear discriminant model to model i-vector features to obtain a feature expression model of the conference speech segment.
  • the voice fragment acquisition unit of the same speaker is used to cluster the conference voice fragments using the feature expression model to determine the conference voice fragments of the same speaker.
  • the speaker identity determination module 40 includes a display unit, a first confirmation result acquisition unit, and a first speaker identity determination unit.
  • the display unit is used to obtain and display a preset number of conference voice clips from the conference voice clips of each same speaker.
  • the first confirmation result obtaining unit is configured to obtain the speaker identity confirmation instruction in response to the display, and confirm the speaker identity of the preset number of conference voice clips according to the speaker identity confirmation instruction, and obtain the first confirmation result.
  • the first speaker identity determining unit is used to determine the speaker identity of the conference voice segment according to the first confirmation result and the conference voice segment of the same speaker.
  • the speaker identity determining module 40 further includes an input unit, a second confirmation result obtaining unit, and a second speaker identity determining unit.
  • the input unit is used to obtain a preset number of conference speech fragments from the conference speech fragments of each same speaker, and input them into the pre-trained voiceprint recognition model.
  • the second confirmation result obtaining unit is configured to recognize a preset number of conference speech fragments through the voiceprint recognition model, confirm the speaker identity of the preset number of conference speech fragments, and obtain a second confirmation result.
  • the second speaker identity determining unit is used to determine the speaker identity of the conference voice segment according to the second confirmation result and the conference voice segment of the same speaker.
  • the distinguishing module 50 is specifically configured to input the meeting speech fragments of the same speaker into the speech-to-text model according to the identity of the speaker to obtain meeting content of different speakers.
  • the device for distinguishing meeting content further includes a generating unit for analyzing the meeting content using a pre-trained deep neural network model and a neural speech model to generate meeting minutes and/or execution lists.
  • the target segment acquisition module 10 includes an original conference speech segment acquisition unit and a target conference speech segment acquisition unit.
  • the original conference speech fragment obtaining unit is used to obtain the original conference speech fragment.
  • the target conference voice fragment acquisition unit is used to remove the silent fragments in the original conference voice fragment by using silence detection to obtain the target conference voice fragment.
  • in the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
  • This embodiment provides a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium stores computer-readable instructions.
  • when the computer-readable instructions are executed by a processor, the method for distinguishing meeting content in the embodiment is implemented; to avoid repetition, details are not repeated here.
  • alternatively, when the computer-readable instructions are executed by the processor, the functions of each module/unit in the apparatus for distinguishing meeting content in the embodiment are realized; to avoid repetition, they are not repeated here.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the computer device 60 of this embodiment includes a processor 61, a memory 62, and computer-readable instructions 63 stored in the memory 62 and executable on the processor 61.
  • when the computer-readable instructions 63 are executed by the processor 61, the method for distinguishing meeting content in the embodiment is implemented; to avoid repetition, it will not be repeated here.
  • alternatively, when the computer-readable instructions 63 are executed by the processor 61, the functions of each module/unit in the apparatus for distinguishing meeting content in the embodiment are realized; to avoid repetition, they will not be repeated here.
  • the computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device 60 may include, but is not limited to, a processor 61 and a memory 62.
  • FIG. 3 is only an example of the computer device 60 and does not constitute a limitation on the computer device 60; it may include more or fewer components than shown in the figure, or a combination of certain components, or different components.
  • computer equipment may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or memory of the computer device 60.
  • the memory 62 may also be an external storage device of the computer device 60, such as a plug-in hard disk equipped on the computer device 60, a smart media card (SMC), a Secure Digital (SD) card, a flash card, and so on.
  • the memory 62 may also include both an internal storage unit of the computer device 60 and an external storage device.
  • the memory 62 is used to store computer readable instructions and other programs and data required by the computer device.
  • the memory 62 can also be used to temporarily store data that has been output or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Emergency Alarm Devices (AREA)

Abstract

A method and apparatus for distinguishing conference content, a computer device, and a storage medium, relating to the field of artificial intelligence. The method for distinguishing conference content comprises: acquiring a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers (S10); acquiring speaker transition points of the target conference speech segment, and cutting the target conference speech segment at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more conference speech segments (S20); extracting segment speech features of the conference speech segments, clustering the conference speech segments according to the segment speech features, and determining the conference speech segments of the same speaker (S30); determining the speaker identity of the conference speech segments according to the conference speech segments of the same speaker (S40); and distinguishing the conference content according to the speaker identities and the conference speech segments of the same speaker (S50). This method enables conference content to be distinguished efficiently.

Description

Method and apparatus for distinguishing conference content, computer device and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201910038369.4, filed on January 16, 2019 and entitled "Method and apparatus for distinguishing conference content, computer device and storage medium".
[Technical Field]
This application relates to the field of artificial intelligence, and in particular to a method and apparatus for distinguishing conference content, a computer device, and a storage medium.
[Background]
Organizing conference content efficiently has long been a challenge. At present, most conference content is organized manually, and a small portion uses speech recognition technology, in which a machine recognizes the speakers' speech and converts it into a text record. However, the machine can only convert speech into text; it cannot distinguish and organize the conference content.
[Summary]
In view of this, the embodiments of the present application provide a method and apparatus for distinguishing conference content, a computer device, and a storage medium, so as to solve the problem that conference content is difficult to distinguish efficiently.
In a first aspect, an embodiment of the present application provides a method for distinguishing conference content, including:
acquiring a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers;
acquiring speaker transition points of the target conference speech segment, and cutting the target conference speech segment at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more of the conference speech segments;
extracting segment speech features of the conference speech segments, clustering the conference speech segments according to the segment speech features, and determining the conference speech segments of the same speaker;
determining the speaker identity of the conference speech segments according to the conference speech segments of the same speaker;
distinguishing the conference content according to the speaker identities and the conference speech segments of the same speaker.
In a second aspect, an embodiment of the present application provides an apparatus for distinguishing conference content, including:
a target segment acquisition module, configured to acquire a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers;
a conference speech segment acquisition module, configured to acquire speaker transition points of the target conference speech segment, and cut the target conference speech segment at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more of the conference speech segments;
a same-speaker speech segment determination module, configured to extract segment speech features of the conference speech segments, cluster the conference speech segments according to the segment speech features, and determine the conference speech segments of the same speaker;
a speaker identity determination module, configured to determine the speaker identity of the conference speech segments according to the conference speech segments of the same speaker;
a distinguishing module, configured to distinguish the conference content according to the speaker identities and the conference speech segments of the same speaker.
In a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the above method for distinguishing conference content when executing the computer-readable instructions.
In a fourth aspect, an embodiment of the present application provides a non-volatile computer-readable storage medium, including computer-executable instructions which, when executed, perform the method for distinguishing conference content according to any one of the first aspect.
One of the above technical solutions has the following beneficial effect:
In the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
[Brief Description of the Drawings]
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for distinguishing conference content in an embodiment of the present application;
Fig. 2 is a schematic diagram of an apparatus for distinguishing conference content in an embodiment of the present application;
Fig. 3 is a schematic diagram of a computer device in an embodiment of the present application.
[Detailed Description]
For a better understanding of the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms used in the embodiments of the present application are only for the purpose of describing particular embodiments and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between related objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe preset ranges and the like, these preset ranges should not be limited by these terms; the terms are only used to distinguish the preset ranges from one another. For example, without departing from the scope of the embodiments of the present application, a first preset range may also be referred to as a second preset range, and similarly, a second preset range may also be referred to as a first preset range.
Depending on the context, the word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 shows a flowchart of the method for distinguishing conference content in this embodiment. The method can be applied in a conference content distinguishing system, which can be used to distinguish conference content efficiently in both on-site and online meetings. The conference content distinguishing system can specifically be applied on a computer device, where the computer device is a device capable of human-computer interaction with the user, including but not limited to computers, smartphones and tablets. As shown in Fig. 1, the method for distinguishing conference content includes the following steps:
S10: Acquire a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers.
Understandably, the target conference speech segment includes conference speech segments of at least two different speakers. It should be noted that a segment including the conference speech segments of at least two different speakers yields at least three conference speech segments when cut; otherwise there is no need to distinguish the conference content. This means the target conference speech segment is composed of conference speech segments uttered by at least two different speakers; it is a mixed speech segment, and one purpose of this solution is to separate the mixed conference speech segments of different speakers and determine the conference speech segments corresponding to each speaker in the target conference speech segment.
In an embodiment, acquiring the target conference speech segment in step S10 specifically includes:
S11: Acquire an original conference speech segment.
In an embodiment, the original conference speech segment refers to the speech information recorded at the conference with a recording device, where the conference may be an on-site meeting that participants attend in person, an online meeting that participants attend over the Internet, or an online meeting established over the network that participants also attend on-site; the form in which the conference is held is not limited here.
Understandably, when the conference content distinguishing system is used and the conference is held as an on-site meeting, the speeches of different speakers at the conference are collected through a recording device connected to the computer device or an embedded recording device. Collection is continuous in time, so the silent periods during the conference (periods when no one speaks) are also collected. The speech information recorded during the conference is the original conference speech segment, which includes conference speech segments uttered by different speakers at different times as well as silent segments in which no one speaks.
Understandably, when the conference is held online, for example an online meeting conducted in a WeChat group, the recording module of a mobile device is used to collect the speech information and obtain the original conference speech segment. Participants often forget part of the conference content after an online meeting and thus cannot achieve a good meeting outcome; therefore, the original conference speech segment can be processed to distinguish the conference content, so that participants can review the conference content they need at any time.
S12: Remove the silent segments from the original conference speech segment using silence detection to obtain the target conference speech segment.
Here, silence detection refers to detecting the silent segments of the original conference speech segment in which no one speaks, and a silent segment is a speech segment in which no speaker is speaking. In an embodiment, Voice Activity Detection (VAD) techniques may be used, including approaches based on frame amplitude, frame energy, short-term zero-crossing rate, and deep neural networks. The silent speech segments in the original segment are thereby accurately removed and the conference speech segments in which a speaker is speaking are retained, which eliminates the interference of the silent segments in the original conference speech segment and provides an important technical basis for improving the efficiency and accuracy of distinguishing conference content.
In particular, when the conference is held online, the silent segments can be removed by setting a short-term energy threshold on the detected speech information; the target conference speech segment can then be obtained directly by judging whether the short-term energy value is greater than the preset threshold.
S20: Acquire speaker transition points of the target conference speech segment, and cut the target conference speech segment at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more conference speech segments.
Understandably, a speaker transition point is the junction between the conference speech segments of different speakers within the target conference speech segment; for example, if the conference speech segment of speaker A and that of speaker B are adjacent in the target conference speech segment, the junction between the two is a speaker transition point.
In an embodiment, the speaker transition points of the target conference speech segment are acquired, specifically, by detecting them on the target conference speech segment based on the Bayesian information criterion and the likelihood ratio. The Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, corrects the probability of occurrence with the Bayes formula, and finally makes the optimal decision using the expected value and the corrected probability. The likelihood ratio (LR) is an index that reflects authenticity. Understandably, the feature points on the target conference speech segment can be compared based on the Bayesian information criterion, and the likelihood ratio between the feature points can be calculated based on the criterion, so as to judge whether a feature point is a speaker transition point.
After the speaker transition points are obtained, the target conference speech segment is cut at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more conference speech segments. In this embodiment, each conference speech segment obtained by cutting corresponds to one utterance of one speaker; that is, a cut conference speech segment belongs to a single speaker and cannot belong to several speakers at the same time. Understandably, the conference speech segments referred to in this embodiment are speech segments cut at the speaker transition points, each corresponding to one utterance of one speaker.
In this embodiment, a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably, so that each resulting conference speech segment comes from only one speaker.
S30: Extract segment speech features of the conference speech segments, cluster the conference speech segments according to the segment speech features, and determine the conference speech segments of the same speaker.
Here, a segment speech feature is a speech feature extracted from a conference speech segment that represents the segment.
In an embodiment, by extracting the segment speech features of the conference speech segments, which speakers uttered the segments can be judged according to the degree of similarity between the segment speech features. Specifically, the conference speech segments are clustered according to the segment speech features and grouped into different categories, where each category actually corresponds to one speaker.
In this embodiment, the conference speech segments of the same speaker can be determined and the conference speech segments distinguished by category, so that conference speech segments from the same speaker fall into the same category.
Further, step S30 specifically includes:
S311: Extract i-vector features from the conference speech segments as the segment speech features through a pre-trained universal background model and Gaussian mixture model.
Specifically, the features extracted from the conference speech segments may be i-vector features. An i-vector feature is a compact feature vector extracted, based on the Universal Background Model (UBM), from the mean supervector of a Gaussian mixture model (GMM). Besides the speaker's identity information, the i-vector feature also carries information about the vocal tract, microphone, speaking style, voice, and so on, and can therefore comprehensively reflect the voiceprint characteristics of the sound. Clustering with i-vector features yields more accurate results and can improve the accuracy of the clustering outcome.
S312: Model the i-vector features with a pre-trained double-covariance probabilistic linear discriminant model to obtain a feature expression model of the conference speech segments.
Here, in segment speech feature recognition, the double-covariance probabilistic linear discriminant analysis model is used to extract speaker information from the i-vectors and allows segment speech features to be compared and distinguished. The model assumes that an i-vector is generated from two other quantities: a speaker vector y and a residual vector ε, where ε represents the terms unrelated to the speaker. Modeling each segment speech feature with the pre-trained double-covariance probabilistic linear discriminant analysis model achieves a more precise clustering effect when determining the conference speech segments of the same speaker.
Before modeling: suppose the total number of speakers during a conference is S, and denote the i-vectors extracted from the conference speech segments as Φ = {φ_1, ..., φ_M}. For each conference speech segment m = 1, ..., M, define an indicator vector i_m of dimension S×1, where the element i_ms = 1 if speaker s speaks in segment m and i_ms = 0 if speaker s does not speak in segment m. Let I = {i_1, ..., i_M} be the given set of indicator vectors for the conference speech segments. Taking as the event that speaker s speaks in a segment, assign this event a prior probability π_s.
For each speaker s, the sample y_s ~ N(y; μ, Λ^{-1}); that is, the sample of each speaker s follows a normal distribution with mean μ and covariance Λ^{-1}. For each conference speech segment, the sample i_m follows the multinomial distribution Mult(Π), where Π = (π_1, ..., π_S).
With these modeling preconditions, the feature expression model is φ_m = y_k + ε_m, where φ_m denotes the i-vector feature extracted from the m-th conference speech segment, y denotes the speaker-associated vector of the conference speech segment (to distinguish it from the s in y_s above, let k be the index for which i_mk = 1), and i_m denotes the indicator vector of the conference speech segment. The speaker-independent vector ε_m of the m-th conference speech segment follows a Gaussian distribution with mean 0 and covariance L^{-1}, i.e. ε_m ~ N(ε; 0, L^{-1}). The two covariances in the double-covariance probabilistic linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the feature representation of each conference speech segment in the double-covariance probabilistic linear discriminant analysis model. By establishing a feature expression model of each conference speech segment, the model can be used to determine the conference speech segments of the same speaker.
S313: Cluster the conference speech segments using the feature expression model, and determine the conference speech segments of the same speaker.
In an embodiment, clustering the conference speech segments with the feature expression model may specifically use the variational Bayes algorithm (Variational Bayes, VB), an approximate posterior method that is locally optimal but yields a deterministic solution.
In this embodiment, the posterior probabilities of the conference speech segments and the posterior probabilities of the speakers are obtained from the feature expression model and the variational Bayes algorithm and are updated to obtain the posterior probability that a speaker has spoken in a conference speech segment, thereby determining the conference speech segments of the same speaker.
Further, before step S30, the method further includes:
S321: Initialize the number of speakers in the posterior probabilities of the conference speech segments, and pair up the different speakers in the posterior probabilities of the conference speech segments.
In an embodiment, the number of speakers in the posterior probabilities of the conference speech segments may specifically be initialized to 3.
S322: Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
Here, in the double-covariance probabilistic linear discriminant analysis model, cosine similarity and/or likelihood-ratio scores can be used as the distance measure.
S323: Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probabilities of the conference speech segments, pairing up the different speakers in the posterior probabilities, and calculating the distance between each pair of speakers to obtain the two farthest speakers; obtain the two speakers that are farthest apart across the preset number of repetitions, and use them as the starting point of the variational Bayes computation.
Understandably, this step repeats steps S321-S322 a preset number of times (for example, 20 times), and then takes the two speakers that are farthest apart across all repetitions as the starting point of the variational Bayes computation.
Steps S321-S323 optimize the variational Bayes algorithm: they make the results obtained when the algorithm iterates with the expectation-maximization algorithm more accurate, and finally yield an accurate posterior probability that a speaker has spoken in the conference speech, so that the conference speech segments are better distinguished by speaker.
S40: Determine the speaker identity of the conference speech segments according to the conference speech segments of the same speaker.
Understandably, the conference speech segments of the same speaker have been separated in step S30, but the speaker identities cannot yet be determined. In this embodiment, given the conference speech segments of the same speaker, the speaker identity of the conference speech segments can be determined through the segment speech features of the conference speech segments, thereby distinguishing the conference speech segments at the level of speaker identity.
Further, in step S40, determining the speaker identity of the conference speech segments according to the conference speech segments of the same speaker includes:
S411: Acquire a preset number of conference speech segments from the conference speech segments of each same speaker, and display them.
In an embodiment, for the clustered conference speech segments, a preset number of conference speech segments can be drawn at random from the conference speech segments of the same speaker; it is only necessary to ensure that at least one segment of each same speaker is drawn. For example, suppose there are 3 speakers A, B and C in total, where speaker A has 5 conference speech segments, speaker B has 10, and speaker C has 20; two conference speech segments can then be drawn for each of A, B and C. The number of conference speech segments to acquire can be preset, and at least one conference speech segment must be guaranteed for each speaker. After the preset number of conference speech segments are acquired, they are displayed; specifically, they may be displayed to participants who know the identities of the speakers at the conference. The display may take the form of audio playback, and the participants can determine the speaker identity corresponding to each displayed conference speech segment from the displayed segment.
Understandably, this approach requires no speaker identities to be stored in advance. In particular, when it is unclear which participants will speak, storing the speakers' identities would require collecting the voiceprint features of all participants in advance to determine the speaker identities. That would significantly increase the workload and require pre-training of the voiceprint recognition model, and not every participant has time to take part in identity enrollment, which makes it unsuitable for large conferences and on-site conferences. Displaying the conference speech segments as in this embodiment is more flexible and efficient.
S412: In response to the display, acquire a speaker identity confirmation instruction, confirm the speaker identities of the preset number of conference speech segments according to the speaker identity confirmation instruction, and obtain a first confirmation result.
Here, the speaker identity confirmation instruction is an instruction for confirming the identity of a speaker.
In an embodiment, the conference content distinguishing system acquires the speaker identity confirmation instruction input by the user and confirms the speaker identities of the preset number of conference speech segments. Understandably, after displaying at least one conference speech segment of each same speaker, the system confirms the speaker identity corresponding to each displayed segment according to the speaker identity confirmation instruction of the user (a participant).
S413: Determine the speaker identity of the conference speech segments according to the first confirmation result and the conference speech segments of the same speaker.
Understandably, the first confirmation result corresponds to the displayed conference speech segments. Since the conference speech segments of the same speaker have already been grouped into the same category by clustering, the speaker identity of the conference speech segments can be determined directly from the first confirmation result and the conference speech segments of the same speaker, and the speaker identities of all conference speech segments can be determined quickly.
Further, in step S40, determining the speaker identity of the conference speech segments according to the conference speech segments of the same speaker further includes:
S421: Acquire a preset number of conference speech segments from the conference speech segments of each same speaker, and input them into a pre-trained voiceprint recognition model.
In an embodiment, compared with step S411, the preset number of conference speech segments acquired from the conference speech segments of each same speaker are not displayed; instead, automatic recognition is used, and the acquired preset number of conference speech segments are input into the voiceprint recognition model so that the model automatically recognizes the speaker identities of the conference speech segments.
S422: Recognize the preset number of conference speech segments through the voiceprint recognition model, confirm the speaker identities of the preset number of conference speech segments, and obtain a second confirmation result.
In an embodiment, the voiceprint recognition model automatically recognizes the speaker identities of the preset number of conference speech segments. Understandably, the voiceprint recognition model is pre-trained, and the speaker identities of the participants need to be enrolled in advance (for example, a pre-enrolled voiceprint feature is bound to the corresponding speaker identity). With voiceprint recognition, the conference content distinguishing system can confirm the speaker identities of the preset number of conference speech segments directly, without information interaction with the user. Understandably, when the conference is held online, for example when several people take part in a WeChat group meeting, a user who has enrolled a personal speaker identity once can have the speaker identities of the preset number of conference speech segments confirmed automatically at every meeting, without confirming the identity through information interaction with the user each time. Recognition with the voiceprint recognition model is better suited to online meetings with few participants and enables fully automatic speaker identity confirmation.
S423: Determine the speaker identity of the conference speech segments according to the second confirmation result and the conference speech segments of the same speaker.
Understandably, the second confirmation result corresponds to the sampled conference speech segments. Since the conference speech segments of the same speaker have already been grouped into the same category by clustering, the speaker identity of the conference speech segments can be determined directly from the second confirmation result and the conference speech segments of the same speaker, and the speaker identities of all conference speech segments can be determined quickly.
S50: Distinguish the conference content according to the speaker identities and the conference speech segments of the same speaker.
In an embodiment, the conference content is the speeches of the different speakers at the conference, and these speeches are represented by the conference speech segments of the different speakers. Therefore, knowing the speaker identities and the conference speech segments of the same speaker, it can be determined which speaker said what at the conference, which achieves the purpose of distinguishing the conference content.
Specifically, step S50 includes inputting the conference speech segments of the same speaker, by speaker identity, into a speech-to-text model to obtain the conference content of the different speakers, thereby distinguishing the conference content.
Further, after step S50, the method further includes:
analyzing the conference content with a pre-trained deep neural network model and neural speech model to generate meeting minutes and/or an execution list.
Understandably, the deep neural network model and neural speech model are trained on a large number of meeting minutes and/or execution lists; having learned the deep features of meeting minutes and/or execution lists, they can analyze the conference content in depth and generate meeting minutes and/or an execution list from the input conference content. Generating meeting minutes and/or execution lists in this way requires no manual organization and can improve the efficiency of organizing conference content.
In the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Based on the method for distinguishing conference content provided in the embodiments, the embodiments of the present application further provide apparatus embodiments that implement the steps and methods in the above method embodiments.
Fig. 2 shows a principle block diagram of an apparatus for distinguishing conference content in one-to-one correspondence with the method for distinguishing conference content in the embodiment. As shown in Fig. 2, the apparatus for distinguishing conference content includes a target segment acquisition module 10, a conference speech segment acquisition module 20, a same-speaker speech segment determination module 30, a speaker identity determination module 40, and a distinguishing module 50. The functions implemented by these modules correspond one-to-one to the steps of the method for distinguishing conference content in the embodiment; to avoid redundancy, this embodiment does not describe them in detail one by one.
The target segment acquisition module 10 is configured to acquire a target conference speech segment, where the target conference speech segment includes conference speech segments of at least two different speakers.
The conference speech segment acquisition module 20 is configured to acquire speaker transition points of the target conference speech segment, and cut the target conference speech segment at the speaker transition points to obtain at least three conference speech segments, where one speaker corresponds to one or more conference speech segments.
The same-speaker speech segment determination module 30 is configured to extract segment speech features of the conference speech segments, cluster the conference speech segments according to the segment speech features, and determine the conference speech segments of the same speaker.
The speaker identity determination module 40 is configured to determine the speaker identity of the conference speech segments according to the conference speech segments of the same speaker.
The distinguishing module 50 is configured to distinguish the conference content according to the speaker identities and the conference speech segments of the same speaker.
Optionally, the same-speaker speech segment determination module 30 includes a segment speech feature extraction unit, a feature expression model acquisition unit, and a same-speaker speech segment acquisition unit.
The segment speech feature extraction unit is configured to extract i-vector features from the conference speech segments as the segment speech features through a pre-trained universal background model and Gaussian mixture model.
The feature expression model acquisition unit is configured to model the i-vector features with a pre-trained double-covariance probabilistic linear discriminant model to obtain the feature expression model of the conference speech segments.
The same-speaker speech segment acquisition unit is configured to cluster the conference speech segments using the feature expression model and determine the conference speech segments of the same speaker.
Optionally, the speaker identity determination module 40 includes a display unit, a first confirmation result acquisition unit, and a first speaker identity determination unit.
The display unit is configured to acquire a preset number of conference speech segments from the conference speech segments of each same speaker and display them.
The first confirmation result acquisition unit is configured to acquire, in response to the display, a speaker identity confirmation instruction, confirm the speaker identities of the preset number of conference speech segments according to the speaker identity confirmation instruction, and obtain a first confirmation result.
The first speaker identity determination unit is configured to determine the speaker identity of the conference speech segments according to the first confirmation result and the conference speech segments of the same speaker.
Optionally, the speaker identity determination module 40 further includes an input unit, a second confirmation result acquisition unit, and a second speaker identity determination unit.
The input unit is configured to acquire a preset number of conference speech segments from the conference speech segments of each same speaker and input them into a pre-trained voiceprint recognition model.
The second confirmation result acquisition unit is configured to recognize the preset number of conference speech segments through the voiceprint recognition model, confirm the speaker identities of the preset number of conference speech segments, and obtain a second confirmation result.
The second speaker identity determination unit is configured to determine the speaker identity of the conference speech segments according to the second confirmation result and the conference speech segments of the same speaker.
Optionally, the distinguishing module 50 is specifically configured to input the conference speech segments of the same speaker, by speaker identity, into a speech-to-text model to obtain the conference content of different speakers.
Optionally, the apparatus for distinguishing conference content further includes a generating unit configured to analyze the conference content with a pre-trained deep neural network model and neural speech model to generate meeting minutes and/or an execution list.
Optionally, the target segment acquisition module 10 includes an original conference speech segment acquisition unit and a target conference speech segment acquisition unit.
The original conference speech segment acquisition unit is configured to acquire an original conference speech segment.
The target conference speech segment acquisition unit is configured to remove the silent segments from the original conference speech segment using silence detection to obtain the target conference speech segment.
In the embodiments of the present application, the acquired target conference speech segment is first cut at the speaker transition points to obtain at least three conference speech segments, so that a target conference speech segment that includes conference speech segments of at least two different speakers can be cut reasonably and each resulting conference speech segment comes from a single speaker; then the segment speech features of the conference speech segments are extracted, the conference speech segments are clustered according to the similarity expressed by the segment speech features, the conference speech segments of the same speaker are determined from the clustering result, and the conference speech segments are distinguished by category; finally, the speaker identity corresponding to each conference speech segment is determined from the conference speech segments of the same speaker, so that the speaker identities and the conference speech segments of the same speaker determine which part of the conference content each conference speech segment belongs to, achieving efficient distinction of the conference content.
This embodiment provides a non-volatile computer-readable storage medium storing computer-readable instructions. When the computer-readable instructions are executed by a processor, the method for distinguishing conference content in the embodiment is implemented; to avoid repetition, details are not described again here. Alternatively, when the computer-readable instructions are executed by a processor, the functions of the modules/units of the apparatus for distinguishing conference content in the embodiment are implemented; to avoid repetition, details are not described again here.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application. As shown in Fig. 3, the computer device 60 of this embodiment includes a processor 61, a memory 62, and computer-readable instructions 63 stored in the memory 62 and executable on the processor 61. When the computer-readable instructions 63 are executed by the processor 61, the method for distinguishing conference content in the embodiment is implemented; to avoid repetition, details are not described again here. Alternatively, when the computer-readable instructions 63 are executed by the processor 61, the functions of the models/units of the apparatus for distinguishing conference content in the embodiment are implemented; to avoid repetition, details are not described again here.
The computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device 60 may include, but is not limited to, the processor 61 and the memory 62. Those skilled in the art will understand that Fig. 3 is merely an example of the computer device 60 and does not constitute a limitation on it; the device may include more or fewer components than shown, or combine certain components, or use different components. For example, the computer device may also include input/output devices, network access devices, buses, and so on.
The processor 61 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 60. Further, the memory 62 may include both an internal storage unit and an external storage device of the computer device 60. The memory 62 is used to store the computer-readable instructions and other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is used only as an example; in practical applications, the above functions may be assigned to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions for some of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. 一种会议内容区分方法,其特征在于,所述方法包括:
    获取目标会议语音片段,其中,所述目标会议语音片段包括至少两个不同发言人的会议语音片段;
    获取所述目标会议语音片段的发言人转变点,根据所述发言人转变点切割所述目标会议语音片段,得到至少三个会议语音片段,其中,一个所述发言人对应一个或多个所述会议语音片段;
    提取所述会议语音片段的片段语音特征,根据所述片段语音特征对所述会议语音片段进行聚类,确定相同发言人的会议语音片段;
    根据所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份;
    根据所述发言人身份和所述相同发言人的会议语音片段区分会议内容。
  2. 根据权利要求1所述的方法,其特征在于,所述提取所述会议语音片段的片段语音特征,根据所述片段语音特征对所述会议语音片段进行聚类,确定相同发言人的会议语音片段,包括:
    通过预先训练的通用背景模型和高斯混合模型从所述会议语音片段中提取i-vector特征作为所述片段语音特征;
    采用预先训练的双协方差概率线性判别模型对所述i-vector特征建模,得到会议语音片段的特征表达模型;
    采用所述特征表达模型对所述会议语音片段进行聚类,确定所述相同发言人的会议语音片段。
  3. 根据权利要求1所述的方法,其特征在于,所述根据所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份,包括:
    在每一所述相同发言人的会议语音片段中各获取预设个数的会议语音片段,并进行展示;
    响应于所述展示,获取发言人身份确认指令,根据所述发言人身份确认指令确认所述预设个数的会议语音片段的发言人身份,得到第一确认结果;
    根据所述第一确认结果和所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份。
  4. 根据权利要求1所述的方法,其特征在于,所述根据所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份,还包括:
    在每一所述相同发言人的会议语音片段中各获取预设个数的会议语音片段,并输入到预先训练的声纹识别模型中;
    通过所述声纹识别模型识别所述预设个数的会议语音片段,确认所述预设个数的会议语音片段的发言人身份,得到第二确认结果;
    根据所述第二确认结果和所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份。
  5. 根据权利要求1所述的方法,其特征在于,所述根据所述发言人身份和所述相同发言人的会议语音片段区分会议内容,包括:
    将所述相同发言人的会议语音片段按所述发言人身份输入到语音转文本模型中,得到不同发言人的会议内容。
  6. 根据权利要求1-5任意一项所述的方法,其特征在于,在所述根据所述发言人身份和所述相同发言人的会议语音片段区分会议内容之后,还包括:
    采用预先训练的深度神经网络模型和神经语音模型对所述会议内容进行分析,生成会议纪要和/或执行列表。
  7. 根据权利要求1所述的方法,其特征在于,所述获取目标会议语音片段,包括:
    获取原始会议语音片段;
    采用静音检测去除所述原始会议语音片段中的静默片段,得到所述目标会议语音片段。
  8. 一种会议内容区分装置,其特征在于,所述装置包括:
    目标片段获取模块,用于获取目标会议语音片段,其中,所述目标会议语音片段包括至少两个不同发言人的会议语音片段;
    会议语音片段获取模块,用于获取所述目标会议语音片段的发言人转变点,根据所述发言人转变点切割所述目标会议语音片段,得到至少三个会议语音片段,其中,一个所述发言人对应一个或多个所述会议语音片段;
    相同发言人语音片段确定模块,用于提取所述会议语音片段的片段语音特征,根据所述片段语音特征对所述会议语音片段进行聚类,确定相同发言人的会议语音片段;
    发言人身份确定模块,用于根据所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份;
    区分模块,用于根据所述发言人身份和所述相同发言人的会议语音片段区分会议内容。
  9. 根据权利要求8所述的装置,其特征在于,所述发言人身份确定模块包括展示单元、第一确认结果获取单元和第一发言人身份确定单元:
    展示单元,用于在每一所述相同发言人的会议语音片段中各获取预设个数的会议语音片段,并进行展示;
    第一确认结果获取单元,用于响应于所述展示,获取发言人身份确认指令,根据所述发言人身份确认指令确认所述预设个数的会议语音片段的发言人身份,得到第一确认结果;
    第一发言人身份确定单元,用于根据所述第一确认结果和所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份。
  10. 根据权利要求8所述的装置,其特征在于,所述发言人身份确定模块还包括输入单元、第二确认结果获取单元和第二发言人身份确定单元:
    输入单元,用于在每一所述相同发言人的会议语音片段中各获取预设个数的会议语音片段,并输入到预先训练的声纹识别模型中;
    第二确认结果获取单元,用于通过所述声纹识别模型识别所述预设个数的会议语音片段,确认所述预设个数的会议语音片段的发言人身份,得到第二确认结果;
    第二发言人身份确定单元,用于根据所述第二确认结果和所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份。
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取目标会议语音片段,其中,所述目标会议语音片段包括至少两个不同发言人的会议语音片段;
    获取所述目标会议语音片段的发言人转变点,根据所述发言人转变点切割所述目标会议语音片段,得到至少三个会议语音片段,其中,一个所述发言人对应一个或多个所述会议语音片段;
    提取所述会议语音片段的片段语音特征,根据所述片段语音特征对所述会议语音片段进行聚类,确定相同发言人的会议语音片段;
    根据所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份;
    根据所述发言人身份和所述相同发言人的会议语音片段区分会议内容。
  12. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时,所述处理器还实现如下步骤:
    通过预先训练的通用背景模型和高斯混合模型从所述会议语音片段中提取i-vector特征作为所述片段语音特征;
    采用预先训练的双协方差概率线性判别模型对所述i-vector特征建模,得到会议语音片段 的特征表达模型;
    采用所述特征表达模型对所述会议语音片段进行聚类,确定所述相同发言人的会议语音片段。
  13. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时,所述处理器还实现如下步骤:
    在每一所述相同发言人的会议语音片段中各获取预设个数的会议语音片段,并进行展示;
    响应于所述展示,获取发言人身份确认指令,根据所述发言人身份确认指令确认所述预设个数的会议语音片段的发言人身份,得到第一确认结果;
    根据所述第一确认结果和所述相同发言人的会议语音片段确定所述会议语音片段的发言人身份。
  14. The computer device according to claim 11, wherein when executing the computer-readable instructions, the processor further implements the following steps:
    acquiring a preset number of conference speech segments from each set of conference speech segments of the same speaker, and inputting them into a pre-trained voiceprint recognition model;
    recognizing the preset number of conference speech segments through the voiceprint recognition model, and confirming the speaker identities of the preset number of conference speech segments to obtain a second confirmation result; and
    determining the speaker identities of the conference speech segments according to the second confirmation result and the conference speech segments of the same speaker.
  15. The computer device according to claim 11, wherein when executing the computer-readable instructions, the processor further implements the following step:
    inputting the conference speech segments of the same speaker into a speech-to-text model by speaker identity to obtain the conference content of different speakers.
  16. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring a target conference speech segment, wherein the target conference speech segment comprises conference speech segments of at least two different speakers;
    acquiring speaker transition points of the target conference speech segment, and cutting the target conference speech segment according to the speaker transition points to obtain at least three conference speech segments, wherein one speaker corresponds to one or more of the conference speech segments;
    extracting segment speech features of the conference speech segments, and clustering the conference speech segments according to the segment speech features to determine the conference speech segments of the same speaker;
    determining speaker identities of the conference speech segments according to the conference speech segments of the same speaker; and
    distinguishing the conference content according to the speaker identities and the conference speech segments of the same speaker.
  17. The non-volatile computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to further implement the following steps:
    extracting i-vector features from the conference speech segments as the segment speech features by means of a pre-trained universal background model and a Gaussian mixture model;
    modeling the i-vector features with a pre-trained two-covariance probabilistic linear discriminant analysis (PLDA) model to obtain a feature representation model of the conference speech segments; and
    clustering the conference speech segments with the feature representation model to determine the conference speech segments of the same speaker.
  18. The non-volatile computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to further implement the following steps:
    acquiring a preset number of conference speech segments from each set of conference speech segments of the same speaker, and presenting them;
    in response to the presentation, acquiring a speaker identity confirmation instruction, and confirming the speaker identities of the preset number of conference speech segments according to the speaker identity confirmation instruction to obtain a first confirmation result; and
    determining the speaker identities of the conference speech segments according to the first confirmation result and the conference speech segments of the same speaker.
  19. The non-volatile computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to further implement the following steps:
    acquiring a preset number of conference speech segments from each set of conference speech segments of the same speaker, and inputting them into a pre-trained voiceprint recognition model;
    recognizing the preset number of conference speech segments through the voiceprint recognition model, and confirming the speaker identities of the preset number of conference speech segments to obtain a second confirmation result; and
    determining the speaker identities of the conference speech segments according to the second confirmation result and the conference speech segments of the same speaker.
  20. The non-volatile computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to further implement the following step:
    inputting the conference speech segments of the same speaker into a speech-to-text model by speaker identity to obtain the conference content of different speakers.
PCT/CN2019/091098 2019-01-16 2019-06-13 Method and device for distinguishing conference content, computer equipment, and storage medium WO2020147256A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910038369.4 2019-01-16
CN201910038369.4A CN109960743A (zh) 2019-01-16 2019-01-16 Method and device for distinguishing conference content, computer equipment, and storage medium

Publications (1)

Publication Number Publication Date
WO2020147256A1 (zh)

Family

ID=67023487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091098 WO2020147256A1 (zh) 2019-01-16 2019-06-13 Method and device for distinguishing conference content, computer equipment, and storage medium

Country Status (2)

Country Link
CN (1) CN109960743A (zh)
WO (1) WO2020147256A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694650A (zh) * 2022-03-29 2022-07-01 青岛海尔科技有限公司 Control method and device for intelligent device, storage medium, and electronic device

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110544481B (zh) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method, device, and terminal equipment based on voiceprint recognition
CN110807370B (zh) * 2019-10-12 2024-01-30 南京星耀智能科技有限公司 Multimodal method for imperceptibly confirming the identity of a conference speaker
CN110827853A (zh) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voice feature information extraction method, terminal, and readable storage medium
CN111128253B (zh) * 2019-12-13 2022-03-01 北京小米智能科技有限公司 Audio clipping method and device
CN111798870A (zh) * 2020-09-08 2020-10-20 共道网络科技有限公司 Method, device, equipment, and storage medium for determining session stages
CN112053691B (zh) * 2020-09-21 2023-04-07 广州迷听科技有限公司 Conference assistance method and device, electronic equipment, and storage medium
CN112652313B (zh) * 2020-12-24 2023-04-07 北京百度网讯科技有限公司 Voiceprint recognition method, device, equipment, storage medium, and program product
CN113539269A (zh) * 2021-07-20 2021-10-22 上海明略人工智能(集团)有限公司 Audio information processing method, system, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040214558A1 (en) * 2001-12-19 2004-10-28 Bellsouth Intellectual Property Corporation Establishing a conference call from a call-log
CN107545898A * 2017-08-07 2018-01-05 清华大学 Processing method and device for distinguishing speakers' speech
CN107689225A * 2017-09-29 2018-02-13 福建实达电脑设备有限公司 Method for automatically generating meeting minutes
CN108766445A * 2018-05-30 2018-11-06 苏州思必驰信息科技有限公司 Voiceprint recognition method and system
CN108986826A * 2018-08-14 2018-12-11 中国平安人寿保险股份有限公司 Method for automatically generating meeting minutes, electronic device, and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030236663A1 (en) * 2002-06-19 2003-12-25 Koninklijke Philips Electronics N.V. Mega speaker identification (ID) system and corresponding methods therefor
CN102543063B (zh) * 2011-12-07 2013-07-24 华南理工大学 Multi-speaker speech rate estimation method based on speaker segmentation and clustering
CN103530432A (zh) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extraction function and speech extraction method
CN104021785A (zh) * 2014-05-28 2014-09-03 华南理工大学 Method for extracting the speech of the most important guest in a conference
CN108022583A (zh) * 2017-11-17 2018-05-11 平安科技(深圳)有限公司 Meeting minutes generation method, application server, and computer-readable storage medium
CN107967912B (zh) * 2017-11-28 2022-02-25 广州势必可赢网络科技有限公司 Human voice segmentation method and device
CN108922538B (zh) * 2018-05-29 2023-04-07 平安科技(深圳)有限公司 Conference information recording method and device, computer equipment, and storage medium

Also Published As

Publication number Publication date
CN109960743A (zh) 2019-07-02

Similar Documents

Publication Publication Date Title
WO2020147256A1 (zh) Method and device for distinguishing conference content, computer equipment, and storage medium
US11417343B2 (en) Automatic speaker identification in calls using multiple speaker-identification parameters
Anguera et al. Speaker diarization: A review of recent research
WO2020211354A1 (zh) Speaker identity recognition method and device based on speech content, and storage medium
US11854550B2 (en) Determining input for speech processing engine
WO2017084197A1 (zh) Smart home control method and system based on emotion recognition
WO2020098083A1 (zh) Call separation method and device, computer equipment, and storage medium
WO2021135685A1 (zh) Identity authentication method and device
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
WO2020147407A1 (zh) Conference record generation method and device, storage medium, and computer equipment
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
WO2020107834A1 (zh) Verification content generation method for lip-reading recognition and related device
US11756572B2 (en) Self-supervised speech representations for fake audio detection
CN110634472A (zh) Speech recognition method, server, and computer-readable storage medium
WO2023088448A1 (zh) Speech processing method, device, and storage medium
WO2020019831A1 (zh) Method for identifying specific populations, electronic device, and computer-readable storage medium
Zhang et al. Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features.
Huang et al. Detecting the instant of emotion change from speech using a martingale framework
CN113744742B (zh) Role recognition method, device, and system in dialogue scenarios
JP4143541B2 (ja) Method and system for non-intrusive speaker verification using behavior models
Akinrinmade et al. Creation of a Nigerian voice corpus for indigenous speaker recognition
CN114138960A (zh) User intention recognition method, device, equipment, and medium
JP2011191542A (ja) Speech classification device, speech classification method, and speech classification program
Primorac et al. Audio-visual biometric recognition via joint sparse representations
Sailaja et al. Text Independent Speaker Identification Using Finite Doubly Truncated Gaussian Mixture Model

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 19910306
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 19910306
    Country of ref document: EP
    Kind code of ref document: A1